imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/
772 stars 193 forks source link

Bad length read when loading a saved ranger_out.forest file #566

Closed georgeslabreche closed 3 years ago

georgeslabreche commented 3 years ago

I'm using the cpp version. Training with the following command:

./ranger --verbose --file training_data.csv --depvarname LABEL --treetype 1 --ntree 10 --write

And predicting with:

./ranger --verbose --file test_data.csv --predict ranger_out.forest

The prediction seems to work but there's a silent failure going on with reading ranger_out.forest when fetching the length value here:

The first 10 lines of hexdump on ranger_out.forest gives the following:

0000000: 0001 0000 0005 0000 0000 0000 414c 4542
0000010: 0a4c 0000 0000 0000 0600 0000 0000 0000
0000020: 0100 0101 0101 0601 0000 0000 0000 0100
0000030: 0000 0200 0000 0000 0000 0000 0000 0000
0000040: f000 003f 0000 0000 0000 0200 0000 0000
0000050: 0000 1900 0000 0000 0000 0100 0000 0000
0000060: 0000 0300 0000 0000 0000 0000 0000 0000
0000070: 0000 0500 0000 0000 0000 0700 0000 0000
0000080: 0000 0900 0000 0000 0000 0000 0000 0000
0000090: 0000 0000 0000 0000 0000 0000 0000 0000

Here is a zip containing the csv and ranger_out.forest files: ranger_georges.zip

Despite this the fetched tree type is still 1 so it's a silent failure. However, when I compile for an ARM32 environment it seems like the proper length value is fetched (?) but this leads to an erroneous tree type value being read:

Here's the first 10 lines of the hexdump of the ranger_out.forest file produced in the ARM32 environment:

0000000 0001 0000 0005 0000 414c 4542 0a4c 0000
0000010 0600 0000 0100 0101 0101 0601 0000 0100
0000020 0000 0200 0000 0000 0000 0000 f000 003f
0000030 0000 0000 0000 0200 0000 0d00 0000 0100
0000040 0000 0300 0000 0000 0000 0500 0000 0700
0000050 0000 0900 0000 0000 0000 0b00 0000 0000
0000060 0000 0000 0000 0000 0000 0000 0000 0000
0000070 0000 0d00 0000 0200 0000 0400 0000 0000
0000080 0000 0600 0000 0800 0000 0a00 0000 0000
0000090 0000 0c00 0000 0000 0000 0000 0000 0000
georgeslabreche commented 3 years ago

I've investigated this a bit further and I believe that the file read process in ArgumentHandler::checkArguments() to fetch and set the treetype value isn't representative of how the file is saved: https://github.com/imbs-hl/ranger/blob/dc9323f57e05d65aa4d51d60b0221d1d770ac4f4/cpp_version/src/utility/ArgumentHandler.cpp#L457-L473

Here's how the file is actually saved.

This can cause treetype to be set to a wrong value which will at best cause a segmentation fault and at worst silently fail. Also, it's a bit odd that we would set a value in a function that is only meant to check arguments. If anything we should only check if the given treetype argument matches the value saved in file? Then again, why would we want to give the treetype as an argument and not just always fetch it from the saved file?

mnwright commented 3 years ago

Thanks! Could you confirm that #570 fixes the issue?

Also, it's a bit odd that we would set a value in a function that is only meant to check arguments.

True, that's bad design and probably the reason for introducing the bug in the first place (didn't expect reading from the file here).

Then again, why would we want to give the treetype as an argument and not just always fetch it from the saved file?

We use the same checkArguments() etc. for training and prediction. In training we want to set the treetype, in prediction we want to read it from the file.

In any case, I'll keep this one open because we should improve the design here.

georgeslabreche commented 3 years ago

Thanks for looking into this @mnwright. It runs successfully on my end so it looks good to me!