In this example the test data has 10 levels unseen during training. The predictions should all be 0.0 since (according to gbmentry.cpp) the intent is that new levels are treated as missing:
The problem is that INTEGER(VECTOR_ELT(rCSplits, (int)adSplitCode[iCurrentNode])) has length equal to the number of levels in the training data, yet the program reads values at positions 11, 12, ..., 20, since these are the values of (int)dX in the test data. This is easy to verify by adding a printf. Surprisingly, there is no segfault. The values at the addresses immediately following INTEGER(VECTOR_ELT(rCSplits, (int)adSplitCode[iCurrentNode])) are generally not equal to -1 or 1, so in most cases the program correctly uses the missing node. However, if one of those values happens to be -1 or 1, the record is scored at the left or right child respectively.
A diagram makes this clearer:
                                    |-> shouldn't be accessing here
                                    |
[-1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 3245, 1, 64, 2342, 93348, -34857, 82, -8634, 9, 239]
                                          ^ by chance this is a 1, so level 12 goes to
                                            the right child instead of the missing node
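One possible way to avoid the out-of-bounds read is to check the level code against the length of the split vector before indexing into it. The following is only a minimal standalone sketch, not the actual gbm code: `Child`, `routeCategorical`, and `splitDirections` are hypothetical names, with `splitDirections` standing in for INTEGER(VECTOR_ELT(rCSplits, (int)adSplitCode[iCurrentNode])). It assumes, as described above, that -1 means left, 1 means right, and anything else means the missing node.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the three places a record can be routed to.
enum class Child { Left, Right, Missing };

// splitDirections holds one entry per level seen in training
// (assumed encoding: -1 = left child, 1 = right child).
// dX is the categorical level code of the record being scored.
Child routeCategorical(const std::vector<int>& splitDirections, double dX)
{
    // A genuinely missing value goes to the missing node.
    if (std::isnan(dX)) {
        return Child::Missing;
    }

    // Levels outside the range seen in training are treated as missing,
    // so the vector is never indexed past its end.
    if (dX < 0.0 || dX >= static_cast<double>(splitDirections.size())) {
        return Child::Missing;
    }

    const int direction = splitDirections[static_cast<std::size_t>(dX)];
    if (direction == -1) {
        return Child::Left;
    }
    if (direction == 1) {
        return Child::Right;
    }
    return Child::Missing;
}
```

With the ten training levels from the diagram, any level code of 10 or above is routed to the missing node regardless of what happens to sit in memory after the vector.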