Incorrect prediction with new levels due to invalid memory addresses

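library(gbm)

# 20 distinct factor levels; the response equals the level index.
d <- data.frame(x=as.factor(1:20), y=1:20)

# Levels 1-10 are used for training; levels 11-20 appear only in the test data.
train <- d[1:10,]
test  <- d[11:20,]

p <- rep(0, 10)

# Refit until at least one test prediction is nonzero, i.e. until the bug shows up.
while(sum(abs(p)) == 0)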
{
    g <- gbm(y ~ x,
             distribution="gaussian",
             bag.fraction=1,
             data=train, 
             n.trees=1,
             shrinkage=1,
             n.minobsinnode=1)

    # Subtract the initial value so that p holds only the tree's contribution.
    p <- predict(g, newdata=test, n.trees=1) - g$initF
}

pretty.gbm.tree(g, 1)
#  SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight Prediction
#0        0           0.0        1         2           3           62.5     10        0.0
#1       -1          -2.5       -1        -1          -1            0.0      5       -2.5
#2       -1           2.5       -1        -1          -1            0.0      5        2.5
#3       -1           0.0       -1        -1          -1            0.0     10        0.0

g$c.splits[[1]]
#[1] -1 -1 -1 -1 -1  1  1  1  1  1

print(p)
#[1] 0.0 2.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

In this example the test data contain 10 levels that were not seen during training. All predictions should be 0.0 (the prediction of the missing node, node 3 above), because according to gbmentry.cpp the intent is to treat new levels as missing:

iCatSplitIndicator = INTEGER(
            VECTOR_ELT(rCSplits,
                       (int)adSplitCode[iCurrentNode]))[(int)dX];
if(iCatSplitIndicator==-1)
{
   iCurrentNode = aiLeftNode[iCurrentNode];
}
else if(iCatSplitIndicator==1)
{
   iCurrentNode = aiRightNode[iCurrentNode];
}
else // categorical level not present in training
{
   iCurrentNode = aiMissingNode[iCurrentNode];
}

The problem is that INTEGER(VECTOR_ELT(rCSplits, (int)adSplitCode[iCurrentNode])) has length equal to the number of levels in the training data (10 here, as g$c.splits[[1]] shows), yet the program reads positions 11, 12, ..., 20, because these are the values of (int)dX in the test data. This is easy to verify by adding a printf. Surprisingly, there is no segfault. The values stored at the addresses immediately after the array are usually neither -1 nor 1, so the program usually falls through to the missing node as intended. If one of those values happens to be -1 or 1, however, the record is scored at the left or right child respectively.

The following diagram illustrates the out-of-bounds read:

                                    |-> shouldn't be accessing here
                                    |
[-1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 3245, 1, 64, 2342, 93348, -34857, 82, -8634, 9, 239]
                                          ^by chance this is a 1, so level 12 goes to 
                                           the right child instead of the missing node
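
One possible fix is to compare the level code against the length of the split-indicator vector before indexing into it. The sketch below is a standalone C++ illustration, not a patch to gbmentry.cpp: the names SafeCatSplitIndicator and RouteCategorical are hypothetical, std::vector stands in for the R integer vector (in the real code the length would presumably come from something like LENGTH() on the VECTOR_ELT result), and it assumes dX is finite and is used directly as the index, as in the snippet above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of gbm. Returns the split indicator for a
// categorical level, or 0 when the level code lies outside the indicator
// vector, i.e. the level was not present in the training data.
int SafeCatSplitIndicator(const std::vector<int>& splitIndicators, double dX)
{
    const int level = static_cast<int>(dX); // assumes dX is finite
    if (level < 0 ||
        static_cast<std::size_t>(level) >= splitIndicators.size())
    {
        return 0; // out of range: never read past the end of the vector
    }
    return splitIndicators[static_cast<std::size_t>(level)];
}

// Mirrors the three-way branch quoted from gbmentry.cpp, but uses the
// bounds-checked lookup so that unseen levels always reach the missing node.
int RouteCategorical(const std::vector<int>& splitIndicators, double dX,
                     int leftNode, int rightNode, int missingNode)
{
    const int indicator = SafeCatSplitIndicator(splitIndicators, dX);
    if (indicator == -1)
    {
        return leftNode;
    }
    if (indicator == 1)
    {
        return rightNode;
    }
    return missingNode; // categorical level not present in training
}
```

With the ten indicators from g$c.splits[[1]] and node ids 1, 2 and 3 as in the tree printed above, any index of 10 or more is routed to node 3 regardless of what happens to be stored after the array.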

harrysouthworth / gbm, issue #46