holub008 / xrf

eXtreme RuleFit (sparse linear models on XGBoost ensembles)

A correction for XGBoost classifying sparse one-hot 0s as "Missing" #6

Closed (holub008 closed this 5 years ago)

holub008 commented 5 years ago

See https://github.com/holub008/xrf/issues/2 for details.

This change fixes the issue by building feature metadata (from the input training data) and then parsing out the model matrix factor levels as used by XGBoost. If a split feature corresponds to a categorical (one-hot encoded) level, we reassign its split point to 0.5, matching the dense behavior.
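For illustration, a minimal sketch of the reassignment step in R; the function and column names here (reassign_categorical_splits, feature, split_point, dummy_columns) are hypothetical and not the package's actual internals:

reassign_categorical_splits <- function(splits, dummy_columns) {
  # splits: data frame of tree splits with columns `feature` and `split_point`
  # dummy_columns: model matrix columns that are one-hot factor levels
  is_dummy <- splits$feature %in% dummy_columns
  # on a sparse model matrix, XGBoost routes the one-hot 0s through the "missing"
  # branch; forcing the split point to 0.5 recovers the dense-matrix rule
  # "level present vs. absent"
  splits$split_point[is_dummy] <- 0.5
  splits
}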

I tested this change with the following (census_income and auc are defined in the README).

Timing:

system.time(m_xrf <- xrf(above_50k ~ ., census_income, family = 'binomial', 
             xgb_control = list(nrounds = 100, max_depth = 3)))
  user  system elapsed 
119.294   8.508 127.844 
system.time(m_xrf_dense <- xrf(above_50k ~ ., census_income, family = 'binomial', 
             xgb_control = list(nrounds = 100, max_depth = 3), sparse = FALSE))
   user  system elapsed 
286.516   2.099 288.708 

Correctness: making sure the sparse and dense models are approximately equal. There were two split points present in the dense model, around the 125th split, that weren't in the sparse model, leading to non-equality; they looked benign. Comparing the first 100 rules:

all(m_xrf$rules[1:100,] == m_xrf_dense$rules[1:100,])             
[1] TRUE
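To locate where the two rule tables diverge (e.g. the differing splits around row 125 mentioned above), a quick hedged sketch, assuming the rule tables are row-aligned data frames of the same shape over the compared range:

n <- min(150, nrow(m_xrf$rules), nrow(m_xrf_dense$rules))
# row indices among the first n rules where any column differs
differs <- which(rowSums(m_xrf$rules[1:n, ] != m_xrf_dense$rules[1:n, ]) > 0)
m_xrf$rules[differs, ]
m_xrf_dense$rules[differs, ]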

And checking model accuracy (on the training set only):

auc(predict(m_xrf, census_income), census_income$above_50k == ' >50K')
0.9417397

auc(predict(m_xrf_dense, census_income), census_income$above_50k == ' >50K')
0.9433681