This change fixes the issue by building feature metadata (from the input training data) and then parsing out the model matrix factor levels as used by XGBoost. If a split falls on a categorical factor level, we reassign the split point to 0.5 (matching dense behavior).
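The idea above can be sketched in base R. This is a minimal illustration, not the code in this change: `build_feature_metadata`, `correct_splits`, the toy `train` data, and the example split values are all hypothetical names and numbers made up for the sketch.

```r
# Toy training data with one numeric and one categorical feature (hypothetical).
train <- data.frame(
  age = c(25, 40, 58, 33),
  workclass = factor(c("private", "gov", "private", "self"))
)

# Map every model matrix column back to its source feature and record
# whether that feature is categorical (a factor).
build_feature_metadata <- function(data) {
  mm <- model.matrix(~ . - 1, data)
  term_labels <- attr(terms(~ . - 1, data = data), "term.labels")
  source_feature <- term_labels[attr(mm, "assign")]
  is_factor <- vapply(data, is.factor, logical(1))
  data.frame(
    column = colnames(mm),
    feature = source_feature,
    is_categorical = unname(is_factor[source_feature]),
    stringsAsFactors = FALSE
  )
}

# Reassign splits on categorical indicator columns to 0.5 and leave
# numeric splits untouched. Unmatched columns are left as they are.
correct_splits <- function(splits, metadata) {
  is_cat <- metadata$is_categorical[match(splits$column, metadata$column)]
  is_cat <- !is.na(is_cat) & is_cat
  splits$split[is_cat] <- 0.5
  splits
}

# Made-up splits standing in for what a sparse fit might report.
splits <- data.frame(
  column = c("age", "workclassgov"),
  split = c(36.5, 1e-05),
  stringsAsFactors = FALSE
)
metadata <- build_feature_metadata(train)
corrected <- correct_splits(splits, metadata)
```

Since an indicator column only takes the values 0 and 1, any threshold strictly between them partitions the rows identically; 0.5 is simply the value the dense path produces.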
I tested this change with the following (census_income and auc are defined in the README):
timing
# sparse fit (no sparse argument given)
system.time(m_xrf <- xrf(above_50k ~ ., census_income, family = 'binomial',
                         xgb_control = list(nrounds = 100, max_depth = 3)))
   user  system elapsed
119.294   8.508 127.844
# dense fit
system.time(m_xrf_dense <- xrf(above_50k ~ ., census_income, family = 'binomial',
                               xgb_control = list(nrounds = 100, max_depth = 3), sparse = FALSE))
   user  system elapsed
286.516   2.099 288.708
correctness
Making sure the sparse and dense models are approximately equal. Two split points present in the dense model, at around the 125th split, were missing from the sparse model, so the two are not exactly equal; the difference looked benign.
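One way to run this kind of comparison is to key each split by feature and rounded split point and take set differences. The sketch below uses toy split tables; `sparse_splits`, `dense_splits`, and `split_key` are hypothetical, and the values are made up rather than taken from the fitted models.

```r
# Toy split tables; the columns and values are made up for illustration.
sparse_splits <- data.frame(
  feature = c("age", "workclassgov", "hours"),
  split = c(36.5, 0.5, 41.5),
  stringsAsFactors = FALSE
)
dense_splits <- data.frame(
  feature = c("age", "workclassgov", "hours", "age"),
  split = c(36.5, 0.5, 41.5, 52.5),
  stringsAsFactors = FALSE
)

# Key each split by feature and rounded split point so that tiny floating
# point differences do not register as mismatches.
split_key <- function(splits, digits = 6) {
  paste(splits$feature, signif(splits$split, digits), sep = " @ ")
}

# Splits that appear in one model but not the other.
only_in_dense <- setdiff(split_key(dense_splits), split_key(sparse_splits))
only_in_sparse <- setdiff(split_key(sparse_splits), split_key(dense_splits))
```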
See https://github.com/holub008/xrf/issues/2 for details.
And checking model accuracy (on the training set only):
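For reference, a rank-based AUC can be computed in a few lines of base R. This is a sketch under assumptions: `auc_sketch` is a hypothetical stand-in and not necessarily how the README defines `auc`, and the inputs below are toy values rather than model output.

```r
# Rank-based (Mann-Whitney) AUC: the probability that a randomly chosen
# positive case receives a higher score than a randomly chosen negative one.
auc_sketch <- function(prediction, actual) {
  stopifnot(length(prediction) == length(actual))
  actual <- as.logical(actual)
  n_pos <- as.numeric(sum(actual))
  n_neg <- as.numeric(sum(!actual))
  stopifnot(n_pos > 0, n_neg > 0)
  ranks <- rank(prediction)  # ties receive average ranks
  (sum(ranks[actual]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy scores and labels, made up for illustration.
toy_auc <- auc_sketch(c(0.1, 0.4, 0.35, 0.8), c(0, 0, 1, 1))
```

The same function could be applied to the predicted probabilities of each model against the observed labels; scoring on the training data only checks that the two fits agree, not how well they generalize.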