holub008 / xrf

eXtreme RuleFit (sparse linear models on XGBoost ensembles)

A correction for XGBoost classifying sparse one-hot 0s as "Missing" #6

Closed (holub008 closed this 5 years ago)

holub008 commented 5 years ago

See https://github.com/holub008/xrf/issues/2 for details.

This change fixes the issue by building feature metadata (from the input training data) and then parsing out the model matrix factor levels as used by XGBoost. If a split feature corresponds to a categorical (one-hot encoded) level, we reassign its split point to 0.5, matching the dense behavior.
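For illustration, a minimal sketch of the reassignment step in R; the function and column names here (reassign_categorical_splits, feature, split_point, dummy_columns) are hypothetical and not the package's actual internals:

reassign_categorical_splits <- function(splits, dummy_columns) {
  # splits: data frame of tree splits with columns `feature` and `split_point`
  # dummy_columns: model matrix columns that are one-hot factor levels
  is_dummy <- splits$feature %in% dummy_columns
  # on a sparse model matrix, XGBoost routes the one-hot 0s through the "missing"
  # branch; forcing the split point to 0.5 recovers the dense-matrix rule
  # "level present vs. absent"
  splits$split_point[is_dummy] <- 0.5
  splits
}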

I tested this change with the following (census_income and auc are defined in the README).

Timing:

system.time(m_xrf <- xrf(above_50k ~ ., census_income, family = 'binomial', 
             xgb_control = list(nrounds = 100, max_depth = 3)))
  user  system elapsed 
119.294   8.508 127.844 
system.time(m_xrf_dense <- xrf(above_50k ~ ., census_income, family = 'binomial', 
             xgb_control = list(nrounds = 100, max_depth = 3), sparse = FALSE))
   user  system elapsed 
286.516   2.099 288.708 

Correctness: making sure the sparse and dense models are approximately equal. There were two split points present in the dense model, around the 125th split, that weren't in the sparse model, leading to non-equality; they looked benign. Comparing the first 100 rules:

all(m_xrf$rules[1:100,] == m_xrf_dense$rules[1:100,])             
[1] TRUE
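To locate where the two rule tables diverge (e.g. the differing splits around row 125 mentioned above), a quick hedged sketch, assuming the rule tables are row-aligned data frames of the same shape over the compared range:

n <- min(150, nrow(m_xrf$rules), nrow(m_xrf_dense$rules))
# row indices among the first n rules where any column differs
differs <- which(rowSums(m_xrf$rules[1:n, ] != m_xrf_dense$rules[1:n, ]) > 0)
m_xrf$rules[differs, ]
m_xrf_dense$rules[differs, ]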

And checking model accuracy (on the training set only):

auc(predict(m_xrf, census_income), census_income$above_50k == ' >50K')
0.9417397

auc(predict(m_xrf_dense, census_income), census_income$above_50k == ' >50K')
0.9433681