Incorrect categorical split rule derivation from xgboost model built on sparse data

holub008 commented 5 years ago

See: https://github.com/dmlc/xgboost/issues/1112

When the 'sparse' parameter is set in xrf() (so a sparse design matrix is used), xgboost treats zero entries (i.e. the unstored, sparse regions) as missing. It then reports splits on one-hot encoded categorical features at values outside the range [0, 1], i.e. the rule is always true and doesn't contribute a signal to the model.
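To make the failure mode concrete, here is a minimal sketch on toy data (the vector `x` and both thresholds are illustrative and not taken from the census set). It assumes the Matrix package is installed, which is the case for standard R distributions. It shows that a sparse column stores only its non-zero entries, and that a threshold just below zero cannot separate the two levels of a 0/1 indicator, while a threshold strictly between 0 and 1 can.

```r
library(Matrix)

# Toy one-hot indicator column (hypothetical data)
x <- c(1, 0, 0, 1, 0)

# Build a 5 x 1 sparse column; only the non-zero entries are stored
nz <- which(x != 0)
x_sparse <- sparseMatrix(i = nz, j = rep(1, length(nz)), x = x[nz],
                         dims = c(length(x), 1))
n_stored <- length(x_sparse@x)

# A threshold just below zero: every observed value satisfies x >= threshold,
# so a rule built on it is constant
threshold_bad <- -0.000000954
rule_bad <- x >= threshold_bad

# A threshold strictly inside (0, 1) separates the two levels
threshold_ok <- 0.5
rule_ok <- x >= threshold_ok

print(n_stored)
print(table(rule_bad))
print(table(rule_ok))
```

Because `rule_bad` takes the same value for every row, a GLM term built from it carries no information, which matches the behaviour described above.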

This bug is watering down the quality of the fitted GLM.

Example:

library(RCurl)
library(xrf)

# grabbing data from uci
census_income_text <- getURL('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data')
census_income <- read.csv(textConnection(census_income_text), header=F, stringsAsFactors = F)
colnames(census_income) <- c('age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
                            'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
                            'hours_per_week', 'native_country', 'above_50k')

m_xrf <- xrf(above_50k ~ ., census_income, family = 'binomial', 
             xgb_control = list(nrounds = 100, max_depth = 3))

head(m_xrf$rules)

produces:

 split_id rule_id feature                                   split less_than
  <chr>    <chr>   <chr>                                     <dbl> <lgl>    
1 0-0      r0_11   marital_status Married-civ-spouse  -0.000000954 FALSE    
2 0-0      r0_12   marital_status Married-civ-spouse  -0.000000954 FALSE    
3 0-0      r0_13   marital_status Married-civ-spouse  -0.000000954 FALSE    
4 0-0      r0_14   marital_status Married-civ-spouse  -0.000000954 FALSE    
5 0-2      r0_11   education_num                      12.5         TRUE     
6 0-2      r0_12   education_num                      12.5         TRUE

where -0.000000954 is clearly not a meaningful split for the one-hot encoded "marital_status" feature.
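One way to spot such rules is to flag any split on a known one-hot feature whose threshold cannot separate 0 from 1. The sketch below is an illustration only: the `rules` data frame is retyped by hand from the output above, `is_degenerate_split` and `one_hot_features` are hypothetical names that are not part of xrf, and it assumes `less_than = TRUE` means "feature < split".

```r
# First rows of the rules table shown above, retyped by hand
rules <- data.frame(
  rule_id = c("r0_11", "r0_12", "r0_13", "r0_14", "r0_11", "r0_12"),
  feature = c(rep("marital_status Married-civ-spouse", 4),
              rep("education_num", 2)),
  split = c(rep(-0.000000954, 4), 12.5, 12.5),
  less_than = c(rep(FALSE, 4), TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Assumption: the caller knows which columns are one-hot indicators
one_hot_features <- c("marital_status Married-civ-spouse")

# Hypothetical helper: under "x < split" semantics, a split on a 0/1
# indicator separates the two levels only when 0 < split <= 1
is_degenerate_split <- function(feature, split, one_hot) {
  feature %in% one_hot & (split <= 0 | split > 1)
}

rules$degenerate <- is_degenerate_split(rules$feature, rules$split,
                                        one_hot_features)
print(rules[rules$degenerate, c("rule_id", "feature", "split")])
```

On these rows, the four marital_status rules are flagged and the education_num rules are not.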

holub008 commented 5 years ago

Note this IS fixed by setting sparse=FALSE in xrf() (default TRUE):

> m_xrf$rules
# A tibble: 1,996 x 5
   split_id rule_id feature                            split less_than
   <chr>    <chr>   <chr>                              <dbl> <lgl>    
 1 0-0      r0_7    marital_status Married-civ-spouse    0.5 TRUE     
 2 0-0      r0_8    marital_status Married-civ-spouse    0.5 TRUE     
 3 0-0      r0_9    marital_status Married-civ-spouse    0.5 TRUE     
 4 0-0      r0_10   marital_status Married-civ-spouse    0.5 TRUE     
 5 0-0      r0_11   marital_status Married-civ-spouse    0.5 FALSE    
 6 0-0      r0_12   marital_status Married-civ-spouse    0.5 FALSE    
 7 0-0      r0_13   marital_status Married-civ-spouse    0.5 FALSE    
 8 0-0      r0_14   marital_status Married-civ-spouse    0.5 FALSE    
 9 0-1      r0_7    capital_gain                      7074.  TRUE     
10 0-1      r0_8    capital_gain                      7074.  TRUE

holub008 commented 5 years ago

Closed by https://github.com/holub008/xrf/pull/6

holub008 commented 5 years ago

@yama1968 FYI, probably worth updating your installation with this change.

yama1968 commented 5 years ago

Done it, thanks! Yannick

On Thu, Apr 18, 2019 at 06:18, Karl Holub notifications@github.com wrote:

@yama1968 https://github.com/yama1968 FYI, probably worth updating your installation for with this change.


holub008 / xrf

Incorrect categorical split rule derivation from xgboost model built on sparse data #2