Closed by holub008 5 years ago
Note this IS fixed by setting sparse=FALSE in xrf() (default TRUE):
> m_xrf$rules
# A tibble: 1,996 x 5
   split_id rule_id feature                           split less_than
   <chr>    <chr>   <chr>                             <dbl> <lgl>
 1 0-0      r0_7    marital_status Married-civ-spouse   0.5 TRUE
 2 0-0      r0_8    marital_status Married-civ-spouse   0.5 TRUE
 3 0-0      r0_9    marital_status Married-civ-spouse   0.5 TRUE
 4 0-0      r0_10   marital_status Married-civ-spouse   0.5 TRUE
 5 0-0      r0_11   marital_status Married-civ-spouse   0.5 FALSE
 6 0-0      r0_12   marital_status Married-civ-spouse   0.5 FALSE
 7 0-0      r0_13   marital_status Married-civ-spouse   0.5 FALSE
 8 0-0      r0_14   marital_status Married-civ-spouse   0.5 FALSE
 9 0-1      r0_7    capital_gain                      7074. TRUE
10 0-1      r0_8    capital_gain                      7074. TRUE
Closed by https://github.com/holub008/xrf/pull/6
@yama1968 FYI, it's probably worth updating your installation to pick up this change.
Done it, thanks! Yannick
On Thu, Apr 18, 2019 at 06:18, Karl Holub notifications@github.com wrote:
@yama1968 FYI, probably worth updating your installation with this change.
See: https://github.com/dmlc/xgboost/issues/1112
When the 'sparse' parameter of xrf() is enabled (i.e. a sparse design matrix is used), xgboost treats zero entries (i.e. the sparse regions) as missing. It then reports splits on one-hot encoded categorical features at values outside the range [0, 1], i.e. the rule is always true and doesn't contribute any signal to the model.
This bug is watering down the quality of the fitted GLM.
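To illustrate why zeros can look like missing values, here is a minimal sketch of how a sparse design matrix stores a one-hot encoded factor. It assumes the Matrix package is installed; the toy data frame and the "Other" level are hypothetical and not taken from the report. It only shows the storage difference, not xgboost's behaviour itself.

```r
library(Matrix)  # assumption: Matrix (a recommended R package) is installed

# Hypothetical toy data: one categorical feature with two levels.
df <- data.frame(
  marital_status = factor(c("Married-civ-spouse", "Other",
                            "Other", "Married-civ-spouse"))
)

# Dense one-hot encoding: every cell, including the zeros, is stored.
dense_x <- model.matrix(~ marital_status - 1, df)

# Sparse one-hot encoding: only the non-zero cells are stored explicitly.
sparse_x <- sparse.model.matrix(~ marital_status - 1, df)

n_dense_cells <- length(dense_x)     # all cells of the 4 x 2 matrix
n_sparse_cells <- nnzero(sparse_x)   # explicitly stored non-zero entries

cat("dense cells stored: ", n_dense_cells, "\n")
cat("sparse cells stored:", n_sparse_cells, "\n")
```

The zeros of the indicator columns are simply absent from the sparse representation, which is consistent with them being treated as missing rather than as the value 0.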
Example:
produces:
where -.00000000954 is clearly not a meaningful split for the one-hot encoded "marital_status" feature.
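A rough way to spot such rules is to flag any split on a one-hot feature that cannot separate 0 from 1. The sketch below uses a hypothetical, hand-built table with the same column names as m_xrf$rules; the rows, the one_hot_features vector and the degenerate flag are illustrative assumptions, not part of xrf.

```r
# Hypothetical rules table mirroring the columns of m_xrf$rules.
rules <- data.frame(
  rule_id   = c("r0_7", "r0_8", "r0_9"),
  feature   = c("marital_status", "marital_status", "capital_gain"),
  split     = c(0.5, -0.00000000954, 7074),
  less_than = c(TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Assumption: we know which features are one-hot indicators (values 0 or 1).
one_hot_features <- c("marital_status")
is_one_hot <- rules$feature %in% one_hot_features

# For x in {0, 1}, the comparison x < split takes both values only when
# 0 < split <= 1. Any other split makes the rule constant on that feature.
degenerate <- is_one_hot & (rules$split <= 0 | rules$split > 1)

flagged <- rules$rule_id[degenerate]
print(flagged)
```

Rules flagged this way are constant with respect to that feature, so they add columns to the GLM without adding signal.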