holub008 / xrf

eXtreme RuleFit (sparse linear models on XGBoost ensembles)

Representation of rules involving binary features? #17

Closed rjordan-hdai closed 2 years ago

rjordan-hdai commented 3 years ago

I'm using xrf on a problem that includes binary (0 or 1) predictor variables. The rules involving the binary features all seem to take the form X_i >= -0.0000009537. Since the features only take the values 0 and 1, such a rule is vacuous if taken at face value: it is always satisfied. I'm assuming this is some kind of numerical rounding issue and the rule really means X_i > 0 (i.e., X_i = 1). Is that correct?
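
For concreteness, a quick sketch (not from the original report) of why such a threshold is vacuous on a 0/1 feature:

```r
x <- c(0, 1)

# A near-zero negative threshold is satisfied by every possible value,
# so the rule carries no information:
all(x >= -0.0000009537)  # TRUE

# A meaningful split on a binary feature looks like x >= 0.5, i.e. x == 1:
x >= 0.5  # FALSE TRUE
```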

holub008 commented 3 years ago

Thanks for the report! Could you provide a reproducible example? A simple case that does split like we'd expect:

  library(xrf)

  set.seed(55414)
  x1 <- rbinom(100, 1, .7)
  y <- rnorm(100, 0, .5) + x1
  dat <- data.frame(y, x1)
  mod <- xrf(y ~ x1, data = dat,
             xgb_control = list(nrounds = 5, max_depth = 2),
             family = "gaussian", sparse = FALSE)
> mod$rules
# A tibble: 2 x 5
  split_id rule_id feature split less_than
  <chr>    <chr>   <chr>   <dbl> <lgl>    
1 0-0      r0_1    x1        0.5 TRUE     
2 0-0      r0_2    x1        0.5 FALSE    
holub008 commented 3 years ago

I vaguely remember encountering this in the past (rules that always evaluate to true on the train set), but it was so infrequent and didn't appear to impact model accuracy that I chalked it up as xgboost weirdness and moved on. If all your features are getting this treatment, that's more concerning.

rjordan-hdai commented 3 years ago

The data I'm working with is proprietary; I may be able to remove identifiers and such. It's pretty large: hundreds of thousands of rows and ~1000 features. The issue is that the glmnet model is giving considerable weight (relatively large coefficients) to some of these rules, and the question is whether they are legitimate rules (with X_i > 0, even though the rule as represented is X_i >= a very small negative number). I should add that these are occurring in compound rules (involving more than one feature).

holub008 commented 3 years ago

Here's where I would start with your data:

  1. Verify that your binary features are in fact binary (0/1 only, no NAs or other values)
  2. Fit an xgboost model and look at diagnostics.
    • What's the weight/cover/gain on the trouble features (https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27)? Are they providing signal?
    • Are the trees it's fitting including these trouble features at the bottom of the trees? I'm not familiar with the implementation details of xgb, but it's possible that your xgb parameters are forcing it to identify splits at max_depth (which may be too deep for some trees in the ensemble), where there's no longer a signal to fit on.
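
A sketch of those two checks, assuming `X` is a numeric feature matrix and `y` the response (both hypothetical names):

```r
library(xgboost)

# 1. Verify the suspect features are strictly 0/1 with no NAs:
binary_ok <- apply(X, 2, function(col) all(col %in% c(0, 1)))
which(!binary_ok)  # columns that are not strictly binary

# 2. Fit xgboost directly and inspect importance and tree structure:
mod <- xgboost(data = X, label = y, nrounds = 50, max_depth = 4, verbose = 0)
xgb.importance(model = mod)             # gain/cover/frequency per feature
trees <- xgb.model.dt.tree(model = mod) # one row per tree node, with thresholds

# Look for near-zero negative split thresholds on non-leaf nodes:
subset(trees, Feature != "Leaf" &
              as.numeric(Split) < 0 & as.numeric(Split) > -1e-5)
```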

You'll note that my suspicion is that the weirdness is originating in xgboost, and xrf is simply passing along these bogus rules (bundled with other useful ones) to glmnet. Perhaps the quickest way to verify this would be to reduce your max_depth to 1.
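
For instance, a hypothetical refit reusing the names from the example above (`dat` stands in for your real data):

```r
library(xrf)

# Refit with stumps only (max_depth = 1). If the near-zero negative
# thresholds disappear from the rule set, that supports the
# too-deep-trees explanation.
mod_shallow <- xrf(y ~ ., data = dat,
                   xgb_control = list(nrounds = 5, max_depth = 1),
                   family = "gaussian", sparse = FALSE)

# Any remaining rules with a vacuous threshold on a binary feature?
subset(mod_shallow$rules, split < 0 & split > -1e-5)
```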

rjordan-hdai commented 3 years ago

A quick look shows that 4 of the trouble features are among the top 5 in importance by gain. Based on the context of the problem being modeled, they should be important features. It is possible that max_depth is too large; I set it to 4. I will try max_depth = 1 as you suggest and see what happens.