I'm using xrf on a problem that includes binary (0 or 1) predictor variables. The rules involving the binary features all seem to take the form X_i >= -0.0000009537. Since the features have values 0 and 1, such a rule taken at face value means nothing, since it will always be met. I'm assuming this is some kind of numerical rounding issue and it really means X_i > 0 (i.e., X_i = 1). Is that correct?
Thanks for the report! Could you provide a reproducible example? A simple case that does split like we'd expect:
```r
library(xrf)

set.seed(55414)
x1 <- rbinom(100, 1, .7)
y <- rnorm(100, 0, .5) + x1
dat <- data.frame(y, x1)
mod <- xrf(y ~ x1, data = dat,
           xgb_control = list(nrounds = 5, max_depth = 2),
           family = "gaussian", sparse = FALSE)
```
```r
> mod$rules
# A tibble: 2 x 5
  split_id rule_id feature split less_than
  <chr>    <chr>   <chr>   <dbl> <lgl>
1 0-0      r0_1    x1        0.5 TRUE
2 0-0      r0_2    x1        0.5 FALSE
```
I vaguely remember encountering this in the past (rules that always evaluate to true on the train set), but it was so infrequent and didn't appear to impact model accuracy that I chalked it up to xgboost weirdness and moved on. If all your features are getting this treatment, that's more concerning.
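If it helps, here is a minimal sketch for flagging rules that always evaluate to TRUE on the training data. This is not part of the xrf API; it assumes `mod$rules` has the columns shown above, that `dat` is the training data, and that `less_than = TRUE` encodes `feature < split` (and FALSE encodes `feature >= split`) -- check those assumptions against your version.

```r
# Split the rule table into one data frame of conditions per rule_id.
rule_conditions <- split(mod$rules, mod$rules$rule_id)

# A rule is the conjunction of its conditions; it is degenerate if
# every condition holds for every training row.
always_true <- vapply(rule_conditions, function(conds) {
  all(mapply(function(feat, cut, lt) {
    vals <- dat[[feat]]
    if (lt) all(vals < cut) else all(vals >= cut)
  }, conds$feature, conds$split, conds$less_than))
}, logical(1))

names(which(always_true))  # rule_ids that never exclude any row
```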
The data I'm working with is proprietary; I may be able to remove identifiers and such. It's pretty large: hundreds of thousands of rows and ~1000 features. The issue is that the glmnet model is giving considerable weight to some of these rules (relatively large coefficients), and the question is whether they are legitimate rules (with X_i > 0, even though the rule as represented is X_i >= a very small negative number). I should add that these are happening in compound rules (involving more than one feature).
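To make the face-value reading concrete, here is a trivial check using the threshold quoted in the original question:

```r
# For a {0, 1} feature, the condition as printed holds for both
# possible values, so on its own it can never exclude a row.
c(0, 1) >= -0.0000009537
#> [1] TRUE TRUE
```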
Here's where I would start with your data:

- Check the affected features for NAs (or other unusual values).
- Fit a plain xgboost model and look at diagnostics (see the sketch below).
- Check whether the bogus splits occur at your max_depth (which may be too deep for some trees in the ensemble), where there's no longer a signal to fit on.

You'll note that my suspicion is that the weirdness is originating in xgboost, and xrf is simply passing along these bogus rules (bundled with other useful ones) to glmnet. Perhaps the quickest way to verify this would be to reduce your max_depth to 1.
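A rough sketch of that diagnostic step: the function and parameter names below are from the xgboost R package, but the design-matrix construction is an assumption about how you'd mirror xrf's input on your own data.

```r
library(xgboost)

# Fit a plain xgboost model on (roughly) the same design matrix xrf
# would build, then dump every node so the raw split thresholds are
# visible. If the ~ -1e-6 cut points show up here, they originate in
# xgboost itself rather than in xrf's rule extraction.
X <- model.matrix(y ~ . - 1, data = dat)
bst <- xgboost(data = X, label = dat$y,
               nrounds = 5, max_depth = 1,  # max_depth = 1 per the suggestion above
               objective = "reg:squarederror", verbose = 0)

xgb.model.dt.tree(model = bst)  # the Split column holds the raw thresholds
```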
A quick look shows that 4 of the troublesome features are among the top 5 in importance according to gain. Based on the context/problem being modeled, they should be important features. It is possible that max_depth is too large; I set it to 4. I will try max_depth = 1 as you suggest and see what happens.