grf-labs / grf

Generalized Random Forests
https://grf-labs.github.io/grf/
GNU General Public License v3.0

Reduced selected.vars leads to larger max. uplift predictions #394

Closed: ras44 closed this issue 5 years ago

ras44 commented 5 years ago

causal_forest seems to produce better (i.e., more in line with other models) uplift predictions when using a reduced number of variables in selected.vars, with selected.vars defined as in the README:

# Note: Forests may have a hard time when trained on very few variables
# (e.g., ncol(X) = 1, 2, or 3). We recommend not being too aggressive
# in selection.
selected.vars = which(forest.Y.varimp / mean(forest.Y.varimp) > 0.2)

tau.forest = causal_forest(X[, selected.vars], Y, W,
                           W.hat = W.hat, Y.hat = Y.hat,
                           tune.parameters = TRUE)
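As a quick numeric illustration of the selection rule above (the importance scores here are made up; in practice forest.Y.varimp comes from variable_importance() on a regression forest):

```r
# Hypothetical importance scores for 8 features, just to show what
# the thresholding rule keeps.
forest.Y.varimp <- c(0.30, 0.02, 0.15, 0.001, 0.25, 0.005, 0.18, 0.09)

# Keep features whose importance exceeds 20% of the mean importance.
selected.vars <- which(forest.Y.varimp / mean(forest.Y.varimp) > 0.2)
selected.vars
# 1 3 5 7 8
```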

On a dataset that includes ~1000 features, a causal_forest trained on the top 30 selected.vars generally produces larger uplift estimates than one trained on the top 100. The larger estimates are more in agreement with single-model (S-learner), two-model (T-learner), and transformed-outcome learners. At over 100 features, the uplift disappears almost completely.
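A minimal sketch of this kind of comparison on simulated data (all specifics here are illustrative: the real report used ~1000 features and top-100 vs top-30 subsets, scaled down here for speed):

```r
# Sketch: train causal forests on differently sized feature subsets,
# ranked by Y-model variable importance. Requires the grf package.
library(grf)

set.seed(1)
n <- 500; p <- 40
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)
# Treatment effect supported on X[, 1] only; X[, 2] is a main effect.
Y <- pmax(X[, 1], 0) * W + X[, 2] + rnorm(n)

# Rank features by importance in a regression forest on Y.
forest.Y <- regression_forest(X, Y)
varimp <- variable_importance(forest.Y)

top.10 <- order(varimp, decreasing = TRUE)[1:10]
top.30 <- order(varimp, decreasing = TRUE)[1:30]

cf.small <- causal_forest(X[, top.10], Y, W)
cf.large <- causal_forest(X[, top.30], Y, W)

# Compare the spread of out-of-bag uplift (CATE) predictions.
range(predict(cf.small)$predictions)
range(predict(cf.large)$predictions)
```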

I'm not sure I understand why this should be so given the nature of the causal forest algorithm. Any thoughts or suggestions would be greatly appreciated!

swager commented 5 years ago

If you only train the causal forest on the top 30 variables, then the forest is better able to focus on these variables -- and will better fit the signal if it is in fact supported on these 30 or so variables. On the other hand, with 100 variables, the forest needs to do more variable selection and so may end up regularizing more.

This filtering is giving the causal forest more information, because you're telling it to home in on the 30 variables that mattered the most for the Y-model; whereas if you don't tell it anything, the causal forest doesn't have any "prior" on which features matter the most for modeling the CATE.

ras44 commented 5 years ago

Thanks for your feedback. I wonder if the selected.vars condition might be a valuable tuning parameter when cross-validating on the R-learner objective function. I believe you mention this topic in Estimating Treatment Effects with Causal Forests: An Application [https://arxiv.org/abs/1902.07409], where you reference Basu, Kumbier, Brown, and Yu (2018). Before reading about this technique, I used cv.glmnet to select an optimal subset of variables correlated with the outcome, though I don't have evidence that this is truly optimal for estimating the CATE.

swager commented 5 years ago

Yeah, this is an interesting class of questions to investigate; however, I think the optimal thing to do may be rather problem-specific. So I wouldn't feel ready to add this as an automatic tuning option: it's fairly easy to implement tuning on the number of variables if one wants to do so, and if we added an automatic option, people might start using it without considering whether it's the best approach for their problem.

ras44 commented 5 years ago

I agree that it is likely quite problem-specific. I imagine it would also be quite an expensive automatic tuning process, since W.hat and Y.hat would have to be recomputed over a broad range of subset sizes.

I was surprised by the behavior, though, probably because of my previous experience using an S- or T-learner with xgboost. Those seem not to degrade with the addition of uncorrelated variables, though that might also just be specific to my use case. The uplift signals produced by a well-tuned causal forest appear much clearer, though.

jtibshirani commented 5 years ago

I'm going to close this, as it sounds like we don't plan to incorporate this functionality right now. If we get more feedback on this point, or our thinking changes, we can re-open and discuss further.