grf-labs / grf

Generalized Random Forests
https://grf-labs.github.io/grf/
GNU General Public License v3.0

Model evaluation: how well does a model predict treatment effects? #395

Closed PolinaKoroleva closed 4 years ago

PolinaKoroleva commented 5 years ago

In traditional supervised machine learning, we evaluate prediction accuracy against the actual values of the response variable. In heterogeneous treatment effect modeling, ground truth is not available. Some alternative ways of assessing model performance include Qini curves and Qini coefficients, and calibration plots (Guelman 2014). To build a calibration plot, we (1) obtain CATE predictions on a test set, (2) rank-order them and group them into bins with an equal number of observations each, and (3) plot the average predicted versus the average actual treatment effect for each bin. These methods are implemented in the uplift package in R, but I have never seen such an implementation for GRF. Is there a reason why such metrics are not in use?
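For concreteness, the three steps above could be sketched roughly as follows. This is only an illustration, not uplift-package code: it assumes a trained `causal_forest` object `c.forest` and held-out data `X.test`, `Y.test`, `W.test` from a randomized experiment (so the within-bin difference in means is a valid estimate of the "actual" effect); all of these names are placeholders.

```r
library(grf)

# (1) CATE predictions on the test set
tau.hat <- predict(c.forest, X.test)$predictions

# (2) rank-order predictions and group into equal-sized bins
n.bins <- 10
bins <- cut(rank(tau.hat, ties.method = "first"),
            breaks = n.bins, labels = FALSE)

# (3) average predicted vs. average "actual" effect per bin, where the
# actual effect is the treated-minus-control mean outcome (valid under
# randomized treatment assignment)
avg.pred <- tapply(tau.hat, bins, mean)
avg.actual <- sapply(1:n.bins, function(b) {
  idx <- bins == b
  mean(Y.test[idx & W.test == 1]) - mean(Y.test[idx & W.test == 0])
})

plot(avg.pred, avg.actual,
     xlab = "Average predicted effect", ylab = "Average actual effect")
abline(0, 1, lty = 2)  # points on this line indicate perfect calibration
```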

In GRF, forest models support estimation of variance and standard errors. Can such standard error estimates be used to compare the prediction accuracy of different models?
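For reference, the variance estimates mentioned here are obtained via `predict` with `estimate.variance = TRUE`; a minimal sketch (again assuming a trained forest `c.forest` and test matrix `X.test` as placeholders):

```r
# Pointwise variance estimates for the CATE predictions
pred <- predict(c.forest, X.test, estimate.variance = TRUE)
tau.hat <- pred$predictions
sigma.hat <- sqrt(pred$variance.estimates)

# Approximate 95% pointwise confidence intervals
upper <- tau.hat + 1.96 * sigma.hat
lower <- tau.hat - 1.96 * sigma.hat
```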

The test_calibration function is also available and gives the following results:

> test_calibration(c.forest60)

Best linear fit using forest predictions (on held-out data)
as well as the mean forest prediction as regressors, along
with heteroskedasticity-robust (HC3) SEs:

                              Estimate Std. Error t value  Pr(>|t|)    
mean.forest.prediction          0.89133    0.11227  7.9391 2.399e-15 ***
differential.forest.prediction  1.96851    0.14707 13.3846 < 2.2e-16 *** 

From these results I conclude that the mean forest prediction is correct, because it is close enough to 1, and that the differential forest prediction is > 1, which suggests that the forest has captured heterogeneity. Is my interpretation correct? And can this metric be used to evaluate the prediction accuracy of the model?

swager commented 5 years ago

We don't implement exactly those functions; however, the note Estimating Treatment Effects with Causal Forests: An Application [https://arxiv.org/abs/1902.07409] has code snippets that can be used for similar purposes.

For examples of how to validate the predictive accuracy of a CATE model, see Quasi-Oracle Estimation of Heterogeneous Treatment Effects [https://arxiv.org/abs/1712.04912] (especially Section 2).

Based on the output of the test_calibration function, it looks like in this example the CATE detection is significant, but also a little over-regularized (i.e., true variation in CATE may be larger than the variation in the forest estimates). This is a case where local linear forests may help.

ras44 commented 5 years ago

@PolinaKoroleva In case it helps:

I have found Qini curves and calibration plots to be unreliable and subject to inflation due to the weightings used in the calculation of the `pred` variable in the R uplift package. Others have commented on this as well; see Gutierrez and Gerardy 2016, and the "Evaluation metrics" section of https://tech.wayfair.com/2018/10/pylift-a-fast-python-package-for-uplift-modeling/, which states:

> Because Nt and Nc do not depend on φ, if the treatment/control imbalance is not random as a function of φ, the Qini curve can be artificially inflated.
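To make the quoted point concrete, here is a minimal sketch of a standard Qini curve computation (not uplift- or pylift-package code). It assumes a binary outcome `Y`, treatment indicator `W`, and uplift scores `phi` on a test set; all names are illustrative.

```r
# Rank observations by predicted uplift, highest first
ord <- order(phi, decreasing = TRUE)
Y <- Y[ord]
W <- W[ord]
n <- length(Y)

# Running treated/control counts and outcome totals among the top-k targeted
Nt <- cumsum(W)
Nc <- cumsum(1 - W)
Yt <- cumsum(Y * W)
Yc <- cumsum(Y * (1 - W))

# Qini curve: incremental gains, rescaling the control total by Nt/Nc.
# If the treated/control balance Nt/Nc drifts systematically with phi
# (i.e. treatment is not random as a function of the score), this ratio
# can artificially inflate the curve -- the issue described above.
qini <- Yt - Yc * Nt / pmax(Nc, 1)

plot(seq_len(n) / n, qini, type = "l",
     xlab = "Fraction targeted", ylab = "Cumulative incremental gain")
```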

@swager @jtibshirani would a Qini-plot or similar visualization be a feature of interest? I would be happy to contribute an initial PR if so.

swager commented 5 years ago

@ras44 I'd be happy to talk more about Qini plots if you're interested in contributing a PR. Two things to think about are:

ras44 commented 5 years ago

@swager Certainly, I'd be happy to talk more. As suggested, there is likely a bit of depth to the topic.

In addition to the points you mentioned, a user might also want to estimate the CATE and plot a Qini curve on a new dataset (e.g., a randomized or observational period two weeks after the training period), to see if the model holds up or if the CATE is time-dependent.

If there is a vision for a first example, I'd be happy to try to implement it. I can also try to provide some toy models and plots via a PR to illustrate some of the issues I've run into with using traditional Qini plots (particularly in observational cases where T/C are not equally weighted).

Perhaps such a PR would be a good forum for further discussion? Though we could share code and ideas via the PR, there is no expectation on my end that it will actually be merged anytime soon :)

swager commented 5 years ago

Sounds good, @ras44, thanks! Happy to take a look if you have ideas about how to build this.

ras44 commented 5 years ago

@swager I've provided a start to this in the dev branch of the following repo:

https://github.com/ras44/uplifteval/tree/dev

I've implemented a few things:

In the vignettes:

Perhaps the pylift port would be a possible jumping-off point?

Re non-constant treatment propensities: pylift does include a p parameter, which they call "policy" but which is related to treatment propensities. I believe the W.hat predictions from the regression forest could be used here.
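To sketch what I mean (object and data names are illustrative): a trained causal forest stores its out-of-bag propensity estimates, which could plausibly play the role of pylift's `p` when treatment assignment is not uniformly random.

```r
library(grf)

c.forest <- causal_forest(X, Y, W)

# Out-of-bag estimates of e(x) = P(W = 1 | X = x), fit internally by the
# regression forest used for orthogonalization
e.hat <- c.forest$W.hat
```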

Looking forward to your comments.

halflearned commented 4 years ago

This discussion seems to have become stale. I will close it for now; please reopen if there are any developments.

ras44 commented 4 years ago

Thanks @halflearned - I've been distracted by other commitments but hope to be able to contribute again soon. I've really enjoyed using this package.

For reference, adding Qini plots and other visualizations may be related somewhat to work in #420 .