Closed: PolinaKoroleva closed this issue 4 years ago
We don't implement exactly those functions; however, this note has code snippets that can be used for similar purposes: Estimating Treatment Effects with Causal Forests: An Application [https://arxiv.org/abs/1902.07409].
For examples on how to validate the predictive accuracy of a CATE model, see Quasi-Oracle Estimation of Heterogeneous Treatment Effects [https://arxiv.org/abs/1712.04912] (especially Section 2).
Based on the output of the `test_calibration` function, it looks like in this example the CATE detection is significant, but also a little over-regularized (i.e., the true variation in the CATE may be larger than the variation in the forest estimates). This is a case where local linear forests may help.
@PolinaKoroleva In case it helps:
I have found the Qini curves and calibration plots to be unreliable and subject to inflation, due to the weightings used in the calculation of the `pred` variable in the R uplift package. This has also been commented on by others; see Gutierrez and Gerardy 2016, and the "Evaluation metrics" section of https://tech.wayfair.com/2018/10/pylift-a-fast-python-package-for-uplift-modeling/, which states:
Because Nt and Nc do not depend on φ, if the treatment/control imbalance is not random as a function of φ, the Qini curve can be artificially inflated.
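The quoted point can be illustrated with a toy computation. The sketch below (hypothetical Python; `qini_curve` is an illustrative helper, not a function from any package) follows the common convention of normalizing by the *overall* treated/control counts, which is exactly what makes the curve inflatable when the treated fraction drifts with the score:

```python
import numpy as np

def qini_curve(uplift_score, y, w):
    """Toy Qini curve: cumulative incremental responses when targeting
    units in order of descending predicted uplift. Uses the overall
    counts N_t and N_c (which do not depend on the score phi), so if
    treatment assignment is not random as a function of phi, the curve
    can be artificially inflated."""
    order = np.argsort(-uplift_score)
    y, w = y[order], w[order]
    n_t, n_c = w.sum(), (1 - w).sum()   # overall counts, independent of phi
    cum_yt = np.cumsum(y * w)           # cumulative treated responses
    cum_yc = np.cumsum(y * (1 - w))     # cumulative control responses
    return cum_yt - cum_yc * n_t / n_c  # incremental responses at each cutoff
```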
@swager @jtibshirani would a Qini-plot or similar visualization be a feature of interest? I would be happy to contribute an initial PR if so.
@ras44 I'd be happy to talk more about Qini plots if you're interested in contributing a PR. Two things to think about are:
@swager Certainly, I'd be happy to talk more. As suggested, there is likely a bit of depth to the topic.
In addition to the points you mentioned, a user might also want to estimate the CATE and plot the Qini curve on a new dataset (e.g., a randomized or observational period two weeks after the training period), to see whether the model holds up or whether the CATE is time-dependent.
If there is a vision for a first example, I'd be happy to try to implement it. I can also provide some toy models and plots via a PR to illustrate some of the issues I've run into with traditional Qini plots (particularly in observational cases where treatment and control are not equally weighted).
Perhaps such a PR would be a good forum for further discussion? And though we could share code and ideas and discuss via the PR, there is no expectation on my end that it will actually be merged anytime soon :)
Sounds good, @ras44, thanks! Happy to take a look if you have ideas about how to build this.
@swager I've provided a start to this in the `dev` branch of the following repo: https://github.com/ras44/uplifteval/tree/dev
I've implemented a few things:
In the vignettes:
Perhaps the pylift port would be a possible jumping-off point?
Re non-constant treatment propensities: pylift does include a `p` parameter, which they call "policy" but which is related to treatment propensities. I believe the `W.hat` predictions from the regression forest could be used here.
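As a sketch of that idea (hypothetical Python; this is not pylift's or grf's actual code), propensity estimates such as `W.hat` could replace the global N_t/N_c ratio via inverse-propensity weighting, so that non-random treatment assignment does not inflate the curve:

```python
import numpy as np

def qini_curve_ipw(uplift_score, y, w, w_hat):
    """Hypothetical propensity-adjusted uplift curve: each observation
    is weighted by 1 / w_hat (treated) or 1 / (1 - w_hat) (control).
    w_hat plays the role of grf's W.hat or pylift's `p` parameter."""
    order = np.argsort(-uplift_score)
    y, w, w_hat = y[order], w[order], w_hat[order]
    wt_t = w / w_hat                 # inverse-propensity weights, treated
    wt_c = (1 - w) / (1 - w_hat)     # inverse-propensity weights, control
    cum_t = np.cumsum(y * wt_t)      # weighted cumulative treated responses
    cum_c = np.cumsum(y * wt_c)      # weighted cumulative control responses
    n_t = np.cumsum(wt_t)            # effective treated sample size so far
    n_c = np.cumsum(wt_c)            # effective control sample size so far
    # incremental response among the targeted fraction, propensity-weighted
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(n_c > 0, cum_t - cum_c * n_t / n_c, cum_t)
```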
Looking forward to your comments.
This discussion seems to have become stale. I will close it for now; please reopen if there are any developments.
Thanks @halflearned - I've been distracted by other commitments but hope to be able to contribute again soon. I've really enjoyed using this package.
For reference, adding Qini plots and other visualizations may be somewhat related to the work in #420 .
In traditional supervised machine learning, we evaluate prediction accuracy using the actual values of the response variable. In heterogeneous treatment effect modeling, ground truth is not available. Some alternative ways of assessing model performance include Qini curves and Qini coefficients, and the calibration plot (Guelman 2014). To build a calibration plot, (1) we obtain CATE predictions on a test set, (2) rank-order them and group them into bins with an equal number of observations each, and (3) plot the average predicted versus the average actual treatment effect for each bin. These methods are often used with the uplift package in R, but I have never seen such an implementation for GRF. Is there any reason why such metrics are not in use?
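The three steps of the calibration plot can be sketched as follows (hypothetical Python; `calibration_bins` is an illustrative helper, and the bin-level "actual" effect here is a naive difference in means, which assumes randomized treatment):

```python
import numpy as np

def calibration_bins(tau_hat, y, w, n_bins=5):
    """Sketch of the calibration-plot computation: (1) take CATE
    predictions on a test set, (2) rank-order them into equal-sized
    bins, (3) compare the average predicted effect with a simple
    difference-in-means estimate of the actual effect in each bin."""
    order = np.argsort(tau_hat)
    bins = np.array_split(order, n_bins)  # equal-sized groups by predicted effect
    pred, actual = [], []
    for idx in bins:
        yb, wb = y[idx], w[idx]
        pred.append(tau_hat[idx].mean())
        # naive actual effect: mean(treated) - mean(control) within the bin
        actual.append(yb[wb == 1].mean() - yb[wb == 0].mean())
    return np.array(pred), np.array(actual)
```

Plotting `pred` against `actual` (with a 45-degree reference line) then gives the calibration plot described above.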
In GRF, forest models support estimation of variances and standard errors. Can such standard error estimates be used to compare the prediction accuracy of different models?
Also, the `test_calibration` command is available and gives the following results: From these results I conclude that the mean forest prediction is correct, because it is close enough to 1, and `differential.forest.prediction` is > 1, which suggests that the forest has captured heterogeneity. Is my interpretation correct? And can this metric be used to evaluate the prediction accuracy of the model?
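For intuition, the idea behind the calibration test can be sketched as a linear regression (a simplified Python illustration using simulated ground-truth effects, not grf's actual implementation, which uses doubly robust pseudo-outcomes rather than the true effects):

```python
import numpy as np

def calibration_coefficients(tau_hat, tau_true):
    """Regress the (here, simulated) true effect on the mean prediction
    and the demeaned prediction. A coefficient near 1 on the mean term
    says the average prediction is right; a coefficient near 1 on the
    differential term says the heterogeneity estimates are well
    calibrated, while > 1 suggests over-regularization (true variation
    larger than the estimated variation)."""
    mean_pred = np.full_like(tau_hat, tau_hat.mean())
    diff_pred = tau_hat - tau_hat.mean()
    X = np.column_stack([mean_pred, diff_pred])
    beta, *_ = np.linalg.lstsq(X, tau_true, rcond=None)
    return beta  # analogous to [mean.forest.prediction, differential.forest.prediction]
```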