grf-labs / grf

Generalized Random Forests
https://grf-labs.github.io/grf/
GNU General Public License v3.0

cross-fitted calibration test - Athey et al. (2024) #1407

Closed lucy-temed closed 3 months ago

lucy-temed commented 3 months ago

In this paper: https://arxiv.org/pdf/2310.08672.pdf

The authors report the results of a "cross-fitted" calibration test: "we randomly divide the sample into ten folds, and for an observation i in a given fold estimate their CATE τ(Xi) by τ̂(Xi) using a causal forest τ̂ fitted only on the other folds. We then run the calibration regression by fold and aggregate the resulting coefficient and standard error estimates"

  1. How are the regression coefficients of different folds aggregated?

  2. I've always wondered whether OOB predictions were enough to run a calibration test, or whether the test was meant to be run on a held-out test sample. This paper suggests that we should indeed run calibration tests on the test sample (not on the training sample), which would be consistent with standard practice for RATE. Can you confirm that the calibration test should be run on a test sample, or is it OK to run it on the training sample?
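
For concreteness, the cross-fitting step quoted above can be sketched in R with grf. This is a hedged illustration, not the authors' exact implementation: the simulated data, the use of `get_scores()` from a full-sample forest for the per-fold regressions, and the simple averaging of fold-level slopes are all assumptions, since the paper's precise aggregation is not specified here.

```r
# Hedged sketch of cross-fitted CATE predictions with grf, following the
# fold scheme quoted above. NOT the paper's exact procedure.
library(grf)

set.seed(1)
n <- 2000; p <- 5
X <- matrix(rnorm(n * p), n, p)          # simulated covariates (assumption)
W <- rbinom(n, 1, 0.5)                   # randomized binary treatment
Y <- pmax(X[, 1], 0) * W + X[, 2] + rnorm(n)

K <- 10
fold <- sample(rep(1:K, length.out = n))
tau.hat <- numeric(n)

# For each fold, fit a causal forest on the OTHER folds and predict
# on the held-out fold.
for (k in 1:K) {
  train <- fold != k
  cf <- causal_forest(X[train, ], Y[train], W[train])
  tau.hat[fold == k] <- predict(cf, X[fold == k, ])$predictions
}

# Per-fold calibration-style regression: a simplified stand-in that
# regresses doubly robust scores on the demeaned cross-fitted predictions
# within each fold, then averages the slopes. The scores here come from a
# full-sample forest, and simple averaging across folds is an assumption;
# Athey et al. (2024) may aggregate differently.
full.forest <- causal_forest(X, Y, W)
gamma <- get_scores(full.forest)         # doubly robust scores
slopes <- numeric(K)
for (k in 1:K) {
  idx <- fold == k
  fit <- lm(gamma[idx] ~ I(tau.hat[idx] - mean(tau.hat[idx])))
  slopes[k] <- coef(fit)[2]
}
mean(slopes)  # a slope near 1 suggests well-calibrated heterogeneity
```

How to combine the fold-level standard errors (question 1 below) is exactly the detail that is not pinned down here, so the sketch stops at the point estimates.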

Thanks.

erikcs commented 3 months ago

1. It's probably best to contact the paper authors if you want exact details of how something is implemented.
2. Yes, as with RATE.

lucy-temed commented 3 months ago

Thanks very much for your response @erikcs! Given your answer to (2) above, it would be useful to update the function test_calibration, which currently only works with OOB predictions. To actually run the calibration test with cross-fitted predictions, one basically needs to re-code the function manually.

Further, given that one of the authors of the paper I mentioned is also an author of GRF, it seems natural that the procedure they describe as a correct use of the calibration test could be discussed in this forum. Note that, as the authors do in the paper I cited (Athey et al. 2024), the use of calibration is complementary to RATE. I know that for this type of question you usually refer people to RATE. That is certainly useful, but following Athey et al. (2024), it would also be useful to add a reference to a proper use of test_calibration, or to modify the function to allow the user to directly provide predictions instead of requiring a causal forest and its OOB predictions --- thinking of something like rank_average_treatment_effect_fit.

Thanks so much again!

erikcs commented 3 months ago

@lucy-temed, yes, that makes sense. We'd prefer to devote developer resources to one main preferred evaluation tool, RATE in this case. Then, if some users really want something complementary (maybe to appeal to econ or other audiences unfamiliar with RATE), they can put that together themselves.

erikcs commented 3 months ago

Hi @lucy-temed, I caught up with @susanathey and believe Jann Spiess is planning a replication repo for the paper you mentioned that would include code that illustrates how to run their analysis using grf.

We can let you know once that becomes available, though it will probably be a while. Then, if you and other users find it useful, we can look into integrating such functionality into the grf package!