grf-labs / grf

Generalized Random Forests
https://grf-labs.github.io/grf/
GNU General Public License v3.0

Best Linear Predictor Specifications, GRF & DML #712

Closed: vinnyarmentano closed this issue 3 years ago

vinnyarmentano commented 4 years ago

Hi GRF Team,

I am interested in understanding the differences between the Best Linear Predictor (BLP) specifications employed by GRF's test_calibration() function and by Chernozhukov et al., "Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments," arXiv preprint arXiv:1712.04802 (2017).

I understand from the test_calibration() help file that the GRF BLP specification has a slightly different aim than the Chernozhukov et al. BLP: its Beta1 (mean.forest.prediction) is a test of whether the forest has correctly identified the ATE, whereas in the Chernozhukov et al. BLP the Beta1 coefficient itself should be the ATE, in the same units as the dependent variable.
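
For concreteness, here is how I currently understand the two specifications, written in my own notation (so the symbols below are my labeling, not either package's; corrections welcome):

$$\text{GRF:} \quad Y_i - \hat{m}(X_i) = \beta_1\, \bar{\tau}\, (W_i - \hat{e}(X_i)) + \beta_2\, (\hat{\tau}(X_i) - \bar{\tau})\,(W_i - \hat{e}(X_i)) + \varepsilon_i \quad \text{(no intercept)}$$

$$\text{DML:} \quad Y_i = \alpha_0 + \alpha' X_{1i} + \beta_1\, (W_i - p(X_i)) + \beta_2\, (W_i - p(X_i))\,(S(X_i) - \bar{S}) + \varepsilon_i$$

where $\hat{m}$ and $\hat{e}$ are the forest's estimates of $E[Y \mid X]$ and $E[W \mid X]$, $\hat{\tau}(X_i)$ are the CATE predictions with (weighted) mean $\bar{\tau}$, and $S(X_i)$ is the proxy CATE with mean $\bar{S}$.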

Referencing the code provided for Dr. Duflo's 2018 NBER SI lecture as the DML BLP, and the test_calibration() function's code as the CF BLP, there are 5 differences I have been able to identify (listed below).

My questions of interest are:

  1. I would like to understand how these alterations change Beta1 from an estimate of the ATE into a test of the forest's effectiveness, while leaving Beta2 in the same state of being a heterogeneity test.
  2. I would also like to know if I would be violating any assumptions by taking Causal Forest-produced CATE estimates and employing them in the BLP form exactly as it is specified in Chernozhukov et al. (2017).
  3. More generally, I would like to know if there are situations where it is better to employ one form of the BLP specification than the other. Are there any qualitative differences between what these two tests are designed to capture?

The 5 differences I'm interested in understanding, with the relevant lines for each method (a short R sketch contrasting the two regressions follows this list):

  1. Dependent variable: The DML uses only the observed Y value as the dependent variable, while GRF uses the difference between the observed Y and a YHat as predicted by a random forest on the entire sample. (DML Line 273 & CF Lines 51 & 61)
  2. Additional terms: The DML paper includes an additional "a'X1" term, which is a stand-in for a YHat as predicted by a random forest run only on the control sample. In the DML code, the CATE itself is also included in the specification. GRF includes neither of these terms in its model. (DML Line 273 & CF Line 61)
  3. mean.forest.prediction (aka Beta1) construction: The DML simply subtracts each observation's propensity of being treated, while GRF multiplies that difference by a weighted mean of the observations' CATEs. (DML Line 271 & CF Line 52)
  4. differential.forest.prediction (aka Beta2) term construction: The DML uses a simple mean of the CATEs when demeaning them, while GRF uses a weighted mean according to each observation's inverse propensity of being treated. (DML Line 269 & CF Lines 49 & 53)
  5. Constant term: The DML includes a constant term in the specification while GRF does not. (DML Line 273 & CF Line 61)
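
To make the comparison concrete, here is a minimal R sketch of the two regressions as I read the respective code. This is a simplification: the variable names are my own, the a'X1 controls from the DML specification are omitted, and I use a simple mean and plain standard errors where GRF applies observation weights and cluster-robust inference.

```r
library(grf)

# Toy simulated data, purely for illustration.
n <- 2000; num.vars <- 5
X <- matrix(rnorm(n * num.vars), n, num.vars)
W <- rbinom(n, 1, 0.5)
Y <- pmax(X[, 1], 0) * W + X[, 2] + rnorm(n)

cf <- causal_forest(X, Y, W)
tau.hat <- predict(cf)$predictions   # CATE estimates
Y.hat <- cf$Y.hat                    # forest estimate of E[Y | X]
W.hat <- cf$W.hat                    # forest estimate of E[W | X]

# GRF-style calibration regression (differences 1, 3, 5):
# residualized outcome, no intercept.
mean.pred <- mean(tau.hat)
grf.blp <- lm(I(Y - Y.hat) ~ 0 +
                I((W - W.hat) * mean.pred) +             # mean.forest.prediction
                I((W - W.hat) * (tau.hat - mean.pred)))  # differential.forest.prediction

# DML/GML-style BLP with a known, constant propensity of 0.5:
# raw Y on the left-hand side, intercept included, simple demeaning.
p.treat <- 0.5
dml.blp <- lm(Y ~ I(W - p.treat) + I((W - p.treat) * (tau.hat - mean(tau.hat))))

summary(grf.blp)$coefficients  # Beta1 near 1 => average effect well calibrated
summary(dml.blp)$coefficients  # Beta1 is the ATE in the units of Y
```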

I hope all of the above helps define the context of my questions. I’m more than happy to clarify any of the above thoughts if it’ll be of assistance. I’m really looking forward to understanding this. Thank you in advance for your help.

Best, Vinny Armentano

References: Double Machine Learning reference links:

  1. Dr. Duflo’s Lecture at the 2018 NBER SI
  2. Associated Slides (slide 41 of particular interest)
  3. Associated Code

The test_calibration() code I referenced is from here.

I also wrote a brief test script that runs both methods’ BLP specifications on the simulated dataset created by the example in the test_calibration() help file. It is attached to this issue. The two different BLP specifications result in similarly interpreted Beta2 terms. Both also come to the conclusion that an ATE is present. BestLinearPredictor_GRF_DML.txt
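
For reference, the simulated design in that script follows the test_calibration() help-file example, which is along these lines (the exact parameters may differ from the current documentation):

```r
library(grf)

# Treatment effect is pmax(X1, 0), so heterogeneity enters through X1;
# assignment probability also depends on X1, making the design observational.
n <- 800; num.vars <- 5
X <- matrix(rnorm(n * num.vars), n, num.vars)
W <- rbinom(n, 1, 0.25 + 0.5 * (X[, 1] > 0))
Y <- pmax(X[, 1], 0) * W + X[, 2] + pmin(X[, 3], 0) + rnorm(n)

forest <- causal_forest(X, Y, W)
test_calibration(forest)  # prints estimates for mean.forest.prediction
                          # and differential.forest.prediction
```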

vinnyarmentano commented 4 years ago

(Bringing this to the attention of @isabelonate & @dsantamaria94)

swager commented 4 years ago

Thanks for reaching out. The main difference between Duflo's DML tutorial and our test_calibration function is that we assume we're in an observational study where the propensity score needs to be estimated, whereas Duflo works in a randomized study setting where the propensity score is known (usually constant). In response to the specific questions:

  1. Subtracting Y.hat arises from using an "orthogonal" moment condition that's robust to errors in the propensity score. This is an application of the transformation of Robinson (1988); see the note after this list.
  2. Yes, adding a'X1 as predictors is very similar to subtracting a Y.hat estimate obtained via OLS.
  3. The idea here is that, if the average of tau(Xi) were a good estimate of the average effect, then we'd want \beta_1 = 1 (i.e., we want to check the calibration of the average). Duflo wants \beta_1 to be the average effect itself. All that changes is the "scale" of \beta_1.
  4. Here I think there may be a misunderstanding: observation_weight is not the same thing as an inverse-propensity weight, and in fact observation_weight is typically 1 for everyone. We use observation weights when we want to add an additional layer of weighting (e.g., if some observations are censored, so we want to do inverse-probability-of-censoring weighting with the data we have), but that usually only comes up in special cases.
  5. Yes, the idea is that subtracting out Y.hat removes the need for a constant term. Adding in a constant term shouldn't be harmful, though.
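
To spell out the Robinson (1988) transformation referenced in point 1: writing $m(x) = E[Y \mid X = x]$ and $e(x) = E[W \mid X = x]$, unconfoundedness gives

$$Y_i - m(X_i) = \tau(X_i)\,(W_i - e(X_i)) + \varepsilon_i, \qquad E[\varepsilon_i \mid X_i, W_i] = 0,$$

so regressing the residualized outcome on the residualized treatment identifies $\tau$, and first-order errors in the estimates of $m$ and $e$ do not bias the coefficients. The calibration regression substitutes $\beta_1 \bar{\tau} + \beta_2 (\hat{\tau}(X_i) - \bar{\tau})$ for $\tau(X_i)$, so a well-calibrated forest should give $\beta_1 = \beta_2 = 1$.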

Overall, if you're in an RCT (i.e., working with known propensities), then doing the calibration check as in Duflo's slides should be perfectly fine -- and would presumably give very similar results to what we get using test_calibration.
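
For completeness, if the randomization probabilities are known, they can also be supplied to the forest directly instead of being estimated. A minimal sketch, reusing X from the simulation earlier in the thread but with a hypothetical Bernoulli(0.5) assignment:

```r
# Hypothetical RCT: treatment is assigned with known probability 0.5.
W.rct <- rbinom(nrow(X), 1, 0.5)
Y.rct <- pmax(X[, 1], 0) * W.rct + X[, 2] + rnorm(nrow(X))

# Pass the known propensity via W.hat rather than letting the forest estimate it.
cf.rct <- causal_forest(X, Y.rct, W.rct, W.hat = rep(0.5, nrow(X)))
test_calibration(cf.rct)
```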

hhsievertsen commented 3 years ago

Dear @swager and @vinnyarmentano,

Thank you for asking this question, @vinnyarmentano. I had a very similar question.

The modification of Chernozhukov et al.'s approach to "work" in observational studies seems very useful (and I am currently working on applying it). Have you described it in a paper we could cite? Or shall we just cite grf?

Thanks a lot for developing these tools and answering questions!

Hans