grf-labs / grf

Generalized Random Forests
https://grf-labs.github.io/grf/
GNU General Public License v3.0

Why should we use predicted ITE to assess heterogeneity in treatment effects, not doubly-robust estimates? #1372

Closed · Jinwoo-Yi closed this 7 months ago

Jinwoo-Yi commented 7 months ago

Hi, I have a question regarding the assessment of heterogeneity in treatment effects using GRF. I understand that there are three ways to evaluate heterogeneity with GRF: (1) the calibration test based on the best linear predictor approach, (2) GATE (Athey & Wager, 2019; Algorithm 1), and (3) the RATE test (Yadlowsky et al., 2021). For my research, I tried to apply all three approaches to test for the existence of heterogeneity in treatment effects.
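For concreteness, here is a minimal sketch of how two of these checks are typically run with grf, using simulated placeholder data (the variables `X`, `W`, `Y` and the train/evaluation split below are illustrative assumptions, not from my actual analysis); the GATE grouping itself is sketched after the answer below:

```r
library(grf)

# Simulated placeholder data: covariates X, binary treatment W, outcome Y.
n <- 4000; p <- 10
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)
Y <- pmax(X[, 1], 0) * W + X[, 2] + rnorm(n)

cf <- causal_forest(X, Y, W)

# (1) Calibration test based on the best linear predictor.
test_calibration(cf)

# (3) RATE test: the prioritization rule should be built independently of the
# evaluation data, so train one forest to rank units and evaluate on another.
train <- sample(n, n / 2)
cf.rank <- causal_forest(X[train, ], Y[train], W[train])
cf.eval <- causal_forest(X[-train, ], Y[-train], W[-train])
priorities <- predict(cf.rank, X[-train, ])$predictions
rank_average_treatment_effect(cf.eval, priorities)
```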

My question is specifically about the second one, GATE. In Algorithm 1 of Athey & Wager (2019), samples are grouped according to their leave-one-out (out-of-bag) predicted individual treatment effect (ITE) from the 'predict' function. I wonder why the doubly robust estimates of the ITE (available via the 'get_scores' function) were not used here. Is this just a sanity check of model performance, i.e., of the correspondence between prediction and estimation? Or are there other considerations behind this prediction-based grouping?

p.s.) Since the only available tag is 'bug report', I am uploading this question with that tag.

swager commented 7 months ago

Thanks for raising this question -- it comes up fairly regularly in class as well. The key point here is that doubly robust scores (i.e., the quantities that, when averaged, yield the AIPW estimate of the average treatment effect) are not accurate estimates of the CATE -- rather, they are low-bias but very noisy estimates of the CATE (with order-one variance even in large samples).
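As a rough numerical check (continuing with the hypothetical forest `cf` from the sketch above; the near-equivalence to `average_treatment_effect()` assumes the default AIPW setting with a binary treatment):

```r
scores <- get_scores(cf)   # doubly robust (AIPW) scores, one per unit

# Individually the scores are dominated by noise ...
summary(scores)
sd(scores)                 # spread stays order-one even as n grows

# ... but their average recovers the AIPW estimate of the ATE.
mean(scores)
average_treatment_effect(cf)  # should give essentially the same point estimate
```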

The implication is that each doubly robust score is essentially useless on its own (it's mostly noise), and in particular can't be used to estimate the CATE of the unit the score was constructed for (if one needs estimates of the CATE at specific points, one should just use predictions from the causal forest).

Where the doubly robust scores become useful is when you average over many units: the noise averages out, and the low-bias property then shines through. Doubly robust scores should only be used in this way, i.e., averaged across large groups of samples. A single doubly robust score is not interpretable on its own and should not be used for categorizing units.
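Putting the two pieces together, a minimal sketch of the grouped (GATE-style) use of the scores, continuing the hypothetical forest `cf` above: rank units by their out-of-bag CATE predictions, then estimate each group's treatment effect by averaging doubly robust scores within the group (the grouping here simply uses out-of-bag predictions, as described in the question; see Athey & Wager, 2019, for the exact procedure):

```r
# Group units by quartile of their out-of-bag CATE predictions ...
tau.hat <- predict(cf)$predictions
quartile <- cut(tau.hat, quantile(tau.hat, seq(0, 1, 0.25)),
                include.lowest = TRUE, labels = 1:4)

# ... and estimate each group's ATE by averaging doubly robust scores.
scores <- get_scores(cf)
gate <- t(sapply(levels(quartile), function(g) {
  s <- scores[quartile == g]
  c(estimate = mean(s), std.err = sd(s) / sqrt(length(s)))
}))
gate

# Equivalent built-in route for a single group:
average_treatment_effect(cf, subset = quartile == "4")
```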