grf-labs / grf

Generalized Random Forests
https://grf-labs.github.io/grf/
GNU General Public License v3.0

Assessing treatment heterogeneity in instrumental_forest #1244

Closed: minnnjecho closed this issue 1 week ago

minnnjecho commented 1 year ago

Hi, everyone. Thank you for this great package. I want to test whether a heterogeneous treatment effect exists in my instrumental forest model. For your information, the sample size is around 2,300 and there are around 40 covariates in my data.

There are two major challenges I am facing: 1) The test_calibration function does not support instrumental_forest. While the function supports some other forest types, I cannot use it to test for treatment heterogeneity in my instrumental forest model. Is there any technical difficulty in supporting instrumental_forest in test_calibration? I wonder if there are any plans for an update, since best_linear_projection recently started to support instrumental_forest (a sketch of the call I mean appears at the end of this post).

2) The rank average treatment effect (RATE) is unstable. Since I cannot use the test_calibration function, I have tried the rank_average_treatment_effect function. However, I found that the p-values vary substantially with the parameters I use. For example, if I change tune.parameters from 'all' to c('sample.fraction', 'mtry', 'min.node.size', 'alpha', 'imbalance.penalty'), the p-value increases from 0.06 to 0.97, or decreases from 0.59 to 0.20, depending on the outcome Y. Moreover, the p-value also varies a lot if I change the seed of the instrumental forest model (e.g., from seed=123 to seed=119, and so on). The following is the code I'm using:

set.seed(123, kind = "Mersenne-Twister", normal.kind = "Inversion", sample.kind = "Rejection")

# Fit the forest used to construct the priority ranking on the training fold.
cf.priority <- instrumental_forest(X[train, ], Y[train], W[train], Z[train],
                                   num.trees = 50000,
                                   # tune.parameters = 'all',
                                   tune.parameters = c('sample.fraction', 'mtry', 'min.node.size', 'alpha', 'imbalance.penalty'),
                                   tune.num.trees = 4000, tune.num.reps = 250, tune.num.draws = 4500)

set.seed(123, kind = "Mersenne-Twister", normal.kind = "Inversion", sample.kind = "Rejection")

# Estimate AUTOC on held-out data.
cf.eval <- instrumental_forest(X[-train, ], Y[-train], W[-train], Z[-train],
                               num.trees = 50000,
                               # tune.parameters = 'all',
                               tune.parameters = c('sample.fraction', 'mtry', 'min.node.size', 'alpha', 'imbalance.penalty'),
                               tune.num.trees = 4000, tune.num.reps = 250, tune.num.draws = 4500)
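
To evaluate the ranking, I then compute the held-out RATE along these lines (a sketch of the kind of call I mean; the p-values I quoted above come from this step):

# Priority scores for the held-out fold come from the training-fold forest.
priority.scores <- predict(cf.priority, X[-train, ])$predictions

# Evaluate the ranking with the held-out forest (target = "AUTOC" is the default).
rate <- rank_average_treatment_effect(cf.eval, priority.scores)
rate

# Two-sided p-value from the point estimate and its standard error.
2 * pnorm(-abs(rate$estimate / rate$std.err))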

Can we say that if a certain hyperparameter set yields a low p-value, then the tuning is valid? I wonder if there is any rule of thumb for tuning these hyperparameters. Thank you for your time and all the work!
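
For reference, the best_linear_projection call I mentioned in 1) is along these lines (only a sketch; the choice of covariates in A is purely illustrative):

# Project the doubly robust scores onto a few covariates of interest to get a
# linear summary of effect heterogeneity (here the first 5 columns of the
# training-fold covariates, chosen purely for illustration).
blp <- best_linear_projection(cf.priority, A = X[train, 1:5])
blp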

Best, Minje

erikcs commented 1 year ago

Hi @minnnjecho,

  1. The TOC/RATE can "subsume" the test_calibration exercise, so there are no plans to add IV support there.
  2. A significant held-out RATE suggests the tuned forest was able to detect some HTEs. As you've seen, getting to that point may require some back-and-forth in modeling. There's no rule of thumb beyond highlighting that a) tuning to find signal is hard, and b) there are many sources of randomness that can affect the final result (the train/test split, the tuning grid draws, etc.). For a), reducing the number of parameters to search over may help, as you've done. For b), with a fixed set of hyperparameters, different forest seeds should give very similar results; however, passing different seeds when tuning may naturally produce different results, since the resulting "optimal" forest may differ (there's randomness in tuning, i.e. the initial parameters are drawn randomly; increasing tune.num.reps enlarges the grid of draws and could perhaps make it more "stable"). A rough sketch of the fixed-hyperparameter comparison in b) follows below.
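
As an illustration of b), here is a minimal sketch (reusing the X, Y, W, Z, train, and cf.priority objects from above; the hyperparameter values are arbitrary). With the hyperparameters held fixed, only the forest's internal randomness differs across seeds, so the held-out RATE estimates should be very close:

# Fit the evaluation forest with fixed (untuned) hyperparameters, varying only the seed.
fit.eval <- function(seed) {
  instrumental_forest(X[-train, ], Y[-train], W[-train], Z[-train],
                      num.trees = 50000,
                      sample.fraction = 0.5, mtry = 20, min.node.size = 5,  # arbitrary, for illustration
                      seed = seed)
}

priority.scores <- predict(cf.priority, X[-train, ])$predictions
rate.123 <- rank_average_treatment_effect(fit.eval(123), priority.scores)
rate.119 <- rank_average_treatment_effect(fit.eval(119), priority.scores)

# With this many trees, the two point estimates should be nearly identical.
c(rate.123$estimate, rate.119$estimate)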