grf-labs / grf

Generalized Random Forests
https://grf-labs.github.io/grf/
GNU General Public License v3.0

Number of observations to report, how many? #1463

Open ninetale opened 1 month ago

ninetale commented 1 month ago

Hello,

I am currently utilizing the causal forest package and have some questions regarding the observations. Let’s assume there are 1,000 observations.

I am using the following code:

cf <- causal_forest(X, Y, W, Y.hat = Y_hat_re, W.hat = W_hat_re, honesty = TRUE, tune.parameters = "all", sample.fraction = 0.8, num.trees = 20000)

  1. From my understanding, this would allocate 400 observations each to the training and estimation sets. Is my understanding correct?

  2. Furthermore, should I assume that the 'test_calibration(cf)' function operates based on a test set of 200 observations?

  3. Lastly, I would like to know how many observations the best_linear_projection(cf, X, target.sample = "overlap") function targets. (I understand that "overlap" implies weighting, not the exclusion of observations.)

Thank you for developing and maintaining such a useful package.

erikcs commented 1 month ago

Hi @ninetale, the effective number of samples used for estimation in both 1 and 2 is n = 1000. The honesty split happens per tree: each tree draws a subsample of sample.fraction * n = 800 observations and sets aside honesty.fraction * sample.fraction * n = 400 of them (with the default honesty.fraction = 0.5) for honest estimation, using the rest to place splits. 3: yes, although the overlap weights can be zero for some observations.
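The per-tree accounting above can be sketched in plain R. This is only an arithmetic illustration of the reply, assuming the grf default honesty.fraction = 0.5 for the parameter the original call does not set:

```r
# Per-tree sample accounting for the causal_forest call in the question.
# honesty.fraction is assumed to be the grf default of 0.5,
# since the call does not set it explicitly.
n <- 1000
sample.fraction <- 0.8
honesty.fraction <- 0.5

subsample  <- sample.fraction * n             # 800 observations drawn per tree
honest.set <- honesty.fraction * subsample    # 400 reserved for honest estimation
split.set  <- subsample - honest.set          # 400 used to place splits

# The split is redrawn independently in every tree, so across the whole
# forest all n = 1000 observations contribute to estimation.
cat(subsample, honest.set, split.set, "\n")  # prints 800 400 400
```

Because the honest/split partition is per tree rather than a one-time global split, there is no fixed held-out set of 200 observations anywhere in this arithmetic.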

ninetale commented 1 month ago

Thanks for the reply, @erikcs. Then my understanding is correct.

That is,

in 'cf <- causal_forest(X, Y, W, Y.hat = Y_hat_re, W.hat = W_hat_re, honesty = TRUE, tune.parameters = "all", sample.fraction = 0.8, num.trees = 20000)',

it uses 400 observations for forest construction and 400 for estimation; and since 200 observations are held out, I guess 'test_calibration(cf)' uses those 200.

---

However, the question I still have is:

"Does 'best_linear_projection(cf, X, target.sample = "overlap")' use all 1,000 original observations?" (That is, of the 1,000 observations, 400 go to model construction, 400 to estimation, and 200 to the calibration check, but all 1,000 are used again for the heterogeneity-of-effects analysis?)

If not, do I have to divide the original 1,000 observations for best_linear_projection as in the following steps?

For example:

  1. Divide the observations into two sets of 600 and 400.
  2. Use 200 of the 600 for forest construction and 200 for estimation (in 'causal_forest').
  3. Use the remaining 200 of the 600 for 'test_calibration'.
  4. Observe the heterogeneity of effects using the 400 observations set aside in step 1.

Should we follow this procedure?

I find this confusing, as the procedure differs somewhat from the usual machine-learning or deep-learning workflow for prediction.

Thank you.