grf-labs / grf

Generalized Random Forests
https://grf-labs.github.io/grf/
GNU General Public License v3.0
938 stars, 250 forks

Outcome variable with highly right-skewed distribution with mass at zero #1375

Open robert702 opened 7 months ago

robert702 commented 7 months ago

Hello, I am trying to fit a causal forest on an outcome that has a highly right-skewed distribution with a mass at zero. I then want to sort the sample into deciles of predicted treatment effects and estimate the treatment effect within each decile, using the standard cross-fitting procedure.
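For concreteness, the ranking step described above could be sketched as follows. This is plain Python rather than grf code, and the names decile_labels and group_means are hypothetical. It assumes the predicted effects in tau_hat were produced out-of-fold (by a model that did not see the unit being ranked) and that a per-unit score, such as a doubly robust score, is already available.

```python
import statistics

def decile_labels(tau_hat):
    """Assign each unit a label 1..10 by the rank of its predicted effect.

    Assumption: tau_hat holds out-of-fold (cross-fitted) predictions.
    """
    n = len(tau_hat)
    order = sorted(range(n), key=lambda i: tau_hat[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        # rank * 10 // n maps ranks 0..n-1 onto buckets 0..9
        labels[i] = rank * 10 // n + 1
    return labels

def group_means(scores, labels):
    """Average a per-unit score within each decile label."""
    groups = {}
    for score, label in zip(scores, labels):
        groups.setdefault(label, []).append(score)
    return {label: statistics.fmean(vals) for label, vals in sorted(groups.items())}
```

Ties in tau_hat are broken by input order here, which is one arbitrary choice among several.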

My intuition is that, given the characteristics of the outcome variable, it would be ideal to use a Poisson regression (a GLM with a log link, i.e., an exponential mean function) to calculate treatment effects. But I don't think the current grf algorithm allows for that; please correct me if I am wrong. More generally, it is not clear to me how to think about Poisson regression with high-dimensional covariate spaces.

As a second-best option, I am thinking of training a standard causal forest and then using a Poisson regression when calculating the treatment effect within each decile of predicted treatment effects. I cannot think of any conceptual issue with doing that, but please correct me if I am wrong.

The only remaining challenge is that, if I had a normal outcome, I would calculate the treatment effect of each decile with the AIPW method. But since the outcome is so right-skewed, I wonder if I can use a version of AIPW in which the outcome is predicted with a Poisson regression, instead of the standard AIPW procedure, which I believe uses a random forest by default. That is, I would use a Poisson-based AIPW estimator instead of the standard AIPW estimator of the average_treatment_effect function in grf. Would an approach like this make sense? Or is there another way to think about right-skewed outcomes in the context of causal forests?
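To make the question concrete, here is a minimal sketch of the AIPW score with plug-in outcome predictions. It is plain Python, not grf's implementation, and the names aipw_scores and estimate_with_se are hypothetical. It assumes mu0 and mu1 are out-of-fold predictions of the outcome under control and under treatment (for example, exp of the linear predictor from a Poisson regression) and that e is an estimated propensity score strictly between 0 and 1.

```python
import math
import statistics

def aipw_scores(y, w, mu0, mu1, e):
    """Per-unit AIPW (doubly robust) scores.

    y: outcomes; w: 0/1 treatment indicators.
    mu0, mu1: predicted outcomes under control / treatment (assumed out-of-fold).
    e: estimated propensity scores, assumed strictly inside (0, 1).
    """
    scores = []
    for yi, wi, m0, m1, ei in zip(y, w, mu0, mu1, e):
        # Inverse-propensity-weighted residual correction
        correction = wi * (yi - m1) / ei - (1 - wi) * (yi - m0) / (1 - ei)
        scores.append(m1 - m0 + correction)
    return scores

def estimate_with_se(scores):
    """Mean of the scores and its plug-in standard error."""
    n = len(scores)
    est = statistics.fmean(scores)
    se = statistics.stdev(scores) / math.sqrt(n)
    return est, se
```

Applying estimate_with_se to the scores of the units in one decile would give that decile's estimate. The outcome model only enters through mu0 and mu1, so swapping a forest for a Poisson regression changes those inputs and nothing else in the formula.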

Thank you,

R

erikcs commented 7 months ago

Hi @robert702, there’s nothing wrong with causal forests and skewed data per se. In some settings there could be a general identification issue, though, for example if an outcome is extremely rare.