mwelz / GenericML

R implementation of Generic Machine Learning Inference (Chernozhukov, Demirer, Duflo and Fernández-Val, 2020).
GNU General Public License v3.0

Inference on Data from Observational Study #8

Closed tetokawata closed 2 years ago

tetokawata commented 2 years ago

Thank you for the nice package. It has been very useful for my research and teaching.

As I understand it, the current package applies only to data from randomized experiments. Do you have plans to extend the package to work with observational data (for instance, by implementing Jacob (2020), https://arxiv.org/abs/1911.02688)?

mwelz commented 2 years ago

Thank you for your interest in our package. Indeed, GenericML in its current version is intended for randomized experiments only, since the method it implements, Chernozhukov et al. (2020), is designed for randomized experiments. In fact, GenericML() will throw a warning if it believes that the input data are not from a randomized experiment.

I see that extending the package to observational studies could be useful. Let me discuss this with my co-authors and I'll get back to you soon.

tetokawata commented 2 years ago

@mwelz Thanks for the feedback!!! Looking forward to future updates.

mwelz commented 2 years ago

Dear @tetokawata,

After consulting my co-authors, we have agreed that the paper you shared with us, Jacob (2020), does not seem to be at an advanced enough state to be implemented in the package: it is explicitly labeled as a draft and does not provide proofs for its claims. If an updated version of the paper appears one day, we might reconsider. For now, the GenericML package is intended only for data from randomized experiments, so using it on observational data could lead to invalid inference. Only if one has good reason to believe that the propensity scores are precisely estimated may one apply GenericML to observational data, at one's own risk.

All the best, Max

tetokawata commented 2 years ago

Dear @mwelz

Thank you for your consideration and discussion. I understand it well.

maruphossain commented 2 years ago

Is there a way to widen the accepted range of the estimated propensity scores? I am trying to apply GenericML in a difference-in-differences setup. I took the first difference of the outcome variable. However, for some observations, the propensity scores fall outside the range that GenericML allows ([0.35, 0.65]). Any suggestions? Thank you.

mwelz commented 2 years ago

Dear @maruphossain,

Thank you for your interest in our package. In general, propensity scores shouldn't be affected by transformations of the dependent variable Y because propensity scores Pr[D = 1 | X] are estimated without information on Y (D is the binary treatment assignment and X a vector of covariates). May I ask what you passed to the argument learner_propensity_score?

If you are confident that your data come from a randomized experiment, choosing learner_propensity_score = "constant" shouldn't throw the error (because this option takes as the propensity scores the empirical mean of D, which should be about 0.5 in a randomized experiment). The reason why GenericML() only allows for propensity scores in the interval [0.35, 0.65] is because the theory of the paper is valid only for randomized experiments, and we want to avoid that the method gets applied to situations it is not intended for (such as observational studies).
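A minimal base-R sketch of the logic behind the "constant" option as described above (this does not call the package itself; the simulated data are purely illustrative):

```r
set.seed(1)
n <- 1000

# Simulated randomized experiment: binary treatment assigned
# with probability 0.5, independently of any covariates
D <- rbinom(n, 1, 0.5)

# The "constant" option uses the empirical mean of D as the
# propensity score for every observation
p_hat <- rep(mean(D), n)

# In a randomized experiment this is close to 0.5, so all
# scores lie inside the accepted interval [0.35, 0.65]
all(p_hat > 0.35 & p_hat < 0.65)
```

Since every observation receives the same score, no individual score can stray outside the interval as long as the overall treatment share is roughly balanced.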

All the best, Max

maruphossain commented 2 years ago

Dear Max, thank you very much for the explanation. I used the "SVM" learner to generate the scores from a set of pre-intervention covariates. However, as mentioned before, some of the estimated scores fell outside the pre-defined range [0.35, 0.65]. I have a quasi-experimental setup in which the intervention was assigned based on the location of the households. I took the first difference of the outcome variables and controlled for a set of pre-intervention variables, including location, to estimate the propensity scores.

Best regards, Marup

mwelz commented 2 years ago

If treatment assignment occurred non-randomly (which is the case in quasi-experiments), then the data do not come from a randomized experiment. Hence, I'm afraid the theory of the paper may not be applicable to your situation, which ultimately seems to be what triggers the warning.

Concretely, equation (2.3) in the paper (v5 on arxiv) assumes random treatment assignment conditional on the covariates and equation (2.5) requires the true propensity scores to be bounded away from 0 and 1. We have decided to be conservative on this bound by having GenericML() throw a warning if the estimated propensity scores are not inside [0.35, 0.65]. Hence, if some of your estimated propensity scores fall outside this range, there is reason to believe that the assumption in equation (2.3) might be violated in your situation, potentially rendering the theory of the paper invalid. If you nevertheless want to apply the methodology to your data despite this warning, you do so at your own risk.
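As a rough diagnostic (not part of the package), one could estimate the propensity scores separately, e.g. with a logistic regression in base R, and check how many fall outside [0.35, 0.65]. The simulated example below deliberately uses non-random assignment to mimic a quasi-experiment:

```r
set.seed(42)
n <- 2000
x <- rnorm(n)

# Non-random assignment: treatment probability depends on x,
# as it would in a quasi-experiment
p_true <- plogis(1.5 * x)
D <- rbinom(n, 1, p_true)

# Estimate propensity scores via logistic regression
p_hat <- fitted(glm(D ~ x, family = binomial()))

# Fraction of estimated scores outside [0.35, 0.65]; a large
# fraction is a red flag for the randomization assumption
mean(p_hat < 0.35 | p_hat > 0.65)
```

If a substantial share of the estimated scores falls outside the interval, the data are likely far from a randomized experiment and the warning should be taken seriously rather than suppressed.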

All the best, Max

maruphossain commented 2 years ago

Dear Max, thank you very much for taking the time to explain the issue here. I agree with you on the importance of randomized experiments to apply this method. Best regards, Marup

mwelz commented 2 years ago

You're welcome!