uber / causalml

Uplift modeling and causal inference with machine learning algorithms

How to deal with observational data? #713

NHUV closed this issue 7 months ago

NHUV commented 10 months ago

Hello. I would like to create an uplift model to prioritize the best customers to contact. Since observational data is available, I prefer to go that way, as it's less time-consuming than setting up an experiment. Are there any suggestions on how to deal with observational data (e.g. in order to adhere to the unconfoundedness assumption)? I am thinking about incorporating the following methods:

  1. Propensity score matching (PSM). However, I wonder whether this is already applied under the hood when using an X-learner, for example, since I saw that a propensity score is generated if the user doesn't provide one. Can you elaborate here? I didn't find an implementation example for PSM in the notebooks. Is it correct that no such example is available?
  2. Outcome-adaptive lasso. A source citing [29] (Shortreed, S.M., Ertefaie, A.: Outcome-adaptive lasso: variable selection for causal inference. Biometrics 73(4), 1111–1122 (2017)) states: "Feature selection algorithms for observational causal inference, such as the lasso-based approach proposed by [29], are designed to help models whose goal is to reduce confounding". Is there a reason this method is not incorporated in the CausalML package? Does a filter feature selection method suffice if we apply matching (as touched upon in 1.)?

Really looking forward to your recommendations for developing an uplift model with observational data.

Thank you!

t-tte commented 10 months ago

In observational causal inference, the most important step is forming a clear understanding of the possible confounding variables for the causal relationship that you are trying to measure. As things stand, this can only be done by reasoning qualitatively about the specific problem that you're trying to solve, ideally with other people who are also knowledgeable about the problem. Neither Causal ML nor any other current software package can help you with this.

Once you've defined your set of confounding variables, you can use any of the many estimation methods available. The most common one is a multiple linear regression of the outcome on the treatment, with the confounders as covariates. You can use statsmodels, DoWhy, etc. The methods implemented in Causal ML (like X-learner, R-learner) will also work, but they're most likely overkill.
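
A minimal sketch of the regression approach described above, using statsmodels. The data is synthetic and the column names (`age`, `past_spend`, `treatment`, `outcome`) are hypothetical, chosen only for illustration; in practice the confounder set has to come from the qualitative reasoning discussed earlier. The sketch also assumes a constant treatment effect and a correctly specified linear model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000

# Hypothetical confounders: they influence both treatment and outcome.
age = rng.normal(40, 10, n)
past_spend = rng.normal(100, 20, n)

# Treatment assignment depends on the confounders (not randomized).
logit = 0.05 * (age - 40) + 0.03 * (past_spend - 100)
treatment = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Outcome with a known, constant treatment effect (synthetic ground truth).
true_effect = 2.0
outcome = (
    true_effect * treatment
    + 0.3 * age
    + 0.1 * past_spend
    + rng.normal(0, 1, n)
)

df = pd.DataFrame(
    {"age": age, "past_spend": past_spend,
     "treatment": treatment, "outcome": outcome}
)

# Naive comparison: ignores confounders, so the estimate is biased.
naive = smf.ols("outcome ~ treatment", data=df).fit()

# Adjusted model: confounders enter as covariates.
adjusted = smf.ols("outcome ~ treatment + age + past_spend", data=df).fit()

naive_estimate = naive.params["treatment"]
adjusted_estimate = adjusted.params["treatment"]

print(f"naive estimate:    {naive_estimate:.2f}")
print(f"adjusted estimate: {adjusted_estimate:.2f}")
print(adjusted.conf_int().loc["treatment"])
```

On this synthetic data the adjusted coefficient on `treatment` should land close to the simulated effect, while the naive one is inflated by confounding. With real observational data there is no ground truth to compare against, so the estimate is only as credible as the chosen confounder set.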