py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

Questions about causal analysis class in econML #697

Open kenneth-lee-ch opened 2 years ago

kenneth-lee-ch commented 2 years ago

I have some questions about the causal analysis class in econML.

  1. Does anyone know how to resolve this warning, which appears when I fit the model on the training data? I can't see where to increase the number of iterations in this class:

    Objective did not converge. You might want to increase the number of iterations. Duality gap: 0.004085702084360321, tolerance: 0.0025136819966382977
  2. Does the causal analysis compute the causal effect of each covariate passed to feature_inds one by one, meaning that the first covariate on that list is used as the treatment with the rest as controls, and the same process is then repeated for every covariate passed into feature_inds? How does that work?

  3. How can I change the hyperparameters that the model is tuned over so that they cover a wider range?

  4. What happens if I don't include certain features in feature_inds? Will they still be used in the model?

  5. How is heterogeneity_inds different from feature_inds in the model? What if I include some of the heterogeneity_inds in feature_inds?

  6. I know that the class uses CausalForestDML at some point. Is it OK to include features in the data that are not controls between treatments and outcomes? Also, will it be problematic to include variables that are highly correlated?

  7. How does the model handle the potential interaction effects of the features?

kbattocchi commented 1 year ago

Sorry for the slow response. For 1) and 3), one goal of the class is to automate model-fitting so that users don't need to manage the fitting process themselves; as a result, we don't provide a way to vary the model-fitting process so there's unfortunately no way to override these as a user. If you have a use case you can share where our existing parameter ranges aren't good enough, we could consider widening them, but given the purpose of the class it is unlikely that we will support ways for the user to manipulate the models in any fine-grained way.

For your other questions, here's the basic idea: we are fitting one heterogeneous treatment effect model per "feature" (i.e. individual entry in feature_inds) where we consider that feature as a treatment T, the heterogeneity features corresponding to that feature as features X (which can affect the outcome, the treatment, and the strength of the treatment effect), and the rest of the dataset (whether or not they are included as other elements of feature_inds) as controls W (which can affect the level of the treatment or outcome, but not the strength of the treatment effect itself). (See https://econml.azurewebsites.net/spec/api.html for our general heterogeneous treatment effect setup in terms of Y, T, X, and W).

There is nothing wrong with including the same columns in both feature_inds and heterogeneity_inds; we always remove the column being considered as a feature from the set of heterogeneity features if necessary when building its model. Correlated or irrelevant controls should not be a problem for correctness (although estimates will be more accurate if irrelevant controls are not included); the main problem to be aware of would be including a downstream effect of a feature as a control (e.g. if you have raw age and quantized age, then putting raw age in feature_inds while having the quantized age as another column in your dataset would give hard-to-interpret results because we're controlling for something which is downstream of our treatment).
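To make the role assignment described above concrete, here is a minimal sketch in plain Python. It is not EconML's implementation: the helper `assign_roles` and the column names are hypothetical, and it assumes a single shared list of heterogeneity columns for all features. It only illustrates which columns would act as T, X, and W for each entry in feature_inds, including dropping the current feature from its own heterogeneity set.

```python
def assign_roles(columns, feature_inds, heterogeneity_inds):
    """Hypothetical helper: for each feature, list its treatment T,
    heterogeneity features X, and controls W (illustration only)."""
    roles = {}
    for feature in feature_inds:
        # The feature being treated is removed from its own heterogeneity set.
        x_cols = [c for c in heterogeneity_inds if c != feature]
        # Every remaining column is a control, even if it is also in feature_inds.
        w_cols = [c for c in columns if c != feature and c not in x_cols]
        roles[feature] = {"T": feature, "X": x_cols, "W": w_cols}
    return roles


# Hypothetical input columns (the outcome y is kept separately).
columns = ["age", "income", "tenure", "region"]
roles = assign_roles(
    columns,
    feature_inds=["age", "income"],
    heterogeneity_inds=["age", "region"],
)

for feature, role in roles.items():
    print(feature, role)
```

Note that in this sketch "income" still appears as a control W when "age" is the treatment, which matches the description that columns act as controls whether or not they are also listed in feature_inds.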

This is intended to represent a limited but easy-to-interpret causal graph, so we don't model feature interactions; we are just computing the direct effect of each feature one-by-one on the outcome.

Hope that helps, and let us know if there is still anything unclear.

kenneth-lee-ch commented 1 year ago

@kbattocchi , thank you! I have some follow-up questions.

  1. For my question 1, should I be concerned about the warning message?
  2. I don't quite understand what you mean here: "Correlated or irrelevant controls should not be a problem for correctness (although estimates will be more accurate if irrelevant controls are not included)". What is correctness based on, if not on accuracy?
  3. This question is more about causal forests specifically. I know that a causal forest computes the heterogeneous treatment effect by taking the average difference in outcome between treated and untreated units within each leaf. How does it split the treated and untreated groups for continuous treatment variables?
  4. How should I interpret the resulting treatment effect estimates from CausalForestDML? Right now, I can only say whether there is a positive or negative effect on the outcome, but I cannot make a statement like "a unit increase in X causes a unit increase in Y" as in linear regression, right?
  5. Is it true that the CausalAnalysis class only uses CausalForestDML when I set heterogeneity_model="forest"?
  6. When I print the table that lists all the causal estimates from the CausalAnalysis class, treatment variables that are categorical with multiple categories show only some comparisons. For example, for a variable Animal that takes 3 values "dog", "cat", "rat", the table shows a row Animal with dogvcat and dogvrat, but I don't see ratvcat. How should I interpret this?

kbattocchi commented 1 year ago

  1. It's hard to say. If you're using 'automl' for your first stage models and, say, Lasso converges poorly on your data, then hopefully a different model will be selected and it won't matter at all. Even if the Lasso model is chosen (or you're using 'linear' instead of 'automl' for the nuisance models), it's not clear that the model is fitting badly: the duality gap is higher than the convergence tolerance, but that does not necessarily mean a bad fit.
  2. Your confidence intervals should still contain the ground truth most of the time; they'll just be wider. The point estimate will generally never be exactly right, but regardless of irrelevant controls the error should decrease towards zero as you add more observations, just more slowly than if you did not have them.
  3. CausalForestDML fits a causal forest to the first stage residuals, described in more detail here.
  4. The output dataframes from the *_causal_effect() methods (for example, global_causal_effect()) do have that same interpretation, of a linear effect of the treatment on the outcome.
  5. Yes, the heterogeneous treatment effect model will be LinearDML if heterogeneity_model is "linear" (the default), or CausalForestDML if it is "forest".
  6. One category is considered the baseline (by default, the value that sorts first, but this can be explicitly specified using the categories argument, where the first category in that column's list will be treated as the control), because effects can only be identified relatively, not in absolute terms for each category. You can compute the values of any transition that you care about from the ones that you already have: ratvcat would be dogvcat-dogvrat (going from rat to cat is the same as going from rat to dog and then from dog to cat, and rat to dog is the negative of dog to rat).
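
The arithmetic in point 6 can be sketched in a few lines of Python. The numbers below are made up for illustration, the helper `contrast` is hypothetical, and labels are read in the direction used in the reply above ("dogvcat" is the effect of going from dog to cat, with dog as the baseline). This only combines point estimates; confidence intervals cannot be obtained by simple subtraction because the two estimates are correlated.

```python
# Hypothetical point estimates relative to the baseline category "dog".
effects = {"dogvcat": 0.30, "dogvrat": -0.10}


def contrast(effects, baseline, start, end):
    """Effect of moving from `start` to `end`, given effects measured
    from `baseline` to every other category (illustration only)."""
    def from_baseline(category):
        # Moving from the baseline to itself has zero effect by definition.
        if category == baseline:
            return 0.0
        return effects[f"{baseline}v{category}"]

    # start -> end equals (baseline -> end) minus (baseline -> start).
    return from_baseline(end) - from_baseline(start)


rat_to_cat = contrast(effects, "dog", "rat", "cat")  # dogvcat - dogvrat
cat_to_rat = contrast(effects, "dog", "cat", "rat")  # the reverse transition
print(rat_to_cat, cat_to_rat)
```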

kenneth-lee-ch commented 1 year ago

@kbattocchi Thank you for your answer. What is still unclear to me is question 3. How does a causal forest know which group is untreated and which is treated when the treatment variable is continuous, for the purpose of estimating the heterogeneous treatment effects within each leaf?
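
As background to the "first stage residuals" mentioned earlier in the thread, the sketch below shows the residual-on-residual idea on simulated data in plain Python. It is a simplified illustration, not EconML's code: it assumes a single control w, linear nuisance models, no cross-fitting, and a constant effect. The point it illustrates is that with a continuous treatment no treated/untreated split is needed: the effect is the slope of the outcome residuals on the treatment residuals, which reads as "change in y per unit increase in t". A forest-based estimator would, roughly speaking, compute such a slope locally rather than once for the whole sample.

```python
import random


def ols_fit(x, y):
    """Slope and intercept of a one-variable least-squares fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(x, y):
    """Part of y not explained by a linear fit on x."""
    slope, intercept = ols_fit(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


# Simulated data (all values are assumptions for illustration).
rng = random.Random(0)
n = 5000
true_effect = 2.0
w = [rng.gauss(0, 1) for _ in range(n)]                 # confounder
t = [0.8 * wi + rng.gauss(0, 1) for wi in w]            # continuous treatment
y = [true_effect * ti + 1.5 * wi + rng.gauss(0, 1) for ti, wi in zip(t, w)]

# First stage: remove what the control explains from treatment and outcome.
t_res = residuals(w, t)
y_res = residuals(w, y)

# Final stage: slope of outcome residuals on treatment residuals.
effect = sum(a * b for a, b in zip(t_res, y_res)) / sum(a * a for a in t_res)

# For comparison: regressing y on t directly ignores the confounder.
naive_effect, _ = ols_fit(t, y)
print(round(effect, 2), round(naive_effect, 2))
```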

titubs commented 1 year ago

@kbattocchi Hi Keith, do you have a GitHub example showing how to apply causal analysis to a df, please? It is unclear to me.