py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License
7.01k stars 923 forks

Confounder identification and 'exhaustive-search' #880

Closed soya-beancurd closed 1 year ago

soya-beancurd commented 1 year ago

Version information: v0.7.1

Background: Hello! Thanks for the great package for Causal Inference as a whole!

I'm currently attempting to perform causal inference on rather large sets of data (~250k - 750k rows/entries and 150 - 600+ columns/features, with 1 treatment & 1 outcome variable in all the datasets). At present, I am:

  1. Running causal discovery (i.e., LiNGAM) on the set of data to identify a causal graph
  2. Piping the resultant causal graph to dowhy's CausalModel's identify_effect(), specifically the 'exhaustive-search' method to identify confounders
  3. Feeding the resulting set of confounders, together with the treatment and outcome variables, to both dowhy and econml separately to test out various causal estimation algorithms (a rough sketch of this pipeline follows the list)
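
For concreteness, here is a minimal sketch of what I am doing. The column names ("T", "Y") and the file path are placeholders for my actual data, and I am assuming that identify_effect() accepts method_name="exhaustive-search" (as in recent releases) and that CausalModel accepts a GML string for the graph, so please correct me if any of this is off:

```python
import networkx as nx
import pandas as pd
from dowhy import CausalModel
from lingam import DirectLiNGAM  # causal discovery step

# Placeholder: DataFrame with hypothetical treatment "T", outcome "Y",
# plus the remaining covariates.
df = pd.read_csv("my_data.csv")

# 1. Causal discovery with LiNGAM to obtain an adjacency matrix.
discovery = DirectLiNGAM()
discovery.fit(df)

# Turn the estimated adjacency matrix into a DAG over the column names
# (LiNGAM stores the coefficient of parent j on child i in B[i, j]).
adj = discovery.adjacency_matrix_
graph = nx.DiGraph()
graph.add_nodes_from(df.columns)
for i, child in enumerate(df.columns):
    for j, parent in enumerate(df.columns):
        if adj[i, j] != 0:
            graph.add_edge(parent, child)

# 2. Identification with the 'exhaustive-search' backdoor method.
model = CausalModel(
    data=df,
    treatment="T",
    outcome="Y",
    graph="\n".join(nx.generate_gml(graph)),  # pass the discovered graph as GML
)
identified = model.identify_effect(
    method_name="exhaustive-search",
    proceed_when_unidentifiable=True,
)
print(identified)

# 3. Estimation (here a dowhy propensity-score estimator; econml estimators
# are plugged in separately in the same way via "backdoor.econml.*" methods).
estimate = model.estimate_effect(
    identified,
    method_name="backdoor.propensity_score_weighting",
)
print(estimate.value)
```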

Ask your question: I have a few doubts regarding identify_effect() that I would like to clarify here, if possible:

  1. Based on my understanding (from issues 248, 255 & 261), 'exhaustive-search' attempts to identify all the backdoor paths in the given causal graph. Is it therefore correct to assume that my full set of confounders (to be passed to causal estimation) can be taken as the union of all variables appearing in the backdoor paths identified by 'exhaustive-search'? (A rough sketch of what I mean follows this list.)
  2. As dowhy's identify_effect() also searches for mediator variables, the frontdoor criterion and instrumental variables (IVs), is it possible for a variable to be both an IV and part of a backdoor path (i.e., a confounder)? I have obtained on average 1-3 such overlapping variables using the method described in (1), and was wondering whether (1) is still a valid way of identifying confounders.
  3. Similar to how K-nearest neighbours often suffers from the curse of dimensionality, where every point is roughly equally near (or far from) every other point in a high-dimensional space, would a causal graph with many variables (150 - 600+ in my case) run into a similar issue during identify_effect()? By 'similar issue', I mean the possibility that a backdoor path in a highly complex graph involves most of the variables, so that the approach from question 1 would simply return every non-treatment, non-outcome variable as a confounder.
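
For reference, this is roughly how I am building the "full set of confounders" from the identification result in (1) and checking the overlap with IVs in (2). The attribute/method names on the IdentifiedEstimand (backdoor_variables, get_instrumental_variables()) are my assumption from the docs, so please correct me if they are wrong:

```python
# identified: the IdentifiedEstimand returned by
# model.identify_effect(method_name="exhaustive-search") in the sketch above.

# Assumed: each entry of backdoor_variables is one backdoor set found by the
# exhaustive search, keyed by a name like "backdoor1", "backdoor2", ...
all_backdoor_sets = identified.backdoor_variables  # dict: name -> list of variables

# Question 1: take the union of every variable appearing in any backdoor set.
confounders = set()
for variables in all_backdoor_sets.values():
    confounders.update(variables)

# Question 2: check which of those variables are *also* flagged as instruments.
instruments = set(identified.get_instrumental_variables() or [])
overlap = confounders & instruments
print(f"{len(confounders)} candidate confounders, "
      f"{len(overlap)} also appear as IVs: {sorted(overlap)}")
```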

Thank you very much!!

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 14 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 7 days since being marked as stale.