py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License

Unable to estimate causal effect with intermediary variable? #69

Open JonasRSV opened 5 years ago

JonasRSV commented 5 years ago

I am having some trouble understanding the errors.

Is it not supposed to be possible to estimate the causal effect in a graph like this?

Screenshot 2019-08-09 at 15 39 07

Here the treatment is 'error_code' and the outcome is 'days_on_grace'.

Here is what I tried:

import pandas as pd

M = pd.DataFrame(
    {"error_code": [601, 501, 500, 400, 100], 
     'grace_period_length': [2, 5, 1, 4, 20], 
     'days_on_grace': [1, 4, 0, 3, 19]})

import networkx as nx

G = nx.DiGraph()

# One node per column of the data
G.add_nodes_from(['error_code', 'grace_period_length', 'days_on_grace'])

# Now add 'causes'

G.add_edge('error_code', 'grace_period_length')
G.add_edge('grace_period_length', 'days_on_grace')

gml = list(nx.generate_gml(G))

import dowhy
from dowhy.do_why import CausalModel

# Use graph
treatment = ['error_code']
outcomes = ['days_on_grace']
model = CausalModel(pd.DataFrame(M[['grace_period_length', 'error_code', 'days_on_grace']]), 
                    treatment, 
                    outcomes, 
                    graph="".join(gml))

identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)

identify_effect seems to always throw an error if the treatment does not have a direct edge to the outcome. Why is this?

Error


KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/networkx/classes/digraph.py in remove_edge(self, u, v)
    732         try:
--> 733             del self._succ[u][v]
    734             del self._pred[v][u]

KeyError: 'days_on_grace'

During handling of the above exception, another exception occurred:

NetworkXError                             Traceback (most recent call last)
<ipython-input-98-5d361b5e14a2> in <module>
----> 1 identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
      2 
      3 print(identified_estimand)

/usr/lib/python3.6/dist-packages/dowhy/do_why.py in identify_effect(self, proceed_when_unidentifiable)
    120                                            self._estimand_type,
    121                                            proceed_when_unidentifiable=proceed_when_unidentifiable)
--> 122         identified_estimand = self.identifier.identify_effect()
    123 
    124         return identified_estimand

/usr/lib/python3.6/dist-packages/dowhy/causal_identifier.py in identify_effect(self)
     22         estimands_dict = {}
     23         causes_t = self._graph.get_causes(self.treatment_name)
---> 24         causes_y = self._graph.get_causes(self.outcome_name, remove_edges={'sources':self.treatment_name, 'targets':self.outcome_name})
     25         common_causes = list(causes_t.intersection(causes_y))
     26         self.logger.info("Common causes of treatment and outcome:" + str(common_causes))

/usr/lib/python3.6/dist-packages/dowhy/causal_graph.py in get_causes(self, nodes, remove_edges)
    164             for s in sources:
    165                 for t in targets:
--> 166                     new_graph.remove_edge(s, t)
    167         causes = set()
    168         for v in nodes:

/usr/local/lib/python3.6/dist-packages/networkx/classes/digraph.py in remove_edge(self, u, v)
    734             del self._pred[v][u]
    735         except KeyError:
--> 736             raise NetworkXError("The edge %s-%s not in graph." % (u, v))
    737 
    738     def remove_edges_from(self, ebunch):

NetworkXError: The edge error_code-days_on_grace not in graph.

I am sorry if this is the wrong forum to ask this question.

amit-sharma commented 5 years ago

Hey @JonasRSV, thanks for bringing up this example. This kind of indirect-effect graph is more commonly used for estimating causal mediation effects. DoWhy currently does not support mediation effects, so the code simply assumes that a direct edge from treatment to outcome exists.

I can answer better if you don't mind providing more details about the goal of your analysis. Can you clarify the effect that you are trying to estimate? From the description, I understand that you want to estimate the effect of error_code on days_on_grace, but the current graph has no observed common causes (confounders), so it reduces to a problem with a cause, an outcome, and no confounders. Is that the correct interpretation?
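The failure in the traceback above can be reproduced with networkx alone. The following is a minimal sketch, not DoWhy's code: it rebuilds the graph from the question, shows that an unconditional remove_edge call raises NetworkXError when the treatment-to-outcome edge is absent, and shows one way to guard the removal. The helper remove_edge_if_present is a hypothetical name introduced only for illustration.

```python
import networkx as nx

# Same structure as the graph in the question: no direct treatment -> outcome edge.
G = nx.DiGraph()
G.add_edge("error_code", "grace_period_length")
G.add_edge("grace_period_length", "days_on_grace")

treatment, outcome = "error_code", "days_on_grace"

# Removing an edge that does not exist raises NetworkXError,
# which is the error shown at the bottom of the traceback.
new_graph = G.copy()
try:
    new_graph.remove_edge(treatment, outcome)
    removal_failed = False
except nx.NetworkXError:
    removal_failed = True


def remove_edge_if_present(graph, u, v):
    """Hypothetical helper: remove edge u -> v only if it exists."""
    if graph.has_edge(u, v):
        graph.remove_edge(u, v)
        return True
    return False


removed = remove_edge_if_present(new_graph, treatment, outcome)

# With the guard in place, the ancestors of the outcome can still be computed.
causes_of_outcome = nx.ancestors(new_graph, outcome)
print(removal_failed, removed, sorted(causes_of_outcome))
```

The guard only avoids the exception. It does not by itself identify a mediation effect, which needs a dedicated estimand.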

JonasRSV commented 5 years ago

Yes, a mediation effect is what I was looking for. This was just an example.

I am looking forward to that feature!

sangyh commented 4 years ago

Hey, is this implemented now? Can I do mediation analyses using DoWhy?

To clarify, I have an edge between treatment and outcome as well as a mediator variable, so I am able to draw the graph.

amit-sharma commented 4 years ago

Not yet, @sangyh. Can you share your causal graph and a motivating example of the effect that you want to calculate? I can work on adding it.

samou1 commented 3 years ago

I am also interested in mediation analysis. From Pearl's book, my understanding is that mediation can be addressed by choosing whether or not to control for the mediator. I have the DAG below. Any thoughts on how to implement the mediation analysis myself for the estimation problem?

Screen Shot 2020-09-04 at 7 36 23 PM

amit-sharma commented 3 years ago

@samou1 Are you looking to calculate the effect of LCD on T2D? Here's a way to do it.

amit-sharma commented 3 years ago

@sangyh @samou1 @JonasRSV Mediation effects are now supported in DoWhy! Do try it out and share your feedback. Here's a full example notebook.

Summary: There are two new estimand types in identify_effect: "nonparametric-nde" (natural direct effect) and "nonparametric-nie" (natural indirect effect).

For estimation, the implemented estimator is simple: it is a two-stage linear regression estimator. The API is general, though: you can specify a first_stage_model and a second_stage_model. I will be adding a non-linear estimator soon. Here's a code sample.

For the direct effect of treatment on outcome

# Natural direct effect (nde)
import dowhy.causal_estimators.linear_regression_estimator

identified_estimand_nde = model.identify_effect(
    estimand_type="nonparametric-nde",
    proceed_when_unidentifiable=True)
print(identified_estimand_nde)

causal_estimate_nde = model.estimate_effect(
    identified_estimand_nde,
    method_name="mediation.two_stage_regression",
    confidence_intervals=False,
    test_significance=False,
    method_params={
        'first_stage_model': dowhy.causal_estimators.linear_regression_estimator.LinearRegressionEstimator,
        'second_stage_model': dowhy.causal_estimators.linear_regression_estimator.LinearRegressionEstimator,
    })
print(causal_estimate_nde)

For the indirect effect of treatment on outcome

# Natural indirect effect (nie)
identified_estimand_nie = model.identify_effect(
    estimand_type="nonparametric-nie",
    proceed_when_unidentifiable=True)
print(identified_estimand_nie)

causal_estimate_nie = model.estimate_effect(
    identified_estimand_nie,
    method_name="mediation.two_stage_regression",
    confidence_intervals=False,
    test_significance=False,
    method_params={
        'first_stage_model': dowhy.causal_estimators.linear_regression_estimator.LinearRegressionEstimator,
        'second_stage_model': dowhy.causal_estimators.linear_regression_estimator.LinearRegressionEstimator,
    })
print(causal_estimate_nie)

The frontdoor criterion is also supported through the same two stage estimator. To use frontdoor, write:

import dowhy.causal_estimators.linear_regression_estimator

causal_estimate = model.estimate_effect(
    identified_estimand,
    method_name="frontdoor.two_stage_regression",
    confidence_intervals=False,
    test_significance=False,
    method_params={
        'first_stage_model': dowhy.causal_estimators.linear_regression_estimator.LinearRegressionEstimator,
        'second_stage_model': dowhy.causal_estimators.linear_regression_estimator.LinearRegressionEstimator,
    })
print(causal_estimate)

For a full code example, you can check out the notebook on mediation effects with DoWhy: https://github.com/microsoft/dowhy/blob/master/docs/source/example_notebooks/dowhy_mediation_analysis.ipynb

sangyh commented 3 years ago

Hi Amit, thanks for the update and for implementing this. To clarify, this is the Baron and Kenny approach to mediation and not Pearl's approach? In that case, I presume I would need some tests for linearity.

amit-sharma commented 3 years ago

@sangyh Yes, the estimator implements the Baron and Kenny approach. However, the modeling and identification steps before it are done using Pearl's approach. So given a causal graph with mediation (and other confounders), DoWhy can find the right variables to include in the regression formula.

I also plan to add the non-parametric estimator based on Pearl's identification results. That should be implemented in the coming weeks. The linear case was the simplest to implement, so I started with that.
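For the linear case discussed above, the product-of-coefficients idea behind a Baron and Kenny style two-stage regression can be sketched with plain least squares. This is an illustration on synthetic data, not DoWhy's implementation; the helper ols, the coefficients 1.5, 0.5 and 2.0, and the assumption of no treatment-mediator interaction and no unobserved confounding are all choices made for the example.

```python
import numpy as np


def ols(features, target):
    """Least-squares coefficients for target ~ intercept + features."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef  # coef[0] is the intercept


# Synthetic linear data: T -> M -> Y plus a direct T -> Y edge.
rng = np.random.default_rng(0)
n = 10_000
t = rng.normal(size=n)
m = 1.5 * t + rng.normal(size=n)            # true T -> M coefficient: 1.5
y = 0.5 * t + 2.0 * m + rng.normal(size=n)  # true direct 0.5, M -> Y 2.0

# Stage 1: regress the mediator on the treatment.
a = ols(t, m)[1]

# Stage 2: regress the outcome on treatment and mediator together.
_, direct, b = ols(np.column_stack([t, m]), y)

nde = direct            # natural direct effect under linearity
nie = a * b             # natural indirect effect: product of coefficients
total = ols(t, y)[1]    # total effect from Y ~ T

print(f"NDE ~ {nde:.2f}, NIE ~ {nie:.2f}, total ~ {total:.2f}")
```

In this linear setting the total effect equals the sum of the direct and indirect parts, which is why tests for linearity matter before relying on this decomposition.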

sangyh commented 3 years ago

Thanks Amit. I realized I have a confounder causing both the mediator and the outcome, so I am afraid the BK approach will not work. I will try implementing Pearl's approach if you haven't already implemented it in DoWhy. In your comment to @samou1, what is 'each value of (BMI, G, A)' when all 3 are continuous variables?

amit-sharma commented 3 years ago

When all three are continuous variables, the sum over each value of (BMI, G, A) becomes an integral over the same variables, weighted by the probability density P(BMI, G, A). If the integration is numerically difficult, you can discretize the variables into reasonable buckets and then sum over the buckets.
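The discretization suggestion can be sketched as follows. This is a generic adjustment-style sum over buckets on synthetic data, not the specific formula from the earlier comment: a single continuous covariate z stands in for (BMI, G, A), the treatment is binary, and the function name stratified_effect and all coefficients are assumptions made for the example.

```python
import numpy as np


def stratified_effect(t, y, z, n_buckets=20):
    """Bucket a continuous covariate z and average the within-bucket
    treated-minus-control difference, weighted by each bucket's share of rows."""
    # Quantile edges give buckets with roughly equal numbers of rows.
    edges = np.quantile(z, np.linspace(0, 1, n_buckets + 1))
    bucket = np.digitize(z, edges[1:-1])  # indices 0 .. n_buckets - 1
    effect, total_weight = 0.0, 0.0
    for b in range(n_buckets):
        in_bucket = bucket == b
        treated = in_bucket & (t == 1)
        control = in_bucket & (t == 0)
        if treated.sum() == 0 or control.sum() == 0:
            continue  # skip buckets without both treated and control rows
        weight = in_bucket.mean()  # empirical P(z in bucket)
        effect += weight * (y[treated].mean() - y[control].mean())
        total_weight += weight
    return effect / total_weight


# Synthetic data: z confounds a binary treatment t and the outcome y.
rng = np.random.default_rng(0)
n = 20_000
z = rng.normal(size=n)
t = (rng.random(n) < 1.0 / (1.0 + np.exp(-z))).astype(int)
y = 2.0 * t + 3.0 * z + rng.normal(size=n)  # true effect of t on y is 2.0

naive = y[t == 1].mean() - y[t == 0].mean()
adjusted = stratified_effect(t, y, z)
print(f"naive ~ {naive:.2f}, adjusted ~ {adjusted:.2f}")
```

Finer buckets reduce the residual confounding inside each bucket but leave fewer rows per bucket, so the bucket count is a bias-variance trade-off. With several continuous variables the number of joint buckets grows quickly.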

Unfortunately, it may take a few weeks before the Pearlian non-parametric estimator is implemented. Do let me know how your implementation of this estimator goes, @sangyh.

rudi-mac commented 5 months ago

Hi all!

I am struggling to identify the correct estimand when using multiple mediators: Screenshot 2024-03-15 at 11 34 54

I am using Gender_Male as a treatment and Hourly_Salary as an outcome. And I am interested in the natural direct vs. natural indirect effects. When running: model.identify_effect(estimand_type="nonparametric-nde"), I only get the estimand for ONE mediator, which seems to be randomly selected: Screenshot 2024-03-15 at 11 34 04

Can someone explain this behavior? Can DoWhy not handle multiple mediators? Thank you very much in advance!