py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License
7.15k stars 935 forks

The result of Mediation analysis with DoWhy #403

Open Jlujiaotong opened 2 years ago

Jlujiaotong commented 2 years ago

Recently, I have been using DoWhy for some research.

When I try to estimate the direct and indirect effects between variables using mediation analysis, I run into a problem.

I get different results when I run my code twice on the same data set.

import pandas as pd
import dowhy
from dowhy import CausalModel
import dowhy.datasets
import econml
import warnings
import dowhy.causal_estimators.linear_regression_estimator

warnings.filterwarnings('ignore')


def causal_estimate(treatment, data, outcome, G, path):
    # Create (or truncate) the output file for this treatment/outcome pair
    f = open(path + treatment + outcome + ".txt", "w+")
    f.close()

    # Build the causal model
    model = CausalModel(data=data,
                        treatment=treatment,
                        outcome=outcome,
                        graph=G.replace("\n", " "),
                        missing_nodes_as_confounders=True)
    with open(path + treatment + outcome + ".txt", "a") as f:
        print("####### Model ##############################################################", file=f)
        print("Common Causes:", model._common_causes, file=f)
        print("Effect Modifiers:", model._effect_modifiers, file=f)
        print("Instruments:", model._instruments, file=f)
        print("Outcome:", model._outcome, file=f)
        print("Treatment:", model._treatment, file=f)
        print("############################################################################", file=f)

    # Identify the causal effect
    # Natural direct effect (nde)
    identified_estimand_nde = model.identify_effect(estimand_type="nonparametric-nde",
                                                    proceed_when_unidentifiable=True)
    # Natural indirect effect (nie)
    identified_estimand_nie = model.identify_effect(estimand_type="nonparametric-nie",
                                                    proceed_when_unidentifiable=True)

    with open(path + treatment + outcome + ".txt", "a") as f:
        print("####### Natural direct effect ################################################", file=f)
        print(identified_estimand_nde, file=f)
        print("############################################################################", file=f)
    with open(path + treatment + outcome + ".txt", "a") as f:
        print("####### Natural indirect effect ################################################", file=f)
        print(identified_estimand_nie, file=f)
        print("############################################################################", file=f)

    # Estimate the natural direct effect with two-stage linear regression
    causal_estimate_nde = model.estimate_effect(
        identified_estimand_nde,
        method_name="mediation.two_stage_regression",
        confidence_intervals=False,
        test_significance=False,
        method_params={
            'first_stage_model': dowhy.causal_estimators.linear_regression_estimator.LinearRegressionEstimator,
            'second_stage_model': dowhy.causal_estimators.linear_regression_estimator.LinearRegressionEstimator
        })
    with open(path + treatment + outcome + ".txt", "a") as f:
        # Linear Results
        print("####### Natural direct effect #################################################", file=f)
        print("*** Class Name ***", file=f)
        print(causal_estimate_nde.params['estimator_class'], file=f)
        print("*** Treatment Name ***", file=f)
        print(model._treatment, file=f)
        print(causal_estimate_nde, file=f)
        print("############################################################################", file=f)

    # Estimate the natural indirect effect with two-stage linear regression
    causal_estimate_nie = model.estimate_effect(
        identified_estimand_nie,
        method_name="mediation.two_stage_regression",
        confidence_intervals=False,
        test_significance=False,
        method_params={
            'first_stage_model': dowhy.causal_estimators.linear_regression_estimator.LinearRegressionEstimator,
            'second_stage_model': dowhy.causal_estimators.linear_regression_estimator.LinearRegressionEstimator
        })
    with open(path + treatment + outcome + ".txt", "a") as f:
        # Linear Results
        print("####### Natural indirect effect #################################################", file=f)
        print("*** Class Name ***", file=f)
        print(causal_estimate_nie.params['estimator_class'], file=f)
        print("*** Treatment Name ***", file=f)
        print(model._treatment, file=f)
        print(causal_estimate_nie, file=f)
        print("############################################################################", file=f)

path_1 = 'F:/数据+疫情/各个国家政策对出行的影响(图)/新尝试/'

# Grouping
df = pd.read_excel('sample_test_7.xlsx', index_col='date')
# Backward-fill missing values. The call as originally posted, df.fillna('bfill'),
# would replace NaN with the literal string 'bfill' instead of backfilling.
df = df.bfill()
df = df.sort_index()

columns = ['RAR', 'R', 'TS', 'W', 'ITC', 'CPE', 'RG', 'SC', 'SHR', 'WC', 'PD', 'GPC', 'T', 'CR', 'C', 'TP', 'DR', 'D']
df.columns = columns

# Graph for the deaths-rate outcome (DR)
G = """digraph {
    RAR[label="Retail and recreation"]; R[label="Residential"]; TS[label="Transit stations"]; W[label="workplaces"];
    ITC[label="ITC"]; CPE[label="Close public transport"]; RG[label="Restriction gatherings"]; SC[label="School closures"];
    SHR[label="Stay Home Requirements"]; WC[label="Workplace Closures"]; PD[label="Population density"]; GPC[label="Gdp per capita"];
    T[label="time"]; TP[label="Testing policy"]; DR[label="Deaths rate"]; D[label="Deaths"];
    T -> D;
    D -> RG; D -> SC; D -> SHR; D -> WC; D -> CPE; D -> ITC; D -> R; D -> RAR; D -> TS; D -> W; D -> DR;
    PD -> RG; PD -> SC; PD -> SHR; PD -> WC; PD -> CPE; PD -> ITC;
    GPC -> RG; GPC -> SC; GPC -> SHR; GPC -> WC; GPC -> CPE; GPC -> ITC;
    RG -> R; RG -> RAR; RG -> TS; RG -> W; RG -> DR;
    SC -> R; SC -> RAR; SC -> TS; SC -> W; SC -> DR;
    SHR -> R; SHR -> RAR; SHR -> TS; SHR -> W; SHR -> DR;
    WC -> R; WC -> RAR; WC -> TS; WC -> W; WC -> DR;
    CPE -> R; CPE -> RAR; CPE -> TS; CPE -> W; CPE -> DR;
    ITC -> R; ITC -> RAR; ITC -> TS; ITC -> W; ITC -> DR;
    R -> DR; RAR -> DR; TS -> DR; W -> DR; TP -> DR}"""

# Graph for the cases-rate outcome (CR)
G_1 = """digraph {
    RAR[label="Retail and recreation"]; R[label="Residential"]; TS[label="Transit stations"]; W[label="workplaces"];
    ITC[label="ITC"]; CPE[label="Close public transport"]; RG[label="Restriction gatherings"]; SC[label="School closures"];
    SHR[label="Stay Home Requirements"]; WC[label="Workplace Closures"]; PD[label="Population density"]; GPC[label="Gdp per capita"];
    T[label="time"]; TP[label="Testing policy"]; CR[label="Cases rate"]; C[label="Cases"];
    T -> C;
    C -> RG; C -> SC; C -> SHR; C -> WC; C -> CPE; C -> ITC; C -> R; C -> RAR; C -> TS; C -> W; C -> CR;
    PD -> RG; PD -> SC; PD -> SHR; PD -> WC; PD -> CPE; PD -> ITC;
    GPC -> RG; GPC -> SC; GPC -> SHR; GPC -> WC; GPC -> CPE; GPC -> ITC;
    RG -> R; RG -> RAR; RG -> TS; RG -> W; RG -> CR;
    SC -> R; SC -> RAR; SC -> TS; SC -> W; SC -> CR;
    SHR -> R; SHR -> RAR; SHR -> TS; SHR -> W; SHR -> CR;
    WC -> R; WC -> RAR; WC -> TS; WC -> W; WC -> CR;
    CPE -> R; CPE -> RAR; CPE -> TS; CPE -> W; CPE -> CR;
    ITC -> R; ITC -> RAR; ITC -> TS; ITC -> W; ITC -> CR;
    R -> CR; RAR -> CR; TS -> CR; W -> CR; TP -> CR}"""

policies = ['ITC', 'CPE', 'RG', 'SC', 'SHR', 'WC']
for policy in policies:
    causal_estimate(policy, df, 'DR', G, path_1)
    causal_estimate(policy, df, 'CR', G_1, path_1)

amit-sharma commented 2 years ago

This might be due to a random seed issue. Can you share a small reproducible code example with your dataset? If the dataset is confidential, you can also share some simulated data so that I can reproduce the problem on my computer.
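As a rough illustration of what shareable simulated data could look like, here is a minimal sketch that draws a small mediation-style data set from a seeded generator, so anyone running it gets the same rows. The function name `simulate_mediation_data`, the variable names `x`, `z`, `y`, and the coefficients are all hypothetical and not taken from the original data set. Whether the run-to-run differences in this issue actually come from random number generation is an assumption that such a script would help confirm or rule out.

```python
import random


def simulate_mediation_data(n_rows, seed):
    """Simulate a linear chain x -> z -> y plus a direct x -> y edge.

    The coefficients (0.5, 0.8, 0.3) are arbitrary illustration values.
    """
    rng = random.Random(seed)  # private generator, independent of global state
    rows = []
    for _ in range(n_rows):
        x = rng.gauss(0.0, 1.0)
        z = 0.5 * x + rng.gauss(0.0, 1.0)
        y = 0.8 * z + 0.3 * x + rng.gauss(0.0, 1.0)
        rows.append({"x": x, "z": z, "y": y})
    return rows


first_run = simulate_mediation_data(n_rows=200, seed=0)
second_run = simulate_mediation_data(n_rows=200, seed=0)
other_seed = simulate_mediation_data(n_rows=200, seed=1)

print("same seed, identical data:", first_run == second_run)
print("different seed, identical data:", first_run == other_seed)
```

If the estimates still differ between runs on identical, seeded input, the variation is more likely to come from the estimation code than from the data.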

adityalahiri commented 2 years ago

Is there some way for me to set custom values of treatment_value and control_value for mediation analysis? I am trying the sample notebook but I want different values than 0 and 1. Thank you!

Jlujiaotong commented 2 years ago

Thank you for your interest in this issue.

This is a sample data set for my code.

I am now trying to implement the mediation analysis as a two-step calculation: first estimate the effect of x on z, then estimate the effect of z on y, and finally multiply the two estimates.

Is this the correct approach?
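For reference, the two-step idea described above can be sketched as the classic product-of-coefficients calculation. This is a minimal sketch on simulated data, assuming linear relationships, no treatment-mediator interaction, and no unmeasured confounding; the helper names and the coefficients are hypothetical. Note that the second step regresses y on both z and x, so that the z coefficient is not contaminated by the direct x -> y path. It is not a statement of what DoWhy's `mediation.two_stage_regression` does internally.

```python
import random


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Population covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)


def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    return cov(x, y) / cov(x, x)


def slopes_two_regressors(x1, x2, y):
    """OLS slopes of y on x1 and x2 (with intercept), via the 2x2 normal equations."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


# Simulated data with hypothetical true effects:
# x -> z is 0.5, z -> y is 0.8, direct x -> y is 0.3.
rng = random.Random(0)
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
z = [0.5 * xi + rng.gauss(0, 1) for xi in x]
y = [0.8 * zi + 0.3 * xi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

a = slope(x, z)                             # step 1: effect of x on z
b, direct = slopes_two_regressors(z, x, y)  # step 2: effect of z on y, holding x fixed
indirect = a * b                            # product of coefficients
total = slope(x, y)                         # total effect of x on y

print("indirect:", indirect)
print("direct:", direct)
print("total:", total)
```

With ordinary least squares, the total effect equals the direct effect plus the product `a * b`, which gives a quick sanity check on the decomposition. Outside the linear setting this identity does not hold in general.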


Jlujiaotong commented 2 years ago

Thank you for your interest in this issue.

This is a sample data set for my code.
