py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License

Problem interpreting 95.0% confidence interval in backdoor.linear_regression #326

Open juandavidgutier opened 2 years ago

juandavidgutier commented 2 years ago

Hello,

I am trying to estimate the effect of El Niño on the incidence of leishmaniasis. I used the method "backdoor.linear_regression" with test_significance=True and confidence_intervals=True. However, the reported confidence interval, [[1.02988048 2.0855936 ]], does not contain the mean value of the estimate (2.8158204337251664). I am confused by this, because I expected the confidence interval to include the mean value of the estimate.

Can anyone help me to understand what is happening?

I appreciate the cooperation

Here is my dataset: data.csv

And here is my code:

import os, warnings, random
import dowhy
import econml
from dowhy import CausalModel
import pandas as pd
import numpy as np

# El Nino vs Neutral
data_nino = pd.read_csv("data")
data_nino = data_nino.dropna()

data_leish_nino = data_nino.drop(['Codigo.DANE.periodo','Codigo.DANE', 'consensoENSO'], axis=1)
data_leish_nino.head()
data_leish_nino = data_leish_nino.astype({"TF_consenso":'bool'}, copy=False)

# colombia
colombia_nino = data_leish_nino

# Step 1: Modeling the causal mechanism
model_leish = CausalModel(
    data=colombia_nino,
    treatment=['TF_consenso'],
    outcome='incidencia100k',
    common_causes=['SST3.4'],
    effect_modifiers=['bosques'],
    frontdoor=['Temperature', 'Rainfall'],
    graph="digraph {SST3.4->TF_consenso;SST3.4->incidencia100k;SST3.4->Temperature;SST3.4->Rainfall;TF_consenso->Temperature;TF_consenso->Rainfall;TF_consenso->incidencia100k;Temperature->incidencia100k;Rainfall->incidencia100k;bosques->incidencia100k;}"
)

# view model
model_leish.view_model()

# Step 2: Identifying effects
identified_estimand = model_leish.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)

# Step 3: Estimation of the effect
# ate, significance and confidence interval
estimate_bd = model_leish.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression",
    test_significance=True,
    confidence_intervals=True)

print(estimate_bd)
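For reference: in an ordinary least-squares fit, the confidence interval of a coefficient is centred on that coefficient, so an interval that excludes the reported estimate suggests the two numbers may describe different quantities. The sketch below uses only NumPy on synthetic data. All variable names and numbers are illustrative, none are taken from the dataset in this issue, and this is not DoWhy's implementation. It fits an outcome on a treatment, a confounder, an effect modifier, and a treatment-by-modifier interaction, then compares the normal-approximation interval of the treatment coefficient with the average effect that also includes the interaction term. Whether something like this happens inside backdoor.linear_regression when effect_modifiers is set is an assumption, not something confirmed in this thread.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic data (illustrative only).
w = rng.normal(size=n)                           # confounder
m = rng.normal(loc=2.0, size=n)                  # effect modifier with mean 2
t = (w + rng.normal(size=n) > 0).astype(float)   # binary treatment, depends on w
y = 1.0 * t + 0.5 * t * m + 0.8 * w + 0.3 * m + rng.normal(size=n)

# OLS with an interaction term: columns are [1, t, w, m, t*m].
X = np.column_stack([np.ones(n), t, w, m, t * m])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classical OLS covariance of the coefficients.
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
cov = sigma2 * np.linalg.inv(X.T @ X)

# 95% normal-approximation interval for the treatment coefficient alone.
coef_t = beta[1]
se_t = np.sqrt(cov[1, 1])
ci_t = (coef_t - 1.96 * se_t, coef_t + 1.96 * se_t)

# Average effect once the interaction is averaged over the modifier.
ate = beta[1] + beta[4] * m.mean()

print("treatment coefficient:", coef_t)
print("CI of treatment coefficient:", ci_t)
print("average effect incl. interaction:", ate)
```

In this synthetic setup the interval always contains the treatment coefficient itself, but it does not contain the average effect, because the interaction term contributes to the latter. Comparing the reported estimate against the raw regression coefficients is one way to check whether a similar mismatch applies to a real fit.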

amit-sharma commented 2 years ago

This is odd. I can try to look at this, but it may take some time.

juandavidgutier commented 2 years ago

@amit-sharma Thanks for the cooperation

jmafoster1 commented 2 years ago

I just had a similar thing with my own data. If you use the get_confidence_intervals method of the CausalEstimate class with argument method="bootstrap", that might return more sensible values. It did for me.
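The idea behind a bootstrap interval is to resample the rows with replacement, re-estimate the effect on each resample, and read the interval off the percentiles of those re-estimates. The sketch below shows a percentile bootstrap for a simple difference in means on synthetic data, using NumPy only. `bootstrap_ci` and `diff_in_means` are hypothetical helpers written for illustration; they are not DoWhy's get_confidence_intervals, whose internals may differ.

```python
import numpy as np

def diff_in_means(t, y):
    """Naive effect estimate: mean outcome of treated minus untreated."""
    return y[t == 1].mean() - y[t == 0].mean()

def bootstrap_ci(t, y, estimator, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for estimator(t, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample rows with replacement
        stats[b] = estimator(t[idx], y[idx])
    alpha = (1.0 - level) / 2.0
    lo, hi = np.quantile(stats, [alpha, 1.0 - alpha])
    return lo, hi

# Synthetic data (illustrative only): true effect of 2.0.
rng = np.random.default_rng(1)
t = rng.integers(0, 2, size=2000)
y = 2.0 * t + rng.normal(size=2000)

point = diff_in_means(t, y)
lo, hi = bootstrap_ci(t, y, diff_in_means)
print("point estimate:", point)
print("95% bootstrap interval:", (lo, hi))
```

Because the same estimator produces both the point estimate and the resampled estimates, the interval is built around the quantity being reported. With very few resamples the percentiles become noisy, so the number of bootstrap replicates is worth keeping reasonably large.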

juandavidgutier commented 2 years ago

@jmafoster1 Great!!! Thanks for the tip.

juandavidgutier commented 2 years ago

@jmafoster1 I followed your advice, but with a new dataset I ran into the same problem: the interval (0.1192 - 0.2268) does not contain the mean value of the estimate (9.689e-17). Could the problem be caused by the very small mean value?

I am using this line of code to estimate the CI:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import PolynomialFeatures
from econml.inference import BootstrapInference

dml_estimate_soiltemp = model_leish.estimate_effect(
    identified_estimand_soiltemp,
    target_units="ate",
    test_significance=True,
    #confidence_intervals=True,
    method_name="backdoor.econml.dml.DML",
    method_params={
        'init_params': {'model_y': GradientBoostingRegressor(),
                        'model_t': GradientBoostingRegressor(),
                        'featurizer': PolynomialFeatures(degree=1, include_bias=True),
                        'model_final': LassoCV(fit_intercept=False),
                        'random_state': 123},
        'fit_params': {'inference': BootstrapInference(n_bootstrap_samples=25, n_jobs=-1)},
    })

# confidence interval with bootstrap soiltemp
ci_Colombia_boost_soiltemp = dml_estimate_soiltemp.get_confidence_intervals(
    method="bootstrap",
    confidence_level=0.95,
    num_simulations=10,
    sample_size_fraction=0.7)
print(ci_Colombia_boost_soiltemp)

jmafoster1 commented 2 years ago

I'm afraid I don't know how the confidence intervals code works, but it looks like you're using EconML as your estimator. I think they have their own methods to calculate confidence intervals. See https://microsoft.github.io/dowhy/example_notebooks/dowhy-conditional-treatment-effects.html#CATE-Object-and-Confidence-Intervals for details.