Closed gabeguo closed 4 years ago
@gabeguo the message "No such variable found" for the instrumental variable is standard behavior. It means that during the identification step, both types of identification were tried (backdoor and IV), and no instrumental variable was found. During estimation, however, the backdoor identification is used since you requested propensity score stratification.
There may be a number of reasons why you are getting a NaN value for the causal estimate. A few ideas to debug,
If the problem persists, can you provide a small dataset example on which you see the bug? You can share a synthetic dataset too. That can help us debug.
@amit-sharma I tried using the propensity-based methods for a small synthetic dataset, and I still got NaN as the output. Why does it give NaN?
Here is the synthetic dataset:
a,b,c,d,e
True,2,3,10,15
False,5,7,-3,-6
True,8,2,3,5
False,-3,8,98,10
True,21,6,7,12
False,7,12,45,3
Here is the code:
import numpy as np
import pandas as pd
import logging

import dowhy
from dowhy import CausalModel

df = pd.read_csv('fake_data.csv')

print(df)

model = CausalModel(
    data=df,
    treatment='a',
    outcome='c',
    graph="digraph {a->c; b->a; b->c; d->a; d->c; e->a; e->c;}"
)
model.view_model()

from IPython.display import Image, display
display(Image(filename="causal_model.png"))

identified_estimand = model.identify_effect()
print(identified_estimand)

"""
causal_estimate_reg = model.estimate_effect(identified_estimand,
    method_name="backdoor.linear_regression",
    test_significance=True)
print(causal_estimate_reg)
print("Causal Estimate is " + str(causal_estimate_reg.value))
"""

causal_estimate_strat = model.estimate_effect(identified_estimand,
    method_name="backdoor.propensity_score_stratification",
    target_units="ate")
print(causal_estimate_strat)
print("Causal Estimate is " + str(causal_estimate_strat.value))
Here is the terminal output:
       a   b   c   d   e
0   True   2   3  10  15
1  False   5   7  -3  -6
2   True   8   2   3   5
3  False  -3   8  98  10
4   True  21   6   7  12
5  False   7  12  45   3
INFO:dowhy.causal_graph:If this is observed data (not from a randomized experiment), there might always be missing confounders. Adding a node named "Unobserved Confounders" to reflect this.
INFO:dowhy.causal_model:Model to find the causal effect of treatment ['a'] on outcome ['c']
@gabeguo thank you for providing a reproducible example. I understand the problem now.
The method outputs nan whenever there are not enough data points in each stratum for each treatment value.
Details: The propensity score stratification method divides the dataset into strata based on the value of the propensity score, and within each stratum there should be a minimum number of data points for treatment=1 and treatment=0. It removes any stratum that does not have the minimum number of data points.
Depending on the dataset, you may need to change the default parameters num_strata (the number of strata to divide the dataset into) and clipping_threshold (the minimum number of data points per stratum per treatment value). To obtain a valid causal estimate, each stratum must contain at least clipping_threshold data points for each value of the treatment.
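The interaction between the two parameters can be sketched in a few lines of plain Python. This is only an illustration of the idea described above, not DoWhy's actual implementation: the function name `stratified_effect`, the rank-based way of cutting strata, and the toy scores and outcomes are all assumptions made for this example.

```python
import math

def stratified_effect(scores, treated, outcomes, num_strata, clipping_threshold):
    """Hypothetical sketch of propensity score stratification (not DoWhy's code)."""
    n = len(scores)
    # Rank units by propensity score and cut the ranking into equal-sized strata.
    order = sorted(range(n), key=lambda i: scores[i])
    strata = {}
    for rank, i in enumerate(order):
        strata.setdefault(rank * num_strata // n, []).append(i)

    weighted_sum, kept_units = 0.0, 0
    for members in strata.values():
        t = [outcomes[i] for i in members if treated[i]]
        c = [outcomes[i] for i in members if not treated[i]]
        # Drop strata lacking enough treated or control units. At least one
        # of each is always needed to form a difference in means.
        if min(len(t), len(c)) < max(clipping_threshold, 1):
            continue
        diff = sum(t) / len(t) - sum(c) / len(c)
        weighted_sum += diff * len(members)
        kept_units += len(members)

    # With no usable stratum there is nothing to average, hence NaN.
    return weighted_sum / kept_units if kept_units else math.nan

# Made-up propensity scores, treatment flags and outcomes for eight units.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
treated = [False, False, True, False, True, True, False, True]
outcomes = [1, 2, 4, 2, 5, 6, 3, 7]

print(stratified_effect(scores, treated, outcomes, num_strata=2, clipping_threshold=1))
print(stratified_effect(scores, treated, outcomes, num_strata=8, clipping_threshold=1))  # nan
```

With two strata, both contain treated and control units, so an estimate comes out. With eight strata, every stratum holds a single unit, all strata are dropped, and the result is nan, which mirrors the behavior reported in this thread.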
Example: In your fake_data example, try setting clipping_threshold=0 and num_strata=1.
causal_estimate_strat = model.estimate_effect(identified_estimand,
    method_name="backdoor.propensity_score_stratification",
    target_units="ate",
    method_params={'clipping_threshold': 0, 'num_strata': 1})
Still, for this tiny dataset, you will notice that the above change does not work. While it leads to at least one stratum having more than clipping_threshold data points, each stratum has data points with either treatment=0 or treatment=1 only (the treatment assignment is heavily confounded). In such a case, it is not possible to calculate the effect.
So let's change the dataset slightly so that the propensity scores are not that extreme and each stratum has both treatment=1 and treatment=0.
True,2,3,10,15
False,2,2,10,15
True,8,2,3,5
False,8,3,3,6
True,21,6,7,12
False,20,4,6,11
Now you will see that your script works with this fake data.
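One rough way to see why the modified data behaves better is to compare the covariate means of the treated and control groups in both versions. The snippet below is a sketch using only the standard library. The helper name `covariate_gaps` is made up for this example, and it assumes the modified rows use the same a,b,c,d,e header as the original file.

```python
import csv
import io

ORIGINAL = """a,b,c,d,e
True,2,3,10,15
False,5,7,-3,-6
True,8,2,3,5
False,-3,8,98,10
True,21,6,7,12
False,7,12,45,3
"""

# Assumption: the modified rows share the original header.
MODIFIED = """a,b,c,d,e
True,2,3,10,15
False,2,2,10,15
True,8,2,3,5
False,8,3,3,6
True,21,6,7,12
False,20,4,6,11
"""

def covariate_gaps(csv_text, treatment="a", covariates=("b", "d", "e")):
    """Return the treated-minus-control mean for each covariate."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    gaps = {}
    for name in covariates:
        t = [float(r[name]) for r in rows if r[treatment] == "True"]
        c = [float(r[name]) for r in rows if r[treatment] == "False"]
        gaps[name] = sum(t) / len(t) - sum(c) / len(c)
    return gaps

original_gaps = covariate_gaps(ORIGINAL)
modified_gaps = covariate_gaps(MODIFIED)
print("original:", original_gaps)
print("modified:", modified_gaps)
```

In the original data the two groups differ sharply on the covariates (for example, the mean of d differs by 40), so a propensity model can separate them almost perfectly. In the modified data the gaps are all below 1, so treated and control rows end up with similar propensity scores and can share a stratum.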
For your real dataset, I suggest decreasing the value of num_strata (default=50) and seeing if it works. I won't suggest changing clipping_threshold since the default is already low (10).
To help future use, I have added more meaningful error messages for this method. You can pull the latest code from master to see those error messages. Instead of nan, the method now outputs:
This is the output I got:
` Causal Estimate
Identified estimand
Estimand type: nonparametric-ate
Estimand : 1
Estimand name: backdoor
Estimand expression:
       d
────────────────(Expectation(Year_MS_diagnosed|BMI,Age,Fatigue,General_health,Bed_durationmins,Depression,DaytimeFunction))
d[MedianBedTime]
Estimand assumption 1, Unconfoundedness: If U→{MedianBedTime} and U→Year_MS_diagnosed then P(Year_MS_diagnosed|MedianBedTime,BMI,Age,Fatigue,General_health,Bed_durationmins,Depression,DaytimeFunction,U) = P(Year_MS_diagnosed|MedianBedTime,BMI,Age,Fatigue,General_health,Bed_durationmins,Depression,DaytimeFunction)
Estimand : 2
Estimand name: iv
No such variable found!
Realized estimand
b: Year_MS_diagnosed~MedianBedTime+BMI+Age+Fatigue+General_health+Bed_durationmins+Depression+DaytimeFunction
Target units: att
Estimate
Mean value: nan
Causal Estimate is nan `
Why am I getting 'nan' for the estimate, and why does it say 'No such variable found!' for the instrumental variable, when doing propensity score stratification?