py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License

Why do propensity-based methods give nan for the estimate? #160

Closed gabeguo closed 4 years ago

gabeguo commented 4 years ago

This is the output I got:

```
*** Causal Estimate ***

## Identified estimand
Estimand type: nonparametric-ate

### Estimand : 1
Estimand name: backdoor
Estimand expression:
        d
────────────────(Expectation(Year_MS_diagnosed|BMI,Age,Fatigue,General_health,Bed_durationmins,Depression,DaytimeFunction))
d[MedianBedTime]
Estimand assumption 1, Unconfoundedness: If U→{MedianBedTime} and U→Year_MS_diagnosed then P(Year_MS_diagnosed|MedianBedTime,BMI,Age,Fatigue,General_health,Bed_durationmins,Depression,DaytimeFunction,U) = P(Year_MS_diagnosed|MedianBedTime,BMI,Age,Fatigue,General_health,Bed_durationmins,Depression,DaytimeFunction)

### Estimand : 2
Estimand name: iv
No such variable found!

## Realized estimand
b: Year_MS_diagnosed~MedianBedTime+BMI+Age+Fatigue+General_health+Bed_durationmins+Depression+DaytimeFunction
Target units: att

## Estimate
Mean value: nan

Causal Estimate is nan
```


Why am I getting 'nan' for the estimate, and why does it say 'No such variable found!' for the instrumental variable, when doing propensity score stratification?

amit-sharma commented 4 years ago

@gabeguo the message "No such variable found!" for the instrumental variable is standard behavior. It means that during the identification step, both types of identification (backdoor and IV) were tried, and no instrumental variable was found. During estimation, however, the backdoor estimand is used, since you chose propensity score stratification.

There may be a number of reasons why you are getting a NaN value for the causal estimate. A few ideas to debug:

  1. Have you checked that there are no null or non-numeric values in your DataFrame? (A quick check is sketched below.)
  2. Do you get the same estimate if you try "ate" as the target_units?
  3. Try removing a few features and see whether the problem persists.
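
A minimal sketch of the first two checks, reusing the `df`, `model`, and `identified_estimand` objects from the script that produced the NaN:

```python
# Sketch only: assumes df, model, and identified_estimand already exist.
print(df.isnull().sum())  # non-zero counts indicate missing values
print(df.dtypes)          # 'object' columns can hide non-numeric entries

# Re-run the same estimator with "ate" instead of "att" as the target units.
estimate_ate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.propensity_score_stratification",
    target_units="ate",
)
print("ATE:", estimate_ate.value)
```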

If the problem persists, can you provide a small dataset example on which you see the bug? You can share a synthetic dataset too. That can help us debug.

gabeguo commented 4 years ago

@amit-sharma I tried the propensity-based methods on a small synthetic dataset, and I still got NaN as the output. Why does it give NaN?

Here is the synthetic dataset:

```
a,b,c,d,e
True,2,3,10,15
False,5,7,-3,-6
True,8,2,3,5
False,-3,8,98,10
True,21,6,7,12
False,7,12,45,3
```

Here is the code:

```python
import numpy as np
import pandas as pd
import logging

import dowhy
from dowhy import CausalModel

df = pd.read_csv('fake_data.csv')
print(df)

model = CausalModel(
    data=df,
    treatment='a',
    outcome='c',
    graph="digraph {a->c; b->a; b->c; d->a; d->c; e->a; e->c;}"
)
model.view_model()

from IPython.display import Image, display
display(Image(filename="causal_model.png"))

identified_estimand = model.identify_effect()
print(identified_estimand)

"""
causal_estimate_reg = model.estimate_effect(identified_estimand,
        method_name="backdoor.linear_regression",
        test_significance=True)
print(causal_estimate_reg)
print("Causal Estimate is " + str(causal_estimate_reg.value))
"""

causal_estimate_strat = model.estimate_effect(identified_estimand,
        method_name="backdoor.propensity_score_stratification",
        target_units="ate")
print(causal_estimate_strat)
print("Causal Estimate is " + str(causal_estimate_strat.value))
```

Here is the terminal output:

```
       a   b   c   d   e
0   True   2   3  10  15
1  False   5   7  -3  -6
2   True   8   2   3   5
3  False  -3   8  98  10
4   True  21   6   7  12
5  False   7  12  45   3
INFO:dowhy.causal_graph:If this is observed data (not from a randomized experiment), there might always be missing confounders. Adding a node named "Unobserved Confounders" to reflect this.
INFO:dowhy.causal_model:Model to find the causal effect of treatment ['a'] on outcome ['c']
INFO:dowhy.causal_identifier:Common causes of treatment and outcome:['b', 'e', 'U', 'd']
WARNING:dowhy.causal_identifier:If this is observed data (not from a randomized experiment), there might always be missing confounders. Causal effect cannot be identified perfectly.
WARN: Do you want to continue by ignoring any unobserved confounders? (use proceed_when_unidentifiable=True to disable this prompt) [y/n] n
INFO:dowhy.causal_identifier:Instrumental variables for treatment and outcome:[]
Estimand type: nonparametric-ate

### Estimand : 1
Estimand name: backdoor
Estimand expression:
  d
────(Expectation(c|b,e,d))
d[a]
Estimand assumption 1, Unconfoundedness: If U→{a} and U→c then P(c|a,b,e,d,U) = P(c|a,b,e,d)

### Estimand : 2
Estimand name: iv
No such variable found!

INFO:dowhy.causal_estimator:INFO: Using Propensity Score Stratification Estimator
INFO:dowhy.causal_estimator:b: c~a+b+e+d
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/sklearn/utils/validation.py:724: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/dowhy/causal_estimators/propensity_score_stratification_estimator.py:78: RuntimeWarning: invalid value encountered in double_scalars
  est = (weighted_outcomes['effect'] * (weighted_outcomes[control_sum_name]+weighted_outcomes[treatment_sum_name])).sum() / total_population

*** Causal Estimate ***

## Identified estimand
Estimand type: nonparametric-ate

### Estimand : 1
Estimand name: backdoor
Estimand expression:
  d
────(Expectation(c|b,e,d))
d[a]
Estimand assumption 1, Unconfoundedness: If U→{a} and U→c then P(c|a,b,e,d,U) = P(c|a,b,e,d)

### Estimand : 2
Estimand name: iv
No such variable found!

## Realized estimand
b: c~a+b+e+d
Target units: ate

## Estimate
Mean value: nan

Causal Estimate is nan
```
amit-sharma commented 4 years ago

@gabeguo thank you for providing a reproducible example. I understand the problem now. The method outputs nan whenever there are not enough data points in each stratum for each treatment value.

Details: The propensity score stratification method divides the dataset into strata based on the value of the propensity score, and within each stratum there must be a minimum number of data points for treatment=1 and for treatment=0. Any stratum that does not have the minimum number of data points is removed.
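
For intuition, here is a minimal sketch of that logic (an illustration of the description above, not DoWhy's actual implementation; the `stratified_ate` helper and its column arguments are hypothetical):

```python
import numpy as np
import pandas as pd

def stratified_ate(df, ps_col, treat_col, outcome_col,
                   num_strata=50, clipping_threshold=10):
    # Divide the data into strata by quantiles of the propensity score.
    strata = pd.qcut(df[ps_col], num_strata, labels=False, duplicates='drop')
    effects, sizes = [], []
    for _, s in df.groupby(strata):
        treated = s[s[treat_col] == 1]
        control = s[s[treat_col] == 0]
        # Drop strata that lack enough points for either treatment value.
        if len(treated) < clipping_threshold or len(control) < clipping_threshold:
            continue
        effects.append(treated[outcome_col].mean() - control[outcome_col].mean())
        sizes.append(len(s))
    # If every stratum was dropped, there is nothing to average over;
    # that empty average is where the nan comes from.
    if not sizes:
        return float('nan')
    return np.average(effects, weights=sizes)
```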

Based on the dataset, we need to change the default parameters num_strata (the number of strata to divide the dataset into) and clipping_threshold (the minimum number of data points per stratum per treatment value). To obtain a valid causal estimate, each stratum must contain at least clipping_threshold data points for each value of the treatment.

For example, on your fake_data, try setting clipping_threshold=0 and num_strata=1:

```python
causal_estimate_strat = model.estimate_effect(identified_estimand,
        method_name="backdoor.propensity_score_stratification",
        target_units="ate",
        method_params={'clipping_threshold': 0, 'num_strata': 1})
```

Still, for this tiny dataset, you will notice the above change does not work. While it leaves at least one stratum with more than clipping_threshold data points, each stratum contains data points with either treatment=0 or treatment=1 only (a heavily confounded treatment assignment). In such a case, it is not possible to calculate the effect. You can see this by inspecting the propensity scores, as sketched below.
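
One way to inspect the scores on the fake data (a sketch; the logistic regression here stands in for the propensity model that is fitted internally):

```python
from sklearn.linear_model import LogisticRegression

X = df[['b', 'd', 'e']]   # the identified common causes
t = df['a'].astype(int)   # treatment as 0/1
ps = LogisticRegression(solver='lbfgs').fit(X, t).predict_proba(X)[:, 1]
print(ps)  # scores pinned near 0 or 1 mean strata split cleanly by treatment
```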

So let's change the dataset slightly, so that the propensity scores are not as extreme and each stratum has both treatment=1 and treatment=0 data points:

```
True,2,3,10,15
False,2,2,10,15
True,8,2,3,5
False,8,3,3,6
True,21,6,7,12
False,20,4,6,11
```

Now you will see that your script works with this fake data.

For your real dataset, I suggest decreasing the value of num_strata (default=50) and seeing if that works, as in the sketch below. I won't suggest changing clipping_threshold, since its default is already low (10).
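
For example, keeping the rest of the call from your original script (num_strata=10 is only an illustrative value; tune it to your dataset's size):

```python
causal_estimate_strat = model.estimate_effect(identified_estimand,
        method_name="backdoor.propensity_score_stratification",
        target_units="att",
        method_params={'num_strata': 10})
print("Causal Estimate is " + str(causal_estimate_strat.value))
```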

To help future users, I have added more meaningful error messages for this method. You can pull the latest code from master to see them. Instead of nan, the method now outputs: