py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License

Bootstrap-based results are ambiguous #879

Open Klesel opened 1 year ago

Klesel commented 1 year ago

The output is ambiguous when bootstrap samples are used to test the estimates. Here is what the current output looks like: (image)

Here is a reproducible example:

import numpy as np
import dowhy.datasets
from dowhy import CausalModel

# Simulate a linear dataset with a known true effect (beta=10).
data = dowhy.datasets.linear_dataset(
    beta=10,
    num_common_causes=5,
    num_instruments=2,
    num_effect_modifiers=1,
    num_samples=5000,
    treatment_is_binary=True,
    stddev_treatment_noise=10,
    num_discrete_common_causes=1,
)
df = data["df"]

# Build the causal model from the generated data and graph.
model = CausalModel(
    data=df,
    treatment=data["treatment_name"],
    outcome=data["outcome_name"],
    graph=data["gml_graph"],
)

identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)

# Estimate the effect and request a bootstrap-based significance test.
causal_estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression",
    test_significance="bootstrap",
    confidence_intervals=True,
)
print(causal_estimate)

I raised another issue related to documentation: https://github.com/py-why/dowhy/issues/816. If you prefer, we can merge both issues.

drawlinson commented 1 year ago

@Klesel As I understand it, a p-value range is produced when the test statistic is more extreme than all the samples in the null distribution. I think the intent is to communicate that the p-value was not estimated to be equal to some value, but is less than or greater than a range of values.

This happens because the implemented bootstrap technique cannot estimate the p-value in that case; we can only say that the p-value lies within a range, either [0, n] or [n, 1] (either could occur).
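To illustrate the idea, here is a minimal sketch of how an empirical p-value computed from a finite null distribution can turn into a range. This is not DoWhy's code: the function name `empirical_p_value_bounds`, the two-sided rule, and the 1/n bound are my own assumptions for illustration, and the sketch only covers the lower range (the p-value lying between zero and some bound).

```python
from bisect import bisect_left
import random


def empirical_p_value_bounds(null_estimates, observed):
    """Return (lower, upper) bounds on a two-sided empirical p-value.

    Hypothetical helper for illustration only; not part of DoWhy.
    """
    null_sorted = sorted(null_estimates)
    n = len(null_sorted)
    # Number of null estimates strictly below the observed estimate.
    rank = bisect_left(null_sorted, observed)
    if rank == 0 or rank == n:
        # The observed estimate is at or below the smallest null sample,
        # or above the largest one. With n samples the p-value cannot be
        # resolved more finely than 1/n, so only a range can be reported.
        return (0.0, 1.0 / n)
    # Otherwise the p-value is a point estimate: twice the smaller tail.
    tail = min(rank, n - rank) / n
    p_value = min(1.0, 2 * tail)
    return (p_value, p_value)


# Usage: a simulated null distribution centred on zero (assumed data).
rng = random.Random(0)
null_samples = [rng.gauss(0.0, 1.0) for _ in range(1000)]
print(empirical_p_value_bounds(null_samples, 10.0))  # far outside: a range
print(empirical_p_value_bounds(null_samples, 0.5))   # inside: a point value
```

When the two bounds are equal the p-value was estimated as a single number; when they differ, the observed estimate fell outside every null sample and only the range is known.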

drawlinson commented 1 year ago

@Klesel I added my interpretation of the bootstrap significance test to this issue: https://github.com/py-why/dowhy/issues/929 It is based on reverse-engineering the code.