Bootstrap-based results are ambiguous

Klesel commented 1 year ago

Using bootstrap samples to test the estimates gives ambiguous results. Here is what the current output looks like:

Why is there a list for the p-values? A p-value is commonly expected to be a single value.
It's ambiguous what the null hypothesis is (i.e., what is being tested). Do we expect the estimate to be within or outside the CI?
It would be great to print the result in the output (e.g., "Result: significant effect" or "Result: non-significant effect").

Here is a reproducible example:

import numpy as np
from dowhy import CausalModel
import dowhy.datasets

data = dowhy.datasets.linear_dataset(
    beta=10,
    num_common_causes=5,
    num_instruments=2,
    num_effect_modifiers=1,
    num_samples=5000,
    treatment_is_binary=True,
    stddev_treatment_noise=10,
    num_discrete_common_causes=1,
)
df = data["df"]

model = CausalModel(
    data=df,
    treatment=data["treatment_name"],
    outcome=data["outcome_name"],
    graph=data["gml_graph"],
)

identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)

causal_estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression",
    test_significance="bootstrap",
    confidence_intervals=True,
)
print(causal_estimate)

I raised another issue related to documentation: https://github.com/py-why/dowhy/issues/816. If you prefer, we can merge both issues.

drawlinson commented 1 year ago

@Klesel As I understand it, a p-value range is produced when the test statistic is more extreme than all the samples in the null distribution. I think the intent is to communicate that the p-value was not estimated to be equal to some value, but is less than or greater than a range of values.

It occurs because the p-value can't be estimated using the implemented bootstrap technique; instead we can only say the p-value lies within a range from [zero to n], or alternatively from [n to 1] (either could occur).
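To make that concrete, here is a minimal sketch of how a range rather than a single value can come out of an empirical null distribution. It is not DoWhy's code: the helper name `empirical_p_value`, the one-sided tail rule, and the choice of `1 / n_samples` as the upper bound are all my assumptions for illustration.

```python
import numpy as np


def empirical_p_value(null_estimates, observed):
    """Hypothetical helper (not part of DoWhy).

    Returns a float when the observed estimate falls inside the simulated
    null distribution, and a (low, high) tuple when it is more extreme than
    every simulated value, because then the p-value can only be bounded.
    """
    null = np.asarray(null_estimates, dtype=float)
    n_samples = null.size

    # More extreme than all null samples: the tail fraction would be 0,
    # so all we can say is that p lies below the resolution 1 / n_samples.
    if observed > null.max() or observed < null.min():
        return (0.0, 1.0 / n_samples)

    # Otherwise report the fraction of null samples at least as extreme,
    # using the tail on the same side of the median as the observation.
    if observed >= np.median(null):
        return float(np.mean(null >= observed))
    return float(np.mean(null <= observed))


null_samples = np.arange(100)  # stand-in for 100 simulated null estimates
inside = empirical_p_value(null_samples, 90)    # a single p-value
outside = empirical_p_value(null_samples, 150)  # a (low, high) range
```

With only 100 null samples, the smallest tail probability that can be resolved is 1/100, so an estimate beyond every sample is reported as the interval from 0 to 0.01 rather than as a point value. Increasing the number of simulations narrows that interval.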

drawlinson commented 1 year ago

@Klesel I added my interpretation of the bootstrap significance test to this issue: https://github.com/py-why/dowhy/issues/929. It is based on reverse-engineering the code.

Bootstrap-based results are ambiguous #879