py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License

Default Significance Level in Refutation Tests #1004

Closed · kangqiao-ctrl closed this 1 year ago

kangqiao-ctrl commented 1 year ago

Ask your question

#809 fixed a previous typo in the significance p-value. However, the current default value of 0.95 seems to need clarification.

Expected behavior: Based on the comments on line 202, it appears that the default value for the significance level should be 0.05.

Version information:

Additional context: Related issues and comments:

#929 (the interpretations by @drawlinson in that question are really straightforward)

#879

#809

The bootstrap test calculates a two-sided percentile p-value as follows:

import numpy as np
from typing import List

def perform_bootstrap_test(estimate, simulations: List):
    # This calculates a two-sided percentile p-value
    # See footnotes in https://journals.sagepub.com/doi/full/10.1177/2515245920911881
    half_p_value = np.mean([(x > estimate.value) + 0.5 * (x == estimate.value) for x in simulations])
    return 2 * min(half_p_value, 1 - half_p_value)

In other words, the p-value is twice the smaller tail frequency: twice the minimum of the fraction of simulated values above the estimate and the fraction below it (with ties counted as half).
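For concreteness, here is a quick sanity check of that formula, reusing perform_bootstrap_test as defined above with a minimal stand-in for the estimate object (the SimpleEstimate class is hypothetical; in DoWhy the estimate comes out of the estimation step):

class SimpleEstimate:
    # Hypothetical stand-in: only the .value attribute is used by the test
    def __init__(self, value):
        self.value = value

estimate = SimpleEstimate(0.0)

# 50/50 split: simulated values fall evenly on both sides of the estimate
print(perform_bootstrap_test(estimate, [-1.0] * 50 + [1.0] * 50))  # 1.0
# 70/30 split
print(perform_bootstrap_test(estimate, [-1.0] * 70 + [1.0] * 30))  # 0.6
# 98/2 split: the estimate sits far out in the tail of the simulations
print(perform_bootstrap_test(estimate, [-1.0] * 98 + [1.0] * 2))   # 0.04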

If my understanding is correct, a p-value of 0.95 implies a minor (negligible) shift, similar to the 50/50 scenario in #929, and a p-value of 0.05 implies a significant shift, similar to the 98/2 scenario in #929.

Thus, if the intention is to enforce a stringent test, then setting the current significance level at 0.95 (or even 0.85) could be justified. In that case, seeing 30% extreme cases (a p-value of 0.6) would be deemed "lower than the significance level" and the refutation result would be marked "significant", implying we are not tolerating such a shift. If this is the intent, we could simply add this rationale to the comments.

However, if the aim is not to impose such a strict test, it might be more appropriate to revert the default significance level to the conventional 0.05 (which would also align with the suggestion in the code comments). In that scenario, a significant refutation result would indicate that the test outcome is very different from the original estimate, and a shift like the 70/30 scenario would be considered acceptable (insignificant), just like a 0.95 p-value (the 50/50 case).
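To make the contrast between the two readings concrete, here is a sketch of the decision rule being debated, assuming the refuter flags a result as significant when the p-value falls below significance_level (the helper name is mine, not DoWhy's API):

def is_refutation_significant(p_value, significance_level):
    # Hypothetical helper sketching the rule under discussion: the
    # refutation is flagged significant when p falls below the threshold
    return p_value < significance_level

for p_value in (1.0, 0.6, 0.04):  # the 50/50, 70/30 and 98/2 splits above
    print(p_value,
          is_refutation_significant(p_value, 0.95),  # stringent reading
          is_refutation_significant(p_value, 0.05))  # conventional reading

# 1.0  -> insignificant under both thresholds
# 0.6  -> significant under 0.95, insignificant under 0.05
# 0.04 -> significant under both thresholds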

I would appreciate any clarification on this. Thanks!

amit-sharma commented 1 year ago

Ah yes, you are correct! Thanks for raising this. Somehow this bug slipped through; the earlier version had 0.05.

Fixing it now and will release a patch version soon.

kangqiao-ctrl commented 1 year ago

Thanks!