DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
#809 fixed a previous typo in the significance p-value. However, the current default value of 0.95 seems to need clarification.
Expected behavior
Based on the comments on line 202, it appears that the default value for the significance level should be 0.05.
Version information:
DoWhy version 0.10
Additional context
Related issues and comments:
#929 The interpretations by @drawlinson in that issue are really straightforward.
#879
#809
The bootstrap test calculates a two-sided percentile p-value as follows:
from typing import List

import numpy as np

def perform_bootstrap_test(estimate, simulations: List):
    # This calculates a two-sided percentile p-value
    # See footnotes in https://journals.sagepub.com/doi/full/10.1177/2515245920911881
    half_p_value = np.mean([(x > estimate.value) + 0.5 * (x == estimate.value) for x in simulations])
    return 2 * min(half_p_value, 1 - half_p_value)
The p-value is thus twice the smaller tail frequency: the proportion of simulated values that fall beyond the original estimate (counting ties as half).
If my understanding is correct, a p-value of 0.95 implies a minor (negligible) shift, similar to the 50/50 scenario in #929, and a p-value of 0.05 implies a significant shift, similar to the 98/2 scenario in #929.
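To make the 50/50 and 98/2 intuition concrete, here is a minimal sketch of the same two-sided percentile formula, using a plain value instead of the estimate object (the function name and synthetic data are illustrative, not DoWhy's API):

```python
import numpy as np

def two_sided_percentile_p(estimate_value, simulations):
    # Fraction of simulated values above the estimate, counting ties as half
    half_p = np.mean([(x > estimate_value) + 0.5 * (x == estimate_value)
                      for x in simulations])
    # Double the smaller tail to get a two-sided p-value
    return 2 * min(half_p, 1 - half_p)

# 50/50 scenario: the estimate sits at the median of the simulations
balanced = np.concatenate([np.full(50, 0.0), np.full(50, 2.0)])
print(two_sided_percentile_p(1.0, balanced))  # 1.0: no evidence of a shift

# 98/2 scenario: 98% of the simulations exceed the estimate
skewed = np.concatenate([np.full(98, 2.0), np.full(2, 0.0)])
print(two_sided_percentile_p(1.0, skewed))  # 0.04: a significant shift
```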
Thus, if the intention is to enforce a stringent test, then setting the current significance level at 0.95 (or even 0.85) could be justified. In that case, seeing 30% extreme cases would be deemed "lower than the significance level" and the refutation result would be marked as "significant", implying we do not tolerate such a shift. If this is the case, we could add this rationale to the comments accordingly.
However, if the aim is not to impose such a strict test, it might be more appropriate to revert the default significance level to the conventional 0.05 (which also aligns with the suggestion in the comments). In that scenario, a significant refutation result would indicate that the test outcome is extremely different from the original estimate, and a shift as in the 70/30 scenario would be considered acceptable (insignificant), just like a 0.95 p-value (the 50/50 case).
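The two readings of the threshold can be sketched as follows, assuming the refuter flags "significant" when the p-value falls below the significance level (the function name and numbers are illustrative):

```python
def is_refuted(p_value, significance_level):
    # Refutation flagged "significant" when the p-value falls below the threshold
    return p_value < significance_level

# A roughly 70/30 split gives a two-sided p-value of about 2 * 0.30 = 0.60
p_70_30 = 0.60
print(is_refuted(p_70_30, 0.95))  # True: the strict 0.95 threshold flags this shift
print(is_refuted(p_70_30, 0.05))  # False: the conventional 0.05 threshold tolerates it
```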
I would appreciate any clarification on this. Thanks!