py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License

Possible bug in the falsify dag notebook. #1010

Closed Nitesh-K-Singh closed 1 year ago

Nitesh-K-Singh commented 1 year ago

I am using this notebook (https://github.com/py-why/dowhy/blob/main/docs/source/example_notebooks/gcm_falsify_dag.ipynb) on my own dataset and DAG.

This is the error that I get:

ZeroDivisionError                         Traceback (most recent call last)
in ()
      1 g_true = graph
----> 2 result = falsify_graph(g_true, data_p, n_permutations=1000, plot_histogram=True)
      3 # Summarize the result
      4 print(result)

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/dowhy/gcm/falsify.py in falsify_graph(causal_graph, data, suggestions, independence_test, conditional_independence_test, significance_level, significance_ci, n_permutations, show_progress_bar, n_jobs, plot_histogram, plot_kwargs)
    628         summary[m][FalsifyConst.GIVEN_VIOLATIONS] = summary_given[m][FalsifyConst.N_VIOLATIONS]
    629         summary[m][FalsifyConst.N_TESTS] = summary_given[m][FalsifyConst.N_TESTS]
--> 630         summary[m][FalsifyConst.F_PERM_VIOLATIONS] = [
    631             perm[FalsifyConst.N_VIOLATIONS] / perm[FalsifyConst.N_TESTS] for perm in summary_perm[m]
    632         ]

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/dowhy/gcm/falsify.py in (.0)
    629         summary[m][FalsifyConst.N_TESTS] = summary_given[m][FalsifyConst.N_TESTS]
    630         summary[m][FalsifyConst.F_PERM_VIOLATIONS] = [
--> 631             perm[FalsifyConst.N_VIOLATIONS] / perm[FalsifyConst.N_TESTS] for perm in summary_perm[m]
    632         ]
    633         summary[m][FalsifyConst.F_GIVEN_VIOLATIONS] = (

Note that the command finishes all the iterations, but I then get a 'division by zero' error at the end:

Test permutations of given graph: 100%|██████████| 1000/1000 [14:58<00:00, 1.11it/s]
ZeroDivisionError: division by zero

amit-sharma commented 1 year ago

@eeulig can you take a look?

eeulig commented 1 year ago

Thanks for reporting @Nitesh-K-Singh! Could it be that your DAG does not imply any d-separations (e.g. a fully-connected DAG)? That would lead to the reported error, because the fraction of LMC violations is computed by dividing by the number of tests, which is then zero. I'll work on a fix in the next few days. However, if your DAG does indeed imply no d-separations, then there is nothing we can test, and you cannot expect any insight from our metric.
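
To illustrate why a fully-connected DAG leaves nothing to test, here is a minimal sketch in plain Python. It is not DoWhy's implementation: the helper names `descendants` and `lmc_tests` are hypothetical, and the sketch assumes the local Markov condition (LMC) is read as "each node is independent of its non-descendants given its parents", with parents themselves excluded from the tested pairs.

```python
def descendants(edges, node):
    """Return all nodes reachable from `node` by following directed edges."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def lmc_tests(nodes, edges):
    """List (node, non_descendant, parents) triples implied by the LMC.

    Hypothetical helper for illustration only; not part of DoWhy.
    """
    tests = []
    for node in nodes:
        parents = {p for p, c in edges if c == node}
        excluded = descendants(edges, node) | parents | {node}
        for other in nodes:
            if other not in excluded:
                tests.append((node, other, frozenset(parents)))
    return tests


nodes = ["A", "B", "C"]
fully_connected = [("A", "B"), ("A", "C"), ("B", "C")]
chain = [("A", "B"), ("B", "C")]

# Every pair is adjacent in the fully-connected DAG, so no test is implied.
print(len(lmc_tests(nodes, fully_connected)))
# The chain implies one test: C is independent of A given B.
print(lmc_tests(nodes, chain))
```

With zero implied tests, any ratio of the form violations / tests is undefined, which matches the division by zero in the traceback.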

Nitesh-K-Singh commented 1 year ago

This (https://colab.research.google.com/drive/1ivFrOx0bV5ixq3hY2mLUyF2AwZcb8-KC#scrollTo=AcLFmXHVrv-t) is the graph that I am using (with the variables' names changed for anonymity).

In the attached notebook, I have used dagitty to check all the implied independence conditions.

Also, the graph above is more or less the 'true' graph, as these variables are separated in time and in the way they are related. I reversed all the edges of node X_0 (the variable representing this node is the main outcome we are interested in, and so is collected with a ~4-week lag compared to the others), but I got the same error.

Will be happy to provide any other details that might help you solve this.

eeulig commented 1 year ago

Thanks for providing more details. The issue is that the variable names in your graph must match the column names in your data. But when you define your gml string you have additional spaces for all of your variables except for "X_0":

gml_string = """graph [
  directed 1

  node [
    id 0
    label "X_0"
  ]

  node [
    id 1
    label "X_1 "
  ]

  node [
    id 2
    label "X_2 "
  ]

  node [
    id 3
    label "X_3 "
  ]

  node [
    id 4
    label "X_4 "
  ]

  node [
    id 5
    label "X_5 "
  ]

  node [
    id 6
    label "X_6 "
  ]

  node [
    id 7
    label "X_7 "
  ]

  node [
    id 8
    label "X_8 "
  ]

  node [
    id 9
    label "  X_9 "
  ]

  node [
    id 10
    label "X_10 "
  ]

  node [
    id 11
    label "X_11 "
  ]

If you remove those additional spaces, everything works as expected (see here). Actually, we raise a warning if we don't find data for some node (and consequently cannot test independence). However, since we parallelize the tests using joblib we need to control the warning behavior using an environment variable for those subprocesses (see here and here for more information). If we put

import os, warnings
warnings.simplefilter("default")
os.environ["PYTHONWARNINGS"] = "default"

before your other Python code, this gives me the following output in your notebook (with the wrong node names):

/usr/local/lib/python3.10/dist-packages/dowhy/gcm/falsify.py:841: UserWarning: WARN: Couldn't find data for node X_8 . Skip this test.
  warnings.warn(f"WARN: Couldn't find data for node {node}. Skip this test.")
/usr/local/lib/python3.10/dist-packages/dowhy/gcm/falsify.py:841: UserWarning: WARN: Couldn't find data for node X_3 . Skip this test.
  warnings.warn(f"WARN: Couldn't find data for node {node}. Skip this test.")
/usr/local/lib/python3.10/dist-packages/dowhy/gcm/falsify.py:841: UserWarning: WARN: Couldn't find data for node   X_9 . Skip this test.
  warnings.warn(f"WARN: Couldn't find data for node {node}. Skip this test.")
/usr/local/lib/python3.10/dist-packages/dowhy/gcm/falsify.py:841: UserWarning: WARN: Couldn't find data for node X_1 . Skip this test.
  warnings.warn(f"WARN: Couldn't find data for node {node}. Skip this test.")
/usr/local/lib/python3.10/dist-packages/dowhy/gcm/falsify.py:841: UserWarning: WARN: Couldn't find data for node X_11 . Skip this test.
  warnings.warn(f"WARN: Couldn't find data for node {node}. Skip this test.")
/usr/local/lib/python3.10/dist-packages/dowhy/gcm/falsify.py:841: UserWarning: WARN: Couldn't find data for node X_10 . Skip this test.
  warnings.warn(f"WARN: Couldn't find data for node {node}. Skip this test.")
/usr/local/lib/python3.10/dist-packages/dowhy/gcm/falsify.py:841: UserWarning: WARN: Couldn't find data for node X_2 . Skip this test.
  warnings.warn(f"WARN: Couldn't find data for node {node}. Skip this test.")
/usr/local/lib/python3.10/dist-packages/dowhy/gcm/falsify.py:841: UserWarning: WARN: Couldn't find data for node X_4 . Skip this test.
  warnings.warn(f"WARN: Couldn't find data for node {node}. Skip this test.")
/usr/local/lib/python3.10/dist-packages/dowhy/gcm/falsify.py:841: UserWarning: WARN: Couldn't find data for node X_5 . Skip this test.
  warnings.warn(f"WARN: Couldn't find data for node {node}. Skip this test.")
/usr/local/lib/python3.10/dist-packages/dowhy/gcm/falsify.py:841: UserWarning: WARN: Couldn't find data for node X_6 . Skip this test.
  warnings.warn(f"WARN: Couldn't find data for node {node}. Skip this test.")
/usr/local/lib/python3.10/dist-packages/dowhy/gcm/falsify.py:841: UserWarning: WARN: Couldn't find data for node X_7 . Skip this test.
  warnings.warn(f"WARN: Couldn't find data for node {node}. Skip this test.")
Test permutations of given graph: 100%|██████████| 100/100 [00:01<00:00, 62.27it/s]
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-18-3a348620a7a9> in <cell line: 2>()
      1 # Run evaluation for consensus graph and data.
----> 2 result_falsify = falsify_graph(graph, data, n_permutations=100,
      3                               independence_test=_gcm_linear,
      4                               conditional_independence_test=_gcm_linear,
      5                               plot_histogram=True, n_jobs=1)

1 frames
/usr/local/lib/python3.10/dist-packages/dowhy/gcm/falsify.py in <listcomp>(.0)
    629         summary[m][FalsifyConst.N_TESTS] = summary_given[m][FalsifyConst.N_TESTS]
    630         summary[m][FalsifyConst.F_PERM_VIOLATIONS] = [
--> 631             perm[FalsifyConst.N_VIOLATIONS] / perm[FalsifyConst.N_TESTS] for perm in summary_perm[m]
    632         ]
    633         summary[m][FalsifyConst.F_GIVEN_VIOLATIONS] = (

ZeroDivisionError: division by zero

This tells us that there is a mismatch between the graph and the data provided. Alternatively, you can set n_jobs=1 when calling falsify_graph; this also avoids suppressing the warnings, since everything then runs in the main process.
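
A mismatch like this can also be caught before running the falsification at all. The following is a minimal sketch in plain Python, not part of DoWhy: `find_unmatched_nodes` is a hypothetical helper, and it assumes you already have the graph's node names and the data's column names as lists of strings (for example from `graph.nodes` and `data.columns`).

```python
def find_unmatched_nodes(node_names, column_names):
    """Map each node name without a matching column to a suggested fix.

    Hypothetical helper for illustration only. The suggestion is the
    whitespace-stripped name if that matches a column, otherwise None.
    """
    columns = set(column_names)
    report = {}
    for name in node_names:
        if name not in columns:
            stripped = name.strip()
            report[name] = stripped if stripped in columns else None
    return report


# Node names as they appear in the GML labels above (abbreviated),
# and column names as they would appear in the data.
node_names = ["X_0", "X_1 ", "  X_9 "]
column_names = ["X_0", "X_1", "X_9"]

for node, suggestion in find_unmatched_nodes(node_names, column_names).items():
    print(f"No data column for node {node!r}; did you mean {suggestion!r}?")
```

Using `repr` in the message makes stray leading or trailing spaces visible, which is easy to miss in the plain warning text.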

Nitesh-K-Singh commented 1 year ago

Thank you so much!

I removed the spaces from variable names and it worked. Also, the link to suppressing warnings during parallel execution is super helpful!