py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License
7.01k stars 922 forks source link

Proposal: Finalize functional API refactor - deprecate causal graph #943

Closed bloebp closed 9 months ago

bloebp commented 1 year ago

Open task is still to replace the usage of CausalModel in the tests and notebooks. There are also some smaller details with the identification methods, which should be double checked.

bloebp commented 1 year ago

Thank you for this extensive PR, @bloebp . Looks great. I added some inline comments. Three high-level comments below:

1. It will be nice to explicitly mention which user-facing functionality will change as a result of this PR? Perhaps you can update the [functional api notebook](https://github.com/py-why/dowhy/blob/main/docs/source/example_notebooks/dowhy_functional_api.ipynb) and remove the CausalGraph object. That will be a good self-contained example to understand the change.

2. I see a few edits to PredictionModel and other classes. They don't seem related to causal graph. Is it possible to split them up as a separate PR?

3. The build test is failing because some notebooks may need to be updated too.

Thanks for looking over this Amit! Just to make sure: Did you double check the changes in the identification methods? I want to make sure that I don't accidentally introduce some bugs there, which will be very difficult to spot afterwards.

bloebp commented 1 year ago

One question: How do we expect users to specify the subset of observed nodes in a graph? In the earlier code, users could specify an attribute in the dot format (observed=yes/no) that allowed them to specify the observed nodes in the graph itself. Should we support that too? Or shall we ask users to specify the observed nodes in python code?

Currently, the user has to specify the observed_nodes, all other nodes in the given graph are then assumed to be unobserved. Unfortunately, if there are multiple methods requiring this input, the user would need to pass it to every method. I was thinking of keeping the 'observed' property in the graph, but then one would need to initialize the graph object again with a helper function. Also a possible way, but maybe less intuitive/clear.

I did not carefully check the functions in graph.py. I assume that they are taken directly from causal_graph.py with minimal change?

Yes, I copied it 1:1 from the causal_graph.py class with the thought of refactoring this another time. I think there are some duplication and some things can be simplified. But maybe too much for this PR then :)

Regarding the failing notebooks, adjusting all the examples and notebooks will require some time. Planning to look at this in the next weeks.

amit-sharma commented 1 year ago

Currently, the user has to specify the observed_nodes, all other nodes in the given graph are then assumed to be unobserved. Unfortunately, if there are multiple methods requiring this input, the user would need to pass it to every method. I was thinking of keeping the 'observed' property in the graph, but then one would need to initialize the graph object again with a helper function. Also a possible way, but maybe less intuitive/clear.

Agree that initializing the graph object with a helper function would be less intuitive to new users. But then it may be too cumbersome to keep specifying observed_nodes for expert users. How about the following specification as a tradeoff:
we declare to users that there are two ways of specifying observed_nodes:

  1. adding it as a parameter each time (Recommended for new users)
  2. adding it as a property "observed=yes/no" on the node in the networkx graph object

In each of the DoWhy functions, the observed_nodes parameter, if set, is used. If it is not set, then we check whether the networkx graph has node property "observed" set. If it is available, we use that. If not, we raise an error. The key part is that DoWhy never updates the graph properties. It is a read-only property that a user can specify on their own, if they want to avoid specifying the parameter (perhaps it is a helper function, but it is optional). Would this be a good tradeoff?

bloebp commented 1 year ago

Currently, the user has to specify the observed_nodes, all other nodes in the given graph are then assumed to be unobserved. Unfortunately, if there are multiple methods requiring this input, the user would need to pass it to every method. I was thinking of keeping the 'observed' property in the graph, but then one would need to initialize the graph object again with a helper function. Also a possible way, but maybe less intuitive/clear.

Agree that initializing the graph object with a helper function would be less intuitive to new users. But then it may be too cumbersome to keep specifying observed_nodes for expert users. How about the following specification as a tradeoff: we declare to users that there are two ways of specifying observed_nodes:

1. adding it as a parameter each time (Recommended for new users)

2. adding it as a property "observed=yes/no" on the node in the networkx graph object

In each of the DoWhy functions, the observed_nodes parameter, if set, is used. If it is not set, then we check whether the networkx graph has node property "observed" set. If it is available, we use that. If not, we raise an error. The key part is that DoWhy never updates the graph properties. It is a read-only property that a user can specify on their own, if they want to avoid specifying the parameter (perhaps it is a helper function, but it is optional). Would this be a good tradeoff?

Yes, we can do this as a trade-off. I am just wondering, how many functions in a "normal workflow" do we expect that a user needs to call where the observed_nodes variable is expected? If it is only 1-2 function calls, then I think its fine to provide the observed_nodes list for both calls. On the other hand, we also don't lose too much with that optional initialize method-

github-actions[bot] commented 1 year ago

This PR is stale because it has been open for 60 days with no activity.

amit-sharma commented 1 year ago

I will take a look and update this PR.

github-actions[bot] commented 11 months ago

This PR is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 10 months ago

This PR was closed because it has been inactive for 7 days since being marked as stale.

amit-sharma commented 9 months ago

Thanks for your patience @bloebp . I have made the following changes and the PR is ready to merge.

  1. Moved observed_nodes to be the third parameter. This allows future PRs to infer observed_nodes by default and then this parameter can be initialized as "None" or "default".
  2. I realized that there is nothing stopping us from allowing nx.digraph input in the CausalModel api. So I've added that too--it is a simple fix in the CausalGraph.py file.
  3. I have added a new test file to test that the functions in graph.py return the same result as the methods in CausalGraph
  4. Fixed some logical bugs in graph.py.
  5. Updated notebooks as needed due to the new graph API.

Comments welcome.