Can't add a new node to GeneralGraph after deleting edges

py-why / causal-learn

Causal Discovery in Python. It also includes (conditional) independence tests and score functions.

https://causal-learn.readthedocs.io/en/latest/

MIT License


Can't add a new node to GeneralGraph after deleting edges #41

Closed AndrewC19 closed 1 year ago

AndrewC19 commented 2 years ago

If I attempt to add a new node to a GeneralGraph using add_node after calling the method remove_nodes, I get the following error:

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 10 and the array at index 1 has size 5.

The more detailed error message:

Traceback (most recent call last):
  File <PATH_TO_SCRIPT>, line 20, in <module>
    graph.add_node(GraphNode("X11"))
  File "<PATH_TO_PROJECT>/ctf_venv/lib/python3.9/site-packages/causallearn/graph/GeneralGraph.py", line 205, in add_node
    dpath = np.vstack((self.dpath, row))
  File "<__array_function__ internals>", line 180, in vstack
  File "<PATH_TO_PROJECT>/ctf_venv/lib/python3.9/site-packages/numpy/core/shape_base.py", line 282, in vstack
    return _nx.concatenate(arrs, 0)
  File "<__array_function__ internals>", line 180, in concatenate

Minimal example to reproduce:

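from causallearn.graph.Edge import Edge
from causallearn.graph.Endpoint import Endpoint
from causallearn.graph.GeneralGraph import GeneralGraph
from causallearn.graph.GraphNode import GraphNode

# Generate a chain graph with 10 nodes (X1..X10) and 9 directed edges
nodes = [GraphNode(f"X{i}") for i in range(1, 11)]
graph = GeneralGraph(nodes)
for i in range(9):
    graph.add_edge(Edge(nodes[i], nodes[i + 1], Endpoint.TAIL, Endpoint.ARROW))
print(graph)

# Remove the first 5 nodes
nodes_to_remove = nodes[:5]
graph.remove_nodes(nodes_to_remove)

# Try to add a new node (X11); this raises the ValueError shown above
graph.add_node(GraphNode("X11"))
print(graph)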

It appears that self.dpath is not being updated correctly when remove_nodes is called. If you compare its dimensions to those of self.graph, you can see that the paths involving the removed nodes have not been deleted. In the example above we removed 5 nodes, but print(graph.graph.shape, graph.dpath.shape) gives: (5, 5) (10, 10).
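
The mismatch can be illustrated without causal-learn. The sketch below is a toy model, not the library's code: ToyGraph, its attributes and both removal methods are hypothetical stand-ins that only mimic a graph holding an adjacency matrix and a directed-path matrix of the same size.

```python
class ToyGraph:
    """Toy stand-in that keeps two square matrices, like graph and dpath.

    Hypothetical illustration only; this is not causal-learn's GeneralGraph.
    """

    def __init__(self, names):
        self.names = list(names)
        n = len(self.names)
        self.graph = [[0] * n for _ in range(n)]  # adjacency matrix
        self.dpath = [[0] * n for _ in range(n)]  # directed-path matrix

    def _kept_indices(self, to_remove):
        return [i for i, name in enumerate(self.names) if name not in to_remove]

    @staticmethod
    def _submatrix(matrix, keep):
        # Keep only the rows and columns listed in `keep`.
        return [[matrix[i][j] for j in keep] for i in keep]

    def remove_nodes_buggy(self, to_remove):
        # Shrinks `graph` but leaves `dpath` at its old size.
        keep = self._kept_indices(to_remove)
        self.graph = self._submatrix(self.graph, keep)
        self.names = [self.names[i] for i in keep]

    def remove_nodes_synced(self, to_remove):
        # Shrinks both matrices with the same index list.
        keep = self._kept_indices(to_remove)
        self.graph = self._submatrix(self.graph, keep)
        self.dpath = self._submatrix(self.dpath, keep)
        self.names = [self.names[i] for i in keep]

    def shapes(self):
        def shape(matrix):
            return (len(matrix), len(matrix[0]) if matrix else 0)
        return shape(self.graph), shape(self.dpath)


names = [f"X{i}" for i in range(1, 11)]
removed = set(names[:5])

buggy = ToyGraph(names)
buggy.remove_nodes_buggy(removed)
print(buggy.shapes())   # ((5, 5), (10, 10))

synced = ToyGraph(names)
synced.remove_nodes_synced(removed)
print(synced.shapes())  # ((5, 5), (5, 5))
```

Once the two matrices disagree in size, appending a row sized for one of them to the other fails, which matches the vstack error in the traceback. A real fix would probably also need to recompute the remaining path entries, since paths that ran through removed nodes no longer exist.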

tofuwen commented 2 years ago

Hi Andrew,

Thanks so much for your input! We really appreciate it.

We are currently working on a "code quality improvement project" and will fix this issue as part of it. (We have already added this issue to our tracking doc and assigned it to an owner.)

Our main focus at the moment is adding more tests to causal-learn, because without reliable, automated tests it's hard to ensure the correctness of the code.

And of course, feel free to send a PR to help us make causal-learn better! :) We are more than happy to review. :)

And regarding the priority of this issue, it seems to me that users can reconstruct the graph from data, instead of calling "add_node()" / "remove_nodes()" right? I just want to make sure that no one is blocked by this issue now. :)

AndrewC19 commented 2 years ago

Hi,

No problems, and thanks for the quick response!

I'm not sure what you mean by reconstructing the graph from data. For my use case, I want to first create a graph by learning its structure from data (using the pc algorithm). Then, I want to merge a series of binary indicator nodes into a single categorical variable. To achieve this, I do the following:

1. For each categorical variable, obtain its binary indicators, B (e.g. if my category is favourite drink, I might have three indicators drink_coffee, drink_tea, drink_water).
2. For each indicator in B, record its parents, P, and children, C.
3. Remove each indicator from the graph.
4. Add a new node to the graph for the category (e.g. drink).
5. Add an edge from all nodes in the set of parents P to the new node.
6. Add an edge from the new node to all nodes in the set of children C.

As you can see, this requires me to update the graph after removing a set of nodes. Is there a way around this for the time being?
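
For reference, the merge steps above can be sketched independently of any library. This is only an illustration under assumptions: the graph is represented as a plain set of (parent, child) pairs, and merge_indicators is a hypothetical helper, not part of causal-learn. It builds a new edge set instead of removing nodes in place.

```python
def merge_indicators(edges, indicators, merged_name):
    """Return a new edge set with `indicators` collapsed into `merged_name`.

    `edges` is a set of (parent, child) pairs describing a directed graph.
    Hypothetical helper for illustration; not a causal-learn function.
    """
    indicators = set(indicators)
    # Record parents and children of the indicators that lie outside the set.
    parents = {p for p, c in edges if c in indicators and p not in indicators}
    children = {c for p, c in edges if p in indicators and c not in indicators}
    # Drop every edge that touches an indicator node.
    kept = {(p, c) for p, c in edges
            if p not in indicators and c not in indicators}
    # Connect the merged node to the recorded parents and children.
    kept |= {(p, merged_name) for p in parents}
    kept |= {(merged_name, c) for c in children}
    return kept


edges = {
    ("age", "drink_coffee"),
    ("age", "drink_tea"),
    ("drink_water", "hydration"),
    ("drink_coffee", "alertness"),
    ("age", "alertness"),
}
merged = merge_indicators(
    edges, ["drink_coffee", "drink_tea", "drink_water"], "drink"
)
print(sorted(merged))
```

Note that if some node is a parent of one indicator and a child of another, this sketch produces edges in both directions between it and the merged node, so the result would need a separate check.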

AndrewC19 commented 2 years ago

Also, two questions regarding making a PR:

1) Do you have any contributor guidelines?
2) I believe the underlying problem here is the updating of the dpath instance variable for GeneralGraph. Specifically, the method reconstitute_dpath is called from within remove_node with the set of all graph edges as an argument. There doesn't appear to be any documentation for reconstitute_dpath so I'm not sure what it should be doing and, therefore, how to fix it. Could you give me some guidance on what this method should be doing?

tofuwen commented 2 years ago

Hi Andrew,

Thanks for the clear explanation! Yeah, it seems that your use case needs remove_node() to work correctly.

Regarding your first comment, I am wondering why you want to run PC on data constructed in this way. How about running PC directly on the categorical variables instead of the binary indicators? The problem with your current construction is that it violates the faithfulness assumption: drink_coffee, drink_tea and drink_water form a deterministic relationship (drink_coffee + drink_tea + drink_water == 1). As a result, conditioning on any two of them makes the third a constant, which is independent of every other variable.
"Do you have any contributor guidelines": currently we don't have any. You've reminded me that we should create some. :) In your experience, do you have an example of good contributor guidelines? We just started the "code quality control project", so lots of things are new. Previously, individual researchers each contributed their own code to the package, so, as you can see, the code style is not consistent across the package and we lack good tests. We are working to make the code quality of causal-learn much better. :)

Regarding making a PR to causal-learn, what I am currently thinking is: 1) Follow the "Google Python Style Guide". 2) Send a PR with a good description and test plan.

Regarding your last question, sorry, I am not an expert here either. Let me tag the owner here to better help you. cc @chenweiDelight

AndrewC19 commented 2 years ago

Hi again,

Thanks for the in-depth response. In response to your comments:

Could you explain in slightly more detail how/why this violates the faithfulness assumption?
Could you explain how you would apply PC to data that contains categorical and continuous data? My understanding is that this doesn't work by default and that your categorical variables need to be pre-processed properly.

To this end, I have implemented the Multinomial Logistic Regression Test from the paper "Evaluation of Causal Structure Learning Methods on Mixed Data Types (Raghu, Poon, and Benos, 2018)". Following this approach, if I want to test whether X and Y are independent conditional on an adjustment set S, where X and Y are continuous and some variables in S are categorical, I can convert the categorical variables in S to binary indicator variables and apply linear regression. The paper also states that, if either X or Y is categorical, we should apply logistic regression.

Based on your first comment, would this approach not violate the faithfulness assumption too?
I would suggest providing instructions on how to create a PR and the general expected quality (e.g. coding standards, maintaining test coverage, passing test cases). Here's an example of good contributor guidelines: https://github.com/pykale/pykale/blob/main/.github/CONTRIBUTING.md.

It might also be worth setting up some continuous integration tests with GitHub actions. This can be implemented so that whenever a PR is made, your tests are run automatically (potentially on multiple different operating systems and python versions) to ensure that your PR does not break the existing code.

Hope that helps!

tofuwen commented 2 years ago

Hi Andrew,

I briefly googled "violation of faithfulness because of deterministic relations" and found https://arxiv.org/pdf/2010.14265.pdf, which states: "This assumption can, however, be violated in many ways, including xor connections, deterministic functions or cancelling paths." I didn't find a good example in the paper, so you will probably need to dig into its references, but I hope it can serve as a good starting point.

I can also give you an example. Suppose we have two boolean variables, "drink" and "happy", and assume that in the true graph "drink" causes "happy". Now let's create a third variable, "not drink" (i.e. exactly one of "not drink" and "drink" is true, so they form a deterministic relationship).

Consider running PC on these three variables: "drink", "not drink" and "happy". Conditioning on "not drink", "drink" is constant, so "happy" and "drink" are found to be independent and there won't be an edge between them.

This violates the faithfulness assumption, i.e. these independencies found in the data are NOT due to separations in the true causal graph.
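
The example above can be checked with a small, hand-built data set. This is only a sketch under assumptions: the counts are made up for illustration, and covariance is used as a stand-in for a proper conditional independence test. It shows that "drink" and "happy" are dependent marginally, but that within each stratum of "not drink" the variable "drink" is constant and so carries no information. The last part applies the same check to the three drink indicators.

```python
# Hand-built (drink, happy) sample in which drinking raises happiness.
# The counts are invented purely for illustration.
samples = [(1, 1)] * 40 + [(1, 0)] * 10 + [(0, 1)] * 15 + [(0, 0)] * 35
# not_drink is a deterministic function of drink.
rows = [{"drink": d, "not_drink": 1 - d, "happy": h} for d, h in samples]


def mean(values):
    return sum(values) / len(values)


def covariance(data, x, y):
    mean_x = mean([r[x] for r in data])
    mean_y = mean([r[y] for r in data])
    return mean([(r[x] - mean_x) * (r[y] - mean_y) for r in data])


# Marginally, drink and happy are clearly associated.
marginal_cov = covariance(rows, "drink", "happy")

# Within each stratum of not_drink, drink is constant, so its covariance
# with happy is zero and the dependence disappears.
conditional_covs = {}
for value in (0, 1):
    stratum = [r for r in rows if r["not_drink"] == value]
    conditional_covs[value] = covariance(stratum, "drink", "happy")

# Same idea for three indicators: fixing any two determines the third.
categories = ["coffee", "tea", "water"]
one_hot = [
    {f"drink_{c}": int(c == choice) for c in categories}
    for choice in categories
]
water_given_rest = {}
for r in one_hot:
    key = (r["drink_coffee"], r["drink_tea"])
    water_given_rest.setdefault(key, set()).add(r["drink_water"])

print(marginal_cov, conditional_covs, water_given_rest)
```

A constant is independent of every other variable, so any conditional independence test that conditions on the redundant indicator will report independence here, regardless of the true causal structure.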

Hope this helps.

Hmm, I am curious about this too. I've tagged experts regarding this question in another PR. I don't know what the best way is to run PC in causal-learn with those data types.
Thanks! That's super helpful! :)

AndrewC19 commented 2 years ago

Hi again,

Thanks so much for your help, that's much clearer now. I really appreciate you taking time to respond to my questions in such detail.

I was wondering whether it would be possible to arrange a virtual meeting? I am planning to conduct an experiment as part of my thesis that evaluates how well different causal discovery algorithms apply to different types of computational model data. I am fairly new to causal discovery and would greatly appreciate some help with a few questions that aren't necessarily related to causal-learn.

tofuwen commented 2 years ago

Hi Andrew,

I'd like to help, but I am also relatively new to causal discovery (I just started my PhD last August, and my first year project is not very causal-related), so I am not sure whether I have the ability to help you.

The good news is that my team has lots of causal experts. How about you send me the questions via email (yewenf AT andrew.cmu.edu), and I'll see whether I know the answers or can ask some causal experts in my team. :)

We can discuss more over email. Let's keep this thread focused on causal-learn related issues, so that others who check this issue later won't be distracted. :)

chenweiDelight commented 2 years ago

Hi Andrew,

Thank you for pointing out this issue! It seems that there are some bugs in remove_node(). For example, when removing a node, it does not update 'dpath'. If you would like to reconstruct the graph, I think these bugs need to be fixed first. If you have any ideas or updates about fixing the bugs in remove_node(), it would be great if you could send a PR.

Another option is to preprocess the data for the categorical variables (e.g. drink) and then apply PC to the new data to output the graph.

AndrewC19 commented 2 years ago

Hi,

Your proposed solution is what I am doing at the moment; I preprocess my data to convert categorical variables to binary indicators. However, @tofuwen pointed out that this approach would lead to problems as the inclusion of binary indicator variables violates the causal faithfulness assumption.

I'm a little bit confused with how best to proceed!

chenweiDelight commented 2 years ago

Hi Andrew,

I am sorry, but I'm confused about your task. For the causal faithfulness assumption, please refer to the paper that @tofuwen mentioned. You can send the details of your task and your questions to @tofuwen and me (chenweiDelight AT gmail.com) via email. We can discuss more over email.