Closed AndrewC19 closed 1 year ago
Hi Andrew,
Thanks so much for your input! We really appreciate it.
Currently we are working on "code quality improvement project", and we will fix this issue during the project. (We already included this issue in our tracking doc, and assigned to the owner).
Currently our main focus is to "adding more tests to the causal-learn" project, as without reliable and automatic tests, it's hard to ensure the correctness of the code.
And of course, feel free to send a PR to help us make causal-learn better! :) We are more than happy to review. :)
And regarding the priority of this issue, it seems to me that users can reconstruct the graph from data, instead of calling "add_node()" / "remove_nodes()" right? I just want to make sure that no one is blocked by this issue now. :)
Hi,
No problems, and thanks for the quick response!
I'm not sure what you mean by reconstructing the graph from data. For my use case, I want to first create a graph by learning its structure from data (using the pc
algorithm). Then, I want to merge a series of binary indicator nodes into a single categorical variable. To achieve this, I do the following:
As you can see, this requires me to update the graph after removing a set of nodes. Is there a way around this for the time being?
Also, two questions regarding making a PR:
1) Do you have any contributor guidelines?
2) I believe the underlying problem here is the updating of the dpath
instance variable for GeneralGraph
. Specifically, the method reconstitute_dpath
is called from within remove_node
with the set of all graph edges as an argument. There doesn't appear to be any documentation for reconstitute_dpath
so I'm not sure what it should be doing and, therefore, how to fix it. Could you give me some guidance on what this method should be doing?
Hi Andrew,
Thanks for the clear explanation! Yeah, it seems in your use case, you need the remove_node() to behave functionally.
Regarding your first comment, I am wondering why you want to perform PC on the graph generated in this way. How about directly perform PC on graph with categorical variables, instead of the binary indicators? The problem of your current construction is that it violates the faithfulness assumption (i.e. drink_coffee, drink_tea, drink_water forms a deterministic relationship: drink_coffee + drink_tea + drink_water == 1). And as a result, conditioning on any two variables, the third variable is a constant, which is independent to any other variables.
"Do you have any contributor guidelines": currently we don't have one. You reminds me creating a contributor guideline. :) I am wondering in your experience, do you have an example about good contributor guidelines? We just started the "code quality control project", so lots of things are new. Previously each individual researchers contributed their own code to the package. So as you can see, the code style is not consistent across the packages, and we lack good tests. We are currently in progress to make the code quality of causal-learn much better. :)
Regarding making PR to causal-learn, currently what I am thinking is: 1) Follow "Google Python Style Guide" 2) send PR with good descriptions and test plan
Hi again,
Thanks for the in-depth response. In response to your comments:
Could you explain in slightly more detail how/why this violates the faithfulness assumption?
Could you explain how you would apply PC to data that contains categorical and continuous data? My understanding is that this doesn't work by default and that your categorical variables need to be pre-processed properly.
To this end, I have implemented the Multinomial Logistic Regression Test from the paper "Evaluation of Causal Structure Learning Methods on Mixed Data Types (Raghu, Poon, and Benos, 2018)". Following this approach, if I want to test whether X and Y are independent conditional on an adjustment set S, where X and Y are continuous and some variables in S are categorical, I can convert categorical S to binary indicator variables and apply linear regression. It also states that, if either X or Y are categorical, then we should apply logistic regression.
Based on your first comment, would this approach not violate the faithfulness assumption too?
I would suggest providing instructions on how to create a PR and the general expected quality (e.g. coding standards, maintaining test coverage, passing test cases). Here's an example of good contributor guidelines: https://github.com/pykale/pykale/blob/main/.github/CONTRIBUTING.md.
It might also be worth setting up some continuous integration tests with GitHub actions. This can be implemented so that whenever a PR is made, your tests are run automatically (potentially on multiple different operating systems and python versions) to ensure that your PR does not break the existing code.
Hope that helps!
Hi Andrew,
I can also give you an example: Consider we have two boolean variables "drink or not", "happy or not". And assume in the true graph, "drink" causes "happy". Now let's create the third variable "not drink" (i.e. "not drink" and "drink" must be one true and one false, they form deterministic relationships)
Consider running PC on the graph with these three variables: "drink or not", "not drink", "happy" ==> we will get that conditioning on "not drink", "drink" is constant, thus "happy" and "drink" is independent, so there won't be an edge between.
This violates faithfulness assumptions, i.e. these independence found in the data are NOT due to separations in the true causal graph.
Hope this helps.
Hmm, I am curious about this. I've tagged experts regarding this questions in another PR. I don't know what's the best way to run PC in causal learn with those data types.
thanks! that's super helpful! :)
Hi again,
Thanks so much for your help, that's much clearer now. I really appreciate you taking time to respond to my questions in such detail.
I was wondering whether it would be possible to arrange a virtual meeting? I am planning to conduct an experiment as part of my thesis that evaluates how well different causal discovery algorithms apply to different types of computational model data. I am fairly new to causal discovery and would greatly appreciate some help with a few questions that aren't necessarily related to causal-learn.
Hi Andrew,
I'd like to help, but I am also relatively new to causal discovery (I just started my PhD last August, and my first year project is not very causal-related), so I am not sure whether I have the ability to help you.
The good news is: my team has lots of causal experts --- how about you sending me the questions via email (yewenf AT andrew.cmu.edu), and let me see whether I know the answers or I can ask some causal experts in my team. :)
We can discuss more in the emails --- let's the thread here to focus on causal-learn related issues, so when others check this issue later, they won't be distracted. :)
Hi Andrew,
Thank you for pointing out this issue! It seems that there are some bugs in remove_node(). For example, when removing the node, it does not update the 'dpath'. If you would like to reconstruct the graph, I think it needs to fix these bugs first. If you have any ideas or updates about fixing the bugs in remove_node(), it would be great if you send RP.
Another solution is, that you can preprocess the data of the categorical variables (e.g. drink), and then apply PC to the new data to output the graph.
Hi,
Your proposed solution is what I am doing at the moment; I preprocess my data to convert categorical variables to binary indicators. However, @tofuwen pointed out that this approach would lead to problems as the inclusion of binary indicator variables violates the causal faithfulness assumption.
I'm a little bit confused with how best to proceed!
Hi Andrew,
I am sorry that I'm confused about your task. For the causal faithfulness assumption, please refer to the paper that @tofuwen mentioned. You can send the detail of your task and your questions to @tofuwen and me (chenweiDelight AT gmail.com)via email. We can discuss more in the emails.
If I attempt to add a new node to a
GeneralGraph
usingadd_node
after calling the methodremove_nodes
, I get the following error:The more detailed error message:
Minimal example to reproduce:
It appears that
self.dpath
is not being updated correctly upon callingremove_nodes
. If you compare its dimensions toself.graph
you can see that it has not deleted the paths involving the removed nodes. For example, in the above example we removed 5 nodes but if we callprint(graph.graph.shape, graph.dpath.shape)
, we get the following:(5, 5) (10, 10)
.