py-why / pywhy-graphs

[Experimental] Causal graphs that are networkx-compliant for the py-why ecosystem.
https://py-why.github.io/pywhy-graphs/dev/index.html
MIT License
46 stars 8 forks source link

[ENH] Add ability to check validity of PAGs #73

Closed aryan26roy closed 10 months ago

aryan26roy commented 1 year ago

Closes #67

Changes proposed in this pull request:

Before submitting

After submitting

codecov-commenter commented 1 year ago

Codecov Report

Merging #73 (006c8e3) into main (8658c5b) will decrease coverage by 5.19%. The diff coverage is 7.10%.

@@            Coverage Diff             @@
##             main      #73      +/-   ##
==========================================
- Coverage   84.42%   79.23%   -5.19%     
==========================================
  Files          34       34              
  Lines        2555     2736     +181     
  Branches      687      750      +63     
==========================================
+ Hits         2157     2168      +11     
- Misses        251      421     +170     
  Partials      147      147              
Impacted Files Coverage Δ
pywhy_graphs/algorithms/pag.py 54.92% <7.10%> (-36.10%) :arrow_down:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

aryan26roy commented 1 year ago

@adam2392 I made this draft PR since I am uncertain about a couple of things:

adam2392 commented 1 year ago

Thanks for opening up the PR to get early feedback!

  • Is this the right place to put this function? (I thought of making it a method of the PAG class but you said you intend to do away with these specific classes, so chose to put it in this file instead.)

The location of the file is fine.

  • Right now I am only checking if the provided PAG has circle edges, is this enough?

No, you have to follow the theorem linked and then the definition of the MAG. Just cuz you have or don't have circle edges does not imply anything.

  • I implemented the first step of the theorem 2 from the linked paper. The second step requires getting rid of all unshielded colliders, for this I have to add edges between several node. What should these edges be? Directed? Undirected? Bidirected? If they are directed, in what direction?

I believe theorem 2 says o-> into -> and -o into ->, so directed, so I don't know if you did the first step correctly; the -o case is not handled. Take a look at the docstring of PAG to see how we implement these edge patterns.

Orienting to a DAG with no unshielded colliders means for circle triangles (circle triplets that are shielded) they can be oriented arbitrarily since a collider is possible but can't be detected due to the shieldedness. For circle triplets that are unshielded, they can be oriented arbitrarily as long as it's not a collider. So it's a combo of directed and bidirected edges. Maybe you can find the tetrad code that implements this for inspiration.

This step may require some generic graph traversal... this issue might actually be harder than I thought, but it's just a standard graph BFS/DFS traversal if you're familiar with graph algorithms. E.g. see the other algorithms, and what they do.

If this part is an issue, lmk. You can proceed as if tho there is a function that does the above.

aryan26roy commented 1 year ago

I found the tetrad code for turning a PAG into an MAG!

No, you have to follow the theorem linked and then the definition of the MAG. Just cuz you have or don't have circle edges does not imply anything.

I meant to ask if this check is enough as a pre-cursor to the actual validity check. I am assuming that a graph cannot be a PAG if it doesn't have any circle edges, and if so, it is automatically classified as an invalid PAG.

This step may require some generic graph traversal... this issue might actually be harder than I thought, but it's just a standard graph BFS/DFS traversal if you're familiar with graph algorithms. E.g. see the other algorithms, and what they do.

I am! (I took multiple classes on graphs and I have done a past project where I implemented some of the graph traversal algorithms for robotics application.) I still expect to encounter some difficulties though. I will look through the tetrad code and let you know if I am unable to understand what to do exactly after that.

aryan26roy commented 1 year ago

Ok so I looked it up and it turns out that a graph can be a PAG even if it doesn't have any circle edges.

adam2392 commented 1 year ago

Feel free to ping me when/if there's any issues or you need a review.

aryan26roy commented 1 year ago

@adam2392 sorry for the delay but I have been stuck with a work emergency. Will get back to this issue from Wednesday.

Feel free to ping me when/if there's any issues or you need a review.

Occasionally I have a problem understanding some theoretical things and I feel like GitHub is not the place clarify them. Is there a better way to communicate with you (like email or slack?) or would like me to ask all of those question here as well?

adam2392 commented 1 year ago

@adam2392 sorry for the delay but I have been stuck with a work emergency. Will get back to this issue from Wednesday.

Sounds good no rush!

Occasionally I have a problem understanding some theoretical things and I feel like GitHub is not the place clarify them. Is there a better way to communicate with you (like email or slack?) or would like me to ask all of those question here as well?

Feel free to summarize them here, in case other devs following e.g. @jaron-lee. If it warrants a call, then we can hop on discord. Are you on the discord for pywhy?

aryan26roy commented 1 year ago

Feel free to summarize them here, in case other devs following e.g. @jaron-lee. If it warrants a call, then we can hop on discord. Are you on the discord for pywhy?

I am not. But I will join now. Did not know a discord server existed for pywhy (I was expecting a slack channel but I like Discord more anyway :-))

aryan26roy commented 1 year ago

@robertness you're right. I have been planning to make major chunks of the implementation modular (starting from the meek rules).

aryan26roy commented 1 year ago

@adam2392 I am trying to import the function and use it in a script like this: pywhy_graphs.algorithms.is_valid_PAG(pag) The output is this error:

AttributeError: module 'pywhy_graphs.algorithms' has no attribute 'is_valid_PAG'

Do you know why?

adam2392 commented 1 year ago

@adam2392 I am trying to import the function and use it in a script like this: pywhy_graphs.algorithms.is_valid_PAG(pag) The output is this error:

AttributeError: module 'pywhy_graphs.algorithms' has no attribute 'is_valid_PAG'

Do you know why?

You have to expose it to the import module by adding an entry to __all__ inside the algorithms/pag.py file at the top.

aryan26roy commented 1 year ago

@adam2392 I think I have constructed the MAG correctly (I think the rule 4 is correctly implemented). Now how do I check the validity of an MAG? Can you explain it to me in a detailed way?

adam2392 commented 1 year ago

@adam2392 I think I have constructed the MAG correctly (I think the rule 4 is correctly implemented). Now how do I check the validity of an MAG? Can you explain it to me in a detailed way?

What regarding the MAG validity do you have a question on from the paper's definition?

Some things you might want to consider are to start adding unit tests for this function, so that way we can explicitly review when the function returns True/False. You can see how in the other functions, we explicitly construct graphs that meet some condition, and then does not.

adam2392 commented 1 year ago

Reposting here question from @aryan26roy:

I am going through the definition of MAGs in the paper and I have some questions about it.

  • What is an almost directed cycle?
  • Does "spouse" mean two nodes that have a bidirected edge?
  • I am a bit unsure about the m-seperation. Do you have a resource that has implemented a check for m-separation? > Or some other easier way understanding it?

In the paper:

We technically don't have functions implementing almost directed cycle, so we could create a function. It should be pretty straightforward. E.g.

def has_almost_directed_cycle(G):
     pass

Re m-separation: https://www.pywhy.org/pywhy-graphs/dev/generated/pywhy_graphs.networkx.m_separated.html#pywhy_graphs.networkx.m_separated

aryan26roy commented 1 year ago

Ok so I seem to have a picture of what I need to implement:

aryan26roy commented 1 year ago

@adam2392 The function that you gave a link to, requires 'z' a set of nodes that m-separate the two nodes. Do I pass an empty set there?

adam2392 commented 1 year ago

@adam2392 The function that you gave a link to, requires 'z' a set of nodes that m-separate the two nodes. Do I pass an empty set there?

What are you trying to do?

aryan26roy commented 1 year ago

@adam2392 What I am trying to ask is, given that the paper says: An ancestral graph is said to be maximal if for any two non-adjacent vertices, there is a set of vertices that m-separates them. The set of vertices is 'z' for that function. How do I get 'z'? And since the paper (and the function) is talking in terms of sets of nodes X and Y, how do I obtain this split? (One idea is that X contains one node and Y contains all the non-adjacent nodes from that one node)

adam2392 commented 1 year ago

An ancestral graph is said to be maximal if for any two non-adjacent vertices, there is a set of vertices that m-separates them. The set of vertices is 'z' for that function. How do I get 'z'? And since the paper (and the function) is talking in terms of sets of nodes X and Y, how do I obtain this split? (One idea is that X contains one node and Y contains all the non-adjacent nodes from that one node)

Naively check all possible combinations of subsets of vertices. There's probably a more efficient way of doing this I imagine tho.

adam2392 commented 1 year ago

I think the easier thing might be to start with a few tests where this should work and where it shouldn't work. You can start w/ small PAGs and use tetrad GUI to construct examples that you then add into a unit-test.

aryan26roy commented 1 year ago

Ah! Thank you for the suggestion @adam2392. I will have to setup tetrad before that but this will make my job much easier.

aryan26roy commented 1 year ago

@adam2392 I am facing a weird issue. In the middle of the function, I construct a temporary CPDAG. When I print out the edges of the CPDAG, this is what I get:

{'directed': OutEdgeView([('w', 'z'), ('z', 'x'), ('x', 'xy'), ('xy', 'w')]), 'undirected': EdgeView([])}

But when I draw the graph, I get two different graphs on different runs (both are wrong too):

weird2

weird

Note that the edge list remains the same in both the graphs.

adam2392 commented 1 year ago

@adam2392 I am facing a weird issue. In the middle of the function, I construct a temporary CPDAG. When I print out the edges of the CPDAG, this is what I get: But when I draw the graph, I get two different graphs on different runs (both are wrong too): Note that the edge list remains the same in both the graphs.

Do you have some reproducing MVP code for this? This seems like a bug in the draw function then.

Feel free to attend weekly meetings in the causal discovery channel on discord too

aryan26roy commented 1 year ago

@adam2392 I am sharing the MVP code for this through pastebin as it is too long for a github comment. I am planning to attend the last one but something else came in the way. I will be there for the next one!

adam2392 commented 1 year ago

@adam2392 I am sharing the MVP code for this through pastebin as it is too long for a github comment. I am planning to attend the last one but something else came in the way. I will be there for the next one!

For future, it might be helpful to keep the code a bit shorter to make the error apparent.

I took a look tho and it seems like there's probably just a bug in the CPDAG drawing. It is drawing extra edges it should not be. Are we able to replicate this with just two/three nodes?

aryan26roy commented 1 year ago

@adam2392 I have three questions for you.

aryan26roy commented 1 year ago

@adam2392 As for your question, yes, I am able to replicate the bug with just two three nodes.

adam2392 commented 1 year ago

@adam2392 I have three questions for you.

  • The paper says for m-seperation you check if every collider in the path has a descendant in the set 'z' or not. While the function you linked checks if the collider itself is in 'z' or not. Which is the correct check?

The function implements it correctly and checks for more than just collider status. https://github.com/py-why/pywhy-graphs/blob/8658c5b86a221285616f7bd944bd262614c947a7/pywhy_graphs/networkx/algorithms/causal/m_separation.py#L151-L165

Try looking up d-separation materials and reading up on why descendants open up collider paths.

  • To check for m-seperation what I did was instead of making random sets to use as 'z' I made a set of all the colliders in the graph. Now all the colliders in any path are necessarily in this set and all the non-colliders are necessarily not. I think this is a better way doing it, what do you think?

Why not just use minimal_m_separator, which performs searches to determine if a minimal m-separator exists?

@jaron-lee does minimal_m_separator work to just check if two nodes have an m-separating set? It just does the extra work to make sure it is also minimal right? But if two nodes are m-separated, then minimal_m_separator should return some set always?

  • You said that I will have to reconstruct the PAG from the MAG. Is that really important? Because my code does not track the changes made to the graph during the application of the meek rules. Isn't checking the validity of the MAG enough?

This is not true because the PAG reconstructed from the MAG may not match the PAG you fed in.

I think if you implement a few positive/negative test cases starting w/ what is easy, this will be easier to discuss. Do you have such unit tests yet? Do you mind pushing them up to the PR?

adam2392 commented 1 year ago

@adam2392 As for your question, yes, I am able to replicate the bug with just two three nodes.

Can you open a GH issue related to this? and paste the MVP code sample?

aryan26roy commented 1 year ago

Why not just use minimal_m_separator, which performs searches to determine if a minimal m-separator exists?

@adam2392 I didn't know of this function. I will use it now.

This is not true because the PAG reconstructed from the MAG may not match the PAG you fed in.

How come? I was under the belief that when you say 'reconstruct' the PAG, you want me to re-trace the steps followed throughout the function backwards. If I do that correctly, won't the reconstructed PAG always match the input PAG? And if not this, then what do you mean by reconstruct?

I think if you implement a few positive/negative test cases starting w/ what is easy, this will be easier to discuss. Do you have such unit tests yet? Do you mind pushing them up to the PR?

I do not have such test cases yet. Do you know where I can find test cases that have pre-tagged valid and non-valid PAGs?

adam2392 commented 1 year ago

I do not have such test cases yet. Do you know where I can find test cases that have pre-tagged valid and non-valid PAGs?

If you use Tetrad and draw an arbitrary graph with circle endpoints and then run the check via the GUI (I think it's called check valid graph coloring or something like that), you will be able to construct various cases.

aryan26roy commented 10 months ago

@adam2392 this branch has become too convoluted for me to keep everything straight. Do you mind if I delete this and start fresh on a new branch?

adam2392 commented 10 months ago

Yes that is a perfectly fine and acceptable practice