Refactor or redesign implementation of graphs with directed and undirected edges

jaron-lee commented 1 year ago

Is your feature request related to a problem? Please describe. At the moment there is a graph class pywhy_graphs.CPDAG that handles graphs with directed and undirected edges. However, such graphs appear in other contexts throughout causal inference; for example, chain graphs. These graphs have the same representation but have a distinct and separate meaning.

However, simply using the CPDAG class to represent chain graphs could be confusing as these concepts are distinct in the literature.

Describe the solution you'd like A simple way to address this is to rename the CPDAG class to something more generic. Unfortunately the correct term (Acyclic Directed Mixed Graph) already applies to graphs with bidirected edges in place of undirected edges; thus another term like "acyclic directed undirected graph" might be considered.

CPDAGs and chain graphs would use this new class to implement their graphical representations. The onus would be on the user to pass this to downstream functions in a way that makes sense (e.g. it would be up to the user to not take output from the PC algorithm and fit a chain graph model to it, for example).

Describe alternatives you've considered Implement a separate chain graph class, with potential refactoring of the reusable parts for both into a parent class. This would leave recognizable API graph classes that might help users who are familiar with the literature to use the software.

adam2392 commented 1 year ago

In a more general case, we also have the case of an ADMG and a MAG. In the case of a MAG, the difference is that there is only one edge allowed among nodes, and that there are no primitive inducing paths among non-adjacent nodes.

jaron-lee commented 1 year ago

So the way i would imagine it is we have a base directed-undirected edge graph class, and then constructors for each type of graph that enforce whatever requirements the specific graph requires. I'll take a stab at this some time this week.

adam2392 commented 1 year ago

So the way i would imagine it is we have a base directed-undirected edge graph class, and then constructors for each type of graph that enforce whatever requirements the specific graph requires. I'll take a stab at this some time this week.

Do we need a constructor for each type of graph necessarily? Perhaps just assume user knows what they are doing and then provide a function to check validity? Or is that too expensive...

Do you think we should also have a base directed-undirected-bidirected-circle graph class then as well? This could lead to a PAG, ADMG (causal DAG w/ bidirectional edges) and a MAG? We just need functional checks to make sure the corresponding PAG, ADMG and MAG are valid?

jaron-lee commented 1 year ago

This is sort of how ananke is designed - we have a base graph class that has all the edge types and then every kind of graph we actually use is a subclass of that. If we had a directed-undirected-bidirected-circle graph class then everything could just be constructed from that class I suppose. I imagine that for e.g. ADMGs you would just set the undirected and circle parts to empty graphs by default?

jaron-lee commented 1 year ago

@adam2392 I just reviewed the PAG code and there is a function is_valid_mec_graph that appears to enforce the validity check in the PAG class. Is that understanding correct?

I believe this only enforces the validity when adding edges. Probably poorly named. I think we can refactor these names once we decide on a proper class for:
* mixed-edge graphs that support Chain Graphs, CPDAG, PDAGs and their extensions

* mixed-edge graphs that support ADMG/DAGs and their extensions

* mixed-edge graphs that support PAGs and their extensions
What are some good generic names for the directed/undirected graph and the directed/bidirected graph and the directed/bidirected/circle/undirected graph?

Referencing discussion in #67.

As far as names we could always go DUGraph, DBGraph, DBCUGraph. I am still of the opinion that we should create an API so that users don't have to call these graph types directly, but that we have better organization and less duplication of code in the implementation. Best of both worlds, I suppose. Consequently the naming shouldn't be that important.

So we have 4 types of edges - directed, bidirected, circle, and undirected. We could in theory create all 4 choose k subtypes for k > 0 (starting with the three that you mentioned above). But if we are creating e.g. a directed-bidirected graph, and then setting the bidirected argument to be empty, it seems somewhat arbitrary? Why would we not take the directed-bidirected-circle-undirected, and then just set all but directed to be empty grpahs?

It seems like we could create the one DBCUGraph with all four edge types and then for each kind of graph we actualyl want to consider (PAG, ADMG, DAG, CPDAG etc.) we create a wrapper around DBCUGraph that zeroes out the appropriate edge graphs.

If we are seriously following the networkx style of detaching the graphical representation from the operations on the graph (including checks such as validity) then this implementation should suffice, seeing as there should not be much need for any inheritance behaviors.

jaron-lee commented 1 year ago

I propose to start with the following, and will open a PR/RFC to this effect:

A new DiUnGraph (will take suggestions on names) that contains bidirected and directed edge types
Provide constructors for CPDAG/PDAG and ChainGraph (that construct a valid DiUnGraph of each type)
Implement functions that check validity of each of these graphs.

Future proposals:

Create DiBiGraph for ADMGs
DAGs should use DiGraph. There should be nothing in ADMG that we require for DAGs
DiBiCiUnGraph (will take suggestions on this name...) for PAGs and extensions.

adam2392 commented 1 year ago

As far as names we could always go DUGraph, DBGraph, DBCUGraph. I am still of the opinion that we should create an API so that users don't have to call these graph types directly, but that we have better organization and less duplication of code in the implementation. Best of both worlds, I suppose. Consequently the naming shouldn't be that important. It seems like we could create the one DBCUGraph with all four edge types and then for each kind of graph we actualyl want to consider (PAG, ADMG, DAG, CPDAG etc.) we create a wrapper around DBCUGraph that zeroes out the appropriate edge graphs.

To summarize and clarify, is your proposal open to the idea of either:

a series of base graph classes: DUGraph, DBGraph, DBCUGraph
one base graph class DBCUGraph

? I don't have a great reason to support one over the other currently :p. But see below on some cons of supporting explicitly the class types ADMG, PAG, DAG, CPDAG, etc. The graph classes are not guaranteed without a check to be a "valid ADMG/PAG/CPDAG/etc.". This is an issue that was raised in #67.

If we are seriously following the networkx style of detaching the graphical representation from the operations on the graph (including checks such as validity) then this implementation should suffice, seeing as there should not be much need for any inheritance behaviors.

The reason the networkx design is "nice" imo (but not ideal) is that checking for validity of a certain "type of graph" is quite expensive. E.g. if we want to ensure that a constructed nx.DiGraph is a DAG, then running nx.is_directed_acyclic(self) every single time an edge is added/removed is not good (note this is what they do in pgmpy's implementation, which is why I'm not a fan). The same logic extends to our cases.

adam2392 commented 1 year ago

As an aside, seeing the implementation of CG and CPDAG as a subclass of DUGraph, I'm pondering the tradeoffs here:

Support explicit causal graph classes: Can we guarantee at all stages of construction and modification of the graph that the graph is "valid"? Can we do this efficiently?
Don't support explicit causal graph classes: Can we make sure our documentation is clear and an API is very clear on how to construct various causal graphs and optionally check their validity?

If we can't arrive at a consensus here, we can discuss next week, or something?

jaron-lee commented 1 year ago

Apologies for taking a while to get back to this. This would be worth talking about actually, if you have some time tomorrow after the usual meeting. Or happy to discuss another time if that doesn't work.

Regarding the proposal of DUGraph + all variants versus DBCUGraph, I am actually not super sure myself. I do think not supporting explicit graph types ADMG or CPDAG would be weird interface-wise, but perhaps the right way to think about this if we don't implement any algorithms inside the graphs themselves.

For 1., maybe we can lazily evaluate the validity of the graph (e.g. algorithms don't assume it's a valid instance of the class)? I think you're right that e.g. checking validity at each step is probably not going to be efficient or very tedious to

I think 2. makes the most sense design wise. Perhaps we can talk about potential API designs?

adam2392 commented 1 year ago

cc: @robertness @bloebp @kunwuz

adam2392 commented 1 year ago

Discussion with @jaron-lee we have decided the following strategy:

Implement 4 private abstractions on top of MixedEdgeGraph that enables all causal-type graphs: DiBiGraph, DiUnGraph, DiBiUnGraph, DiBiUnCiGraph
Still support specific causal graphs and keep API as is... for now while noting in all the documentation that VALIDITY IS NOT GUARANTEED unless they call specific functions that check it. E.g. is_valid_cpdag(...), is_valid_pag(...), which are still TBD implemented

In the longer-term, we can consider removing all the specific causal graph class implementations, so that way it's more like networkx, but in the short-term, we can just rely on this approach to be internally lean, and rely on user-feedback to guide this decision.

Feel free to add anything I missed.

jaron-lee commented 1 year ago

Check bnlearn, and Meek for the CPDAG definition.

py-why / pywhy-graphs

Refactor or redesign implementation of graphs with directed and undirected edges #65