google / guava

Google core libraries for Java
Apache License 2.0
50.02k stars 10.86k forks source link

Feature request: directed acyclic graph implementation in common.graph #2609

Closed jbduncan closed 6 years ago

jbduncan commented 7 years ago

As it turns out, since I originally posted #2411, the needs for my university project evolved a bit, as I realised I really needed a directed acyclic graph (DAG) rather than a tree.

I believe JGraphT provides an implementation of a DAG, but despite that I decided to settle on Guava 20-rc1 and common.graph recently (on one hand because I had licensing worries with JGraphT, and on the other hand because I think common.graph is currently a much cleaner alternative to both JGrapht and JUNG), so I would be excited to see DAG support in say (Graph|ValueGraph|Network)Builder, in a method like supportsCycles(boolean), so I don't have to implement cycle-checking support myself anymore.

What does the Guava team think of this request?

Bezier89 commented 7 years ago

Glad to hear you like common.graph!

This is one thing we considered as an option on the *Builders, but is not in the current release because it requires either A) additional memory overhead (e.g. a UnionFind type data structure kept in sync with the graph) or B) a much slower addEdge() method (checking for a cycle on each add).

There is Graphs.hasCycle(), which if you called after every mutation is closer to the second solution.

jrtom commented 7 years ago

For trees, there is a more efficient way of doing this, but it requires that you build the graph starting from the root node(s).

Basically, you have a somewhat different mutate API, and you keep track of node depth inside the implementation: you either add a root (which is at depth 0) or you add a "child edge" to a node (which gives the child depth 1 + the parent).

For DAGs you'd obviously need something a bit more sophisticated for depth checks (plus a way to add an edge between two existing nodes) and I'm not sure if there's a way to make that work, but...maybe something like that might be at least less overhead than a UnionFind data structure.

It's been on my list of things that I'd like to support, but I don't know if it's going to make it into v21.

m-berling commented 7 years ago

In fact, a DAG would currently be the only item from the common.graph package I really need for my current project. For now I am using a class I have written myself, but I would gladly exchange it for a clean guava implementation. Thus, this addition would be much appreciated, thanks!

Bezier89 commented 7 years ago

I still highly recommend using common.graph rather than rolling your own. Especially in light of some research I did:

A disjoint-set/union-find data structure is sufficient for maintaining an undirected acyclic graph with some memory overhead but virtually (ignoring the inverse Ackermann function) no time overhead for adding edges. Although that's assuming insertions only (no deletions). However, even if we ignore the possibility or removing edges for now, the directed acyclic case is much harder. Maintaining the acyclic guarantee in real-time (that is, with each edge addition/remove) is an open problem in Computer Science with no known efficient algorithm:

http://cstheory.stackexchange.com/q/5179

Compared to the current O(1) runtime for adding edges, all currently known algorithms have unacceptable runtime to be part of a graph library in my opinion.

Common.graph is of course perfectly capable of representing DAGs (and in fact, many of its uses inside Google are DAGs), it just won't maintain the acyclic guarantee. However, if you're able to build up the entire graph at once, a good approach is to do so and then assert that Graphs.hasCycle() is false.

PS: There is currently an internal version of List<N> topologicalOrdering(Graph<N> graph) (where graph is a DAG), which we're hoping to add to Guava in v21.

jbduncan commented 7 years ago

@Bezier89 Regarding maintaining acyclic guarantees in real-time, have you read http://homepages.mcs.vuw.ac.nz/~djp/files/PK-JEA07.pdf?

Admittedly it only applies to DAGs, and I've not had a proper read through it, so I don't know the efficiency of the algorithm discussed, but it looks like it was written relatively recently (2007 I think), so I wonder if it would have useful information for you.

jrtom commented 7 years ago

@jbduncan I don't see how that paper helps. It's about maintaining a topological ordering of the nodes in a DAG as it's mutated (the mutations appear, at a first skim, to be presumed to preserve the DAG properties), not about guaranteeing that a graph continues to be a DAG as it's mutated.

jbduncan commented 7 years ago

Oh woops! Thanks for your feedback @jrtom. That'll teach me to actually read a paper before sharing it with others. :P

jrtom commented 7 years ago

No worries. I admit that I was skeptical, but it would have been really cool if it turned out to have been helpful. :)

My guess is that if you really want real-time DAG updates, you're either willing to pay the performance hit to do the hasCycle() check every time, or you're willing to do a lazy check once a batch of edges have been added (depending on whether it's more important that you be able to identify the specific edge that first triggered the problem, or have good performance). There might also be fast heuristic checks that could catch many common violations, but that sort of solution doesn't sound like a good fit for common.graph.

liach commented 7 years ago

@jrtom Is it possible to make an ImmutableDirectedAcyclicGraph that implements DirectedAcyclicGraph? An immutable one would be possible and it can be simply created via a GraphBuilder or from another graph.

jrtom commented 7 years ago

@liach I'm not sure what benefit you would expect to get from having a DirectedAcyclicGraph type (note that we don't even have a DirectedGraph type :) ). In particular, I can't think of any new method signatures offhand that such an implementation would have.

Can you say more about what you would want out of this?

liach commented 7 years ago

Provide a list to check whether the order of a list is possible; provide a comparator to form a list.

Bezier89 commented 7 years ago

I believe what jrtom was getting at is that you don't need that information encoded in the type (not to mention, ImmutableDirectedAcyclicGraph is a mouthful!). The current Graph interface can already be either directed or undirected (see the isDirected() method). Additionally, it can be acyclic, so it should satisfy your use case. However, something like "Acylic" is unlikely to be part of the interface type, since the methods an "Acyclic" graph would expose would be identical to a normal graph. We could potentially add an allowsCycles() method similar to allowsSelfLoops(), but that has some pretty hefty performance implications, per above discussion.

jrtom commented 7 years ago

@liach I'm not sure what you are referring to here when you say "a list", and even less sure what you mean by "Provide a list to check whether the order of a list is possible".

If what you're saying is that you want to be able to order the nodes, you can already do that with a comparator that you provide or according to their natural order if they are Comparable (and that doesn't have anything to do with whether the graph has any cycles or not).

liach commented 7 years ago
jrtom commented 7 years ago

@liach Your proposals are quite specialized. I would be surprised if there were very many people that would need what you're describing, so they're probably not a good fit for common.graph; while I'm generally in favor of code re-use, that sounds like something that I would expect users to write for themselves.

Can you tell me about the problem that you're trying to solve that's causing you to ask for these capabilities? The first one (providing a list), in particular, seems like a strange thing to want to do.

liach commented 7 years ago

Well, those are just derived from the nature of DAG. I actually want to ask @jbduncan how DAG would be used in his project, since he proposed this first.

jbduncan commented 7 years ago

Ah, okay, I was kind of hoping no-one would ask. As it turns out, after thinking about it some more and discussing it with my supervisor, I realised I didn't need a DAG after all - just a regular tree! So I'm actually not using a DAG in my project anymore. :P

I originally thought I needed a DAG because I imagined there would be scenarios where two directed branches moving outwards from the root of my "tree of messages" (https://github.com/google/guava/issues/2411#issuecomment-193399619) would need to merge back together into a single directed branch. If that were the case, a DAG would've let me model this easily, but soon afterwards I realised that branches never do merge back together (for technical reason(s) which aren't easy for me to pin down).

So I'm very sorry to say that I don't actually have anything useful to say @liach :/

jrtom commented 7 years ago

@liach I can't think of a single occasion in which I've ever seen someone working with DAGs want to be able to answer the question "can the nodes in this DAG be traversed in this order?". So, yes, it's a question that you could ask, but that doesn't mean that it's something that's generally useful. :)

Based on the research that @Bezier89 did, I am strongly inclined to suggest that DAGs, in particular, not be directly supported: if you want to check to see whether a graph that you've already built is a DAG, then you can easily do that.

The thing about how constraints such as "no parallel edges" (etc.) work is that they can easily and cheaply be checked as mutations happen; the acyclicity constraint can't be (as far as we know).

I mean, I suppose there is one way we could do this that would be reasonably efficient:

This would still be inefficient if someone called hasCycles() while the graph was being built, but if the primary use case is enabling efficient checks of this state, that would do it, and we could document this drawback.

The problem with this is that it would add complexity to the implementations (and additional surface area to the interfaces), and it would present a problem for people who wanted to implement these interfaces themselves: if they didn't essentially recapitulate this logic, hasCycles() would not reflect the cached state and would have the same cost every time. (The other alternative would be to put the hasCycles Boolean in the abstract classes themselves; I'd prefer not to add anything to those classes unnecessarily, but that would at least let the 'default' implementations be efficient as long as you were subclassing the appropriate Abstract class, and doing the right thing in the mutation implementations.)

Just to clarify: this API change would not come with an associated *Builder constraint, i.e., we would not provide any way to guarantee that cycle-inducing edges would be rejected (as self-loops are if allowsSelfLoops() is set to false in the Builder). We would only be providing a built-in way to check if a graph had cycles, that would not necessarily require a relatively expensive check of the entire graph.

So that's an approach that we could take, but I'm not yet convinced that it's the right thing to do. Whether this added complexity would be worth it would depend on how widely used it would be.

liach commented 7 years ago

In package collect, there is a class called SortedLists which assume input Lists for the methods in the class are all sorted. Is it possible to create a helper class like that one, probably called DirectedAcyclicGraphs?

liach commented 7 years ago

For the default hasCycle implementation, we can use Java 8 default methods for interfaces.

jrtom commented 7 years ago

@liach Technically we could, but we're unlikely to because we want to continue to be able to support Java 7, and we'd like to avoid drift between the Java 8+ and Java 7 versions.

Putting things in the Abstract implementations works fine. The overall issue is just a question of how much utility we'd be providing, and to how many users.

To be clear, the only way in which this helps at all is if you're going to be calling a few different methods that each want to ensure that they're being called on acyclic graphs, and it's not convenient (or possible) to pass along a guarantee of acyclicity, e.g.:

void A(Graph<N> graph) {
  Preconditions.checkArgument(!Graphs.hasCycles(graph));
  ...
  B(graph);
}

void B(Graph<N> graph) {
  Preconditions.checkArgument(!Graphs.hasCycles(graph));
  ...
}

If you're only going to be calling B (and only once), then you don't gain anything by the proposal above, since you do the cycle check once in that case. If you're calling A, and you don't own B (so you can't provide a method B-prime that skips the hasCycles check), then at that point you have to do the hasCycles() check twice and being able to cache the answer starts to become important.

liach commented 7 years ago

So is a class called AcyclicGraphs possible to prevent such duplicates? Users need to guarantee that their graph does not have cycles before calling methods in that class.

jrtom commented 7 years ago

@liach Not sure exactly what you're proposing. Are you suggesting a utility class with methods for graphs that are presumed to be acyclic (so they don't need to call Graphs.hasCycles())?

liach commented 7 years ago

Yes

jrtom commented 7 years ago

@liach That's not really in scope for Guava, and doesn't generalize well. common.graph doesn't plan to provide a wide array of methods that operate on DAGs; its purpose is primarily to provide a common language for working with graphs, not to be a full-service graph library like JUNG or JGraphT.

Stepping back a bit:

Right now, someone that wants to write code that operates only on acyclic graphs can, and should, document that it fails (has an infinite loop, gives nonsensical answers, throws an exception, etc.) if given a graph with cycles in it.

The only way to avoid that would be to provide a compile-time guarantee that the graph was acyclic, which we do not plan to do: common.graph doesn't even provide compile-time guarantees that a graph is (or isn't) directed! We gave a lot of thought to how to handle types and constraints, and finally came to the conclusion that having compile-time guarantees for anything more than the minimum (mutability, different graph types) would cause a combinatorial type explosion, and opted for providing methods (e.g., isDirected()) that allow for run-time type-checking.

So the farthest that we might go would be to provide another such method (hasCycles()) that would allow for such run-time type-checking. And in that case you would still need to document that the method would reject a graph with cycles; we'd just be providing the implementors of such methods with a way of doing an up-front check that caches the answer for a given graph in the graph after the question is first asked. (As opposed to requiring the user to cache the answer.)

liach commented 7 years ago

Indeed. DAG will hang on. But how about topological ordering functionality? That one is commonly used.

jrtom commented 7 years ago

It's likely that we will provide support for topological ordering, although that's an entirely separate issue from providing DAG support; if you want that to happen, please open a separate issue.

We're still figuring out what the boundaries of common.graph should be, though.

liach commented 7 years ago

I mention topological ordering because it is based on DAG. I will open a separate issue.

jrtom commented 7 years ago

Well, topological ordering isn't actually DAG-related: you can generate a topological ordering for a graph regardless of whether it's directed, undirected, cyclic, or acyclic.

But by all means do open a separate issue. :)

liach commented 7 years ago

It is actually DAG related. See Wikipedia: image

jrtom commented 7 years ago

@liach You're right, I'm not sure what I was thinking.

Thanks for filing a separate issue.

In any event, so far as DAG support is concerned, I think that the closest that we have to an actual proposal that we might consider doing would be what I outlined in the comment above: https://github.com/google/guava/issues/2609#issuecomment-260071602

jrtom commented 6 years ago

Closing this as out of scope for common.graph, if not 100% infeasible (see the discussion above).

jbduncan commented 6 years ago

That sounds fair enough to me. Thanks for taking the time to contribute to the earlier discussion @jrtom!

oliviercailloux commented 5 years ago

I would love to see #2609 comment implemented. I like to fail-fast, and this involves systematically checking whether a graph is acyclic on the boundaries of my API. This may imply a huge performance problem for my users.

To be concrete, imagine that a user calls my API repeatedly in a loop with a set of graphs that the user knows by construction are acyclic (e.g., a given graph is mutated between calls only by removing edges).

I can’t think of a reasonably easy and elegant way of supporting that use case with the current common.graph API while having acceptable performance (and failing fast if necessary).

This justifies the proposed addition to the common.graph API, IMHO.

liach commented 5 years ago

you've misunderstood the statement there. jrtom already clarified that jbduncan interpreted the document incorrectly. it's too expensive to fail fast, but failing on a final sort and listing strongly connected components is more feasible.

oliviercailloux commented 5 years ago

Well, the comment I refer to states: “that's an approach that we could take, but I'm not yet convinced that it's the right thing to do. Whether this added complexity would be worth it would depend on how widely used it would be.” I’m talking about that approach.

It’s not clear to me what document you refer to; if it’s about the research paper, my comment is unrelated to this paper.

To be clear, I am in favor of adding hasCycles() to the Graph and Network interfaces as suggested in the comment referred to.

jrtom commented 5 years ago

@oliviercailloux Are you writing a general-purpose library, or are you trying to solve a problem for your own code?

The approach outlined in this comment above is something that you can do by creating a class that wraps a graph object and also adds this hasCycles() method. It's true that this would be easier if we provided a ForwardingGraph class (which we might at some point to facilitate minor extensions like this) but it's not difficult.

A good argument that we should add this capability to common.graph itself would depend on the existence of a compelling (and relatively common) use case--better still, several of them--in which one would need to repeatedly interrogate the same graph (or its subgraphs) as to whether it had cycles, and in which it was not practical (and/or appropriate) to cache this state. So if you have such cases, please share them with us.

What problem are you trying to solve?

oliviercailloux commented 5 years ago

It is a general purpose library (well, it is in the making, so currently it is nothing much usable). An object GradingExecutor receives a graph (which must be a DAG), representing a "prerequisite" relation, where nodes are tasks to be executed. The graph comes from the user of the library and is built the way she wants.

I would like to check, when receiving the graph, that it is a DAG indeed, especially because it is unfortunately not clear from its type that it should be a DAG. (Of course it is conceptually clear when you think about it, and it is documented, but it is easy to overlook occasionally I think.)

I would prefer to avoid creating a wrapper graph class for this use case and stick to accepting standard google.common Graph instances. Assuming my user already has some logic in place using that library, she can then directly pass those instances to me, instead of wrapping them first in a new kind unfamiliar to her. That’s one important benefit I see in using (hopefully) standard libraries rather than each-library-its-own-types-of-graphs.

Now, to be honest, this is probably not the kind of compelling example you are looking for. Currently I run a topological sort of the graph upon receiving it, so checking for acyclicity has a very low supplementary cost (but it will likely change in the future when the library becomes more full-featured). And executing the tasks, in my case, will likely take much longer than any of these things (but not surely). And it is not very likely, I admit, that the user would repeatedly give an Executor slightly different graph instances, as I initially thought (but I suppose it could happen).

Still, I would find this added capability nice to have for more elegant and conceptually correct code. As a library writer, I would prefer to be sure I do not harm my user by trying to fail fast, including if she’s in some edge case.