Improve graphlib.TopologicalSort by removing the prepare step

python / cpython

The Python programming language

https://www.python.org

Other

62.59k stars 30.03k forks source link

Improve graphlib.TopologicalSort by removing the prepare step #91301

Open larryhastings opened 2 years ago

larryhastings commented 2 years ago

BPO	47145
Nosy	@tim-one, @larryhastings, @ericvsmith, @pablogsal, @sweeneyde

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = 'https://github.com/larryhastings' closed_at = None created_at = labels = ['type-feature', 'library', '3.11'] title = 'Improve graphlib.TopologicalSort by removing the prepare step' updated_at = user = 'https://github.com/larryhastings' ``` bugs.python.org fields: ```python activity = actor = 'larry' assignee = 'larry' closed = False closed_date = None closer = None components = ['Library (Lib)'] creation = creator = 'larry' dependencies = [] files = [] hgrepos = [] issue_num = 47145 keywords = [] message_count = 12.0 messages = ['416182', '416306', '416321', '416322', '416323', '416324', '416404', '416405', '416407', '416408', '416592', '416594'] nosy_count = 5.0 nosy_names = ['tim.peters', 'larry', 'eric.smith', 'pablogsal', 'Dennis Sweeney'] pr_nums = [] priority = 'normal' resolution = None stage = 'needs patch' status = 'open' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue47145' versions = ['Python 3.11'] ```

larryhastings commented 2 years ago

I've maintained my own topological sort class for a while now. The API I came up with (or was it Tim and I?) was adopted for graphlib.TopologicalSort(). Some of my method names are slightly different, but the APIs are so similar, I can do a little renaming and run Lib/test/test_graphlib using my implementation. (For clarity I'm going to write the rest of this issue using the names from graphlib.)

Recently I made an improvement to my version: it no longer has the modal behavior where there's a "before prepare" phase where you add nodes and an "after prepare" phase where you can call get_ready and done.

Instead, in my new and improved version the graph is always maintained in a "ready" state. You can call add, get_ready, and done in any order, as much as you like. prepare doesn't do anything besides the cycle check--but read on!

This approach required tweaking some semantics behind the scenes. My graph object now maintains an internal "dirty" flag. It's set to True after any edges are added, and only gets cleared after checking for cycles. prepare runs this cycle check, but get_ready *also* runs this cycle check. (Though in both cases, only when the dirty flag is set, of course.) In theory this means you no longer need the prepare call. However, it's still useful, because you can call it to check for a CycleError.

So, on the one hand, this means that get_ready can throw a CycleError, which it never did before. On the other hand, it won't throw for existing users of the library, because they never add edges after their prepare call. So this semantic change shouldn't break existing users, assuming they're not misusing the existing API.

Another wrinkle: adding a dependency on a node that's already been handed out by get_ready (but hasn't been returned by done) is ignored. Again, not something existing users of graphlib can do, so this shouldn't break code.

There's one final semantic change I made worth knowing about: when you return a node via done, the graph simply forgets about it. This means you could add it a second time (!) and it could go for a complete ride through the graph again, wheeee! This also means that when all nodes have been returned by done, the graph is essentially returned to its initial pristine state. This too seems completely harmless, because with the existing library it's illegal to add nodes to a graph after calling prepare, so nobody's doing it, so it won't break any code.

I'm happy to contribute my version. But I was lazy and didn't worry about speed or memory implementation; it's strewn with sets and things. I figured I could always optimize it later. But "later" could become "now" if we were interested in trying to merge this behavior into Python's graphlib.

Interested?

ericvsmith commented 2 years ago

My personal usage of a topological sort are restricted to maybe 100 entries max, so I can't really test this in any meaningful way.

That, and the project that uses it is stuck on 3.6, so I haven't switched to the graphlib version yet.

sweeneyde commented 2 years ago

Out of curiosity, what are the use cases for adding nodes after get_ready has already produced nodes?

I was wondering about avoiding the need to call prepare() by having it automatically do the cycle-checking at the first get_ready() call and then raising ValueError if add() is called any time thereafter.

Assuming we do want to be able to add() after a get_ready(), is there a reason that "forgetting" already-produced nodes is the correct behavior, as opposed to remembering all nodes ever added, and raising iff the addition creates a cycle among all nodes ever added or depends on an already-yielded node?

sweeneyde commented 2 years ago

depends on an already-yielded node

I mean "creates a new not-yet-yielded dependency for an already-yielded node".

larryhastings commented 2 years ago

I'm using my graph library to manage a list of tasks that need doing in some sort of proper order. One task may spawn other tasks at runtime, and we don't necessarily know what the tasks will be until runtime. It's way more convenient to simply add such tasks on demand, rather than trying to preemptively pre-add all such possible tasks before preparing the graph, or creating additional graphs.

For example, consider a tool that downloads and digests zip files full of music from an online store. Your initial tasks might represent "download the zip file", then "decompress and examine the contents of the zip file". You could then iterate over the contents of the zip file, adding different tasks based on what you find--one pipeline of tasks for media files (FLAC/MP3/OGG/etc), another for the playlist, a third if you don't *find* a playlist, a fourth for image files, etc. (Not that you'd necessarily write such a tool this way, but it's at least plausible.)

The new nodes needn't be connected to the existing nodes for this to still be useful. You could reuse the same graph object for all your tasks.

larryhastings commented 2 years ago

Assuming we do want to be able to add() after a get_ready(), is there a reason that "forgetting" already-produced nodes is the correct behavior, as opposed to remembering all nodes ever added, and raising iff the addition creates a cycle among all nodes ever added or depends on an already-yielded node?

I'm not sure "correct" applies here, because I don't have a sense that one behavior is conceptually more correct than the other. But in implementing my API change, forgetting about the done nodes seemed more practical.

The "benefit" to remembering done nodes: the library can ignore dependencies to them in the future, forever. Adding a dependency to a node that's already been marked as "done" doesn't make much conceptual sense to me, but as a practical thing maybe it's useful? I'm not sure, but it doesn't seem all that useful.

I can only come up with a marginal reason why remembering done nodes is useful. Let's say all your tasks fan out from some fundamental task, like "download master list of work" or something. One of your tasks might discover additional tasks that need to run, and conceptually those tasks might depend on your "download master list of work" task. If the graph remembers the done list forever, then adding that dependency is harmless. If the graph forgets about done nodes, then adding that dependency could re-introduce that task to the graph, which could goof things up. So maybe it's a little bit of a footgun? But on the other hand: you already know you're running, and you're a task that was dependent on the master list of work, which means you implicitly know that dependency has been met. So just skip adding the redundant dependency and you're fine.

On the other hand, forgetting about the nodes has a definite practical benefit: the graph consumes less memory. If you use a graph object for a long time, the list of done nodes it's holding references to would continue to grow and grow and grow. If we forget about done nodes, we free up all that memory, and done membership testing maybe gets faster.

I guess I'm not married to the behavior. If someone had a great conceptual or practical reason why remembering the done nodes forever was better, I'd be willing to listen to reason.

larryhastings commented 2 years ago

Having slept on it, I think the footgun is slightly bigger than I gave it credit for. Also the problem of nodes lingering forever is easily solved: give the user control.

My next iteration will keep the done nodes around, but I'll also add a forget() method that compels the graph to forget all done nodes.

tim-one commented 2 years ago

Various kinds of tasks:

"Power switch must be on." Needs to done the first time. _May_ need to be done again later (if some later task turns the power off again). Can be done any number of times without harm (beyond the expense of checking), so long as the switch is on.
"Ensure gas tank is full." Probably needs to be done anew for every new added task that depends on it.
"Remove outermost layer of skin." Probably shouldn't be done more than once ;-)
"Flip Najdorf switch." Who knows? Doing it a second time - or failing to do it a second time - may be necessary, harmless, or deadly.

I'd rather not bother with any of this dicey guessing. While some dynamism may be attractive, what it all "should mean" appears to be a rat's nest, and depends on domain knowledge graphlib can't possibly have.

I doubt there is a compelling default. As is, two tasks are considered to be "the same" task if and only if they compare equal, so that's the least _surprising_ default for tasks added later. "Remove outermost layer of skin"

"Two tasks that compare equal may or may not be considered the same task, depending on the execution history at the time the question is posed" is at best expedient, at worst disastrous.

larryhastings commented 2 years ago

I'm not sure I follow you. What do you suggest graphlib is guessing at? I thought we were discussing what graphlib's (documented, intentional) behavior should be; if there was any guessing going on, it was us doing it, guessing at what behavior the user would prefer.

I appreciate you corresponding on the issue, but I'm having difficulty understanding what light you're shining on the problem.

tim-one commented 2 years ago

I believe I'm elaborating on your "footgun".

It doesn't matter to me whether we pick some scheme and document it, _if_ that scheme is incoherent, impossible to remember, error-prone, etc.

That's how, e.g., regular expression syntax was designed. "I know! ++*??+) doesn't have a coherent meaning now, so let's say it means to match a prime number of the pattern it follows" ;-)

That is, an error-prone extension is worse than no extension at all.

The algorithm now isn't guessing at anything: it's saying up front that two tasks are the same task if and only if they compare equal. Context, execution history, ..., nothing else is relevant. It's simple. Complicating a simple thing may open new possibilities, but creates new traps too.

One trap is pretty obviously making "the rules" for when two tasks are the same task depend on the execution history at the time the question is asked.

That goes away, though, if the current rule is retained ("the same iff =="), but can be explicitly overridden by .forget() (or some such).

That doesn't make it a no-brainer, though. For example, do

.add(A, B)

and run until A and B are marked done. Now do

.add(B, C)

What then? We're back in a guessing game again. We've just been told that B depends on C first. But we already did B. Depending on _domain_ knowledge we cannot have, that may or may not be worthy of raising an exception.

You can make up any rule you want about that and arbitrarily declare victory, but the chance that a user will realize it's not the behavior _their_ domain needs is approximately 0 until after they've been burned by it. So now it's error-prone, at least to them.

FWIW, in that specific case I'd say "tough beans - you told us too late that B depends on C to stop B the first time, but now that you've told us we'll respect it in future".

Another twist: assuming that's "the rule", what does

.add(B, C)

really mean? If B really is "the same task" as it was the first time around, well, it's already been marked done. Are we supposed to do it _again_ now? Why? Why not?

It's hard for me to find any _compelling_ sense here - just masses of more-or-less arbitrary implementation decisions. In which case "hard to remember" follows soon after.

None of that haunts the current API. It all follows quite directly from what "depends on" means to essentially everyone.

larryhastings commented 2 years ago

I agree that the API should have as few surprises as possible. AFAICT you haven't made any terminal pronouncements like "it's impossible to add this feature without too many unacceptable surprises", so I'll proceed assuming we can find an API (and semantics) that we're all happy with.

I've come around on the idea that forgetting "done" nodes is too surprising. The least surprising behavior is: once we've added a node to the graph, the graph remembers it.

Now I'll propose a second simple rule, which you've already touched on: you can't add a dependency after a node has been returned by get_ready(). Attempting this always throws an exception; which exception specifically TBD. "Tough beans".

I think those two rules are easy enough to remember and to reason about. It meaningfully expands the utility of the library with minimal surprise. The library behaves identically for users of the existing API but permits new use cases too. Adding this functionality would therefore mean fewer users would discover too late their use case isn't supported.

As far as my "long-lived graph objects will consume more and more memory over time" caveat, there's a better solution than "forget": graph.remove(*nodes). (My version already has a "remove" method, and I forgot that graphlib's doesn't.) Allowing the user to remove a node from the graph gives them explicit control, and the semantics should be obvious and unsurprising; if you--the user--remove a node, and later you--the user--re-add that node to the graph, it behaves identically to any other node the graph has never seen before. Removing a node intuitively removes all edges to that node.

Two notes on "remove" if we decide to go that route. First, I'd ensure you can remove a node at any time. Nodes have three externally visible states wrt TopologicalSort:

1) added but not published by get_ready, 2) published by get_ready but not returned using done, and 3) done. You should be able to remove a node in any of those three states.

Removing a node in 2) should be equivalent to calling done before calling remove; that is, if you're removing the node anyway, you don't need to call done.

Second, the current underlying implementation would make remove really slow. Nodes don't remember their predecessors, only their successors, so removing a node would be O(n). If we expected remove to get a lot of use, we'd probably want to change how the graph is stored to speed up this operation.

larryhastings commented 2 years ago

One final aside. Let me preface this by saying: I'm not proposing the following for graphlib.TopologicalSort. It's obviously too late to change that object this much. It's just something I'm thinking about--maybe I'll use this in my own library.

Where we are now, the graphlib.TopologicalSort object is simultaneously a static representation of a graph--which the object doesn't provide a public API for you to examine!--and a stateful single-use iterator over that graph. My proposed change to the API seems to increase the tension between these two sets of semantics. Perhaps a better set of semantics, that more successfully maps my use case to the graph object, would be as follows:

* you can add() nodes and edges at any time.
* get_ready() always returns the current list of nodes with no prececessors.  There is no internal "we already told you about this one" state.
* replace done() with remove(), which removes the node and all edges from/to it from the graph.
* static_order() is still fine.

I think this would make it easy to reason about the object's behavior, and would be a better match to my use case where you're continually adding (and removing?) nodes, not just during an initial "populate the graph" phase.

Again, not appropriate to consider for graphlib.TopologicalSort.

larryhastings commented 2 years ago

I've continued thinking about it. Here's a summary of what I want to do.

Remove the requirement that the user prepare() the graph before calling get_ready() (or done()).
Remove the constraint that add() can't be called after calling prepare(). Adding a dependency on a node that's already been returned by get_ready() is a no-op. (It doesn't matter whether or not that node was subsequently returned with done().)
When adding an edge to the graph, the graph internally sets a "dirty" flag that's only cleared when you do a cycle check. prepare() explicitly calls the cycle check (and clears the dirty bit), but so do get_ready() and done().
- This means it's now possible for get_ready() and done() to throw a cycle detected exception. However, that only happens if edges are added after calling prepare(), which isn't possible with the current API. So this won't happen to existing use of the library, so we won't break existing code.
- In this new API, prepare() only performs a cycle check (when the graph is dirty). So I propose that prepare() be an alias for a new method check_for_cycles(). We could theoretically deprecate the name prepare() but I don't think we should bother.
Add a remove(*nodes) method. This removes all the listed nodes from the graph. All edges from and to each node are also removed.
- You can remove a node at any time.
- Removing a node that's not in the graph raises ValueError.
- Calling add() on a node that was previously removed behaves identically to adding any other new node. (Removing a node removes all state associated with that node.)
Move the get_ready() and done() calls into their own separate "iterator" object, and allow the user to create new "iterators".
- This isn't an "iterator" in the Python-protocol sense; it doesn't have a __next__() method. I agree that "iterator" is a bad name. This is just a temporary term until I find a better one, which I'll do before commit.
- You can create a new "iterator" at any time by calling the iterator() method on the TopologicalSort object.
- The iterator has two methods: get_ready() and done().
- The TopologicalSort object still has get_ready() and done() calls, but these become passthroughs to an internally pre-allocated iterator object. (This is similar to the random module and its "hidden instance of the random.Random class".)
- Every iterator independently iterates over the entire graph.
- There is no prepare() step necessary for an iterator object. Every iterator object is always maintained in a ready state.
- When you add or remove nodes or edges to a graph, all iterators on that graph see the changes to the graph immediately. (Contrast this with these alternate, but discarded, plausible semantics: "an existing iterator remembers the state of the graph when it was created, and yields the nodes based on that initial graph", or "if the graph changes while the iterator is iterating, attempting to use the iterator throws an exception, like modifying a dict or list while iterating over it".)

I don't propose any semantic changes to is_active() or static_order().

I suspect these changes will require reworking the implementation. Removing a node with the current implementation would be slow, and the code for the new iterator will be very different from the existing get_ready() and done() calls under the hood.

tim-one commented 2 years ago

Still seems to me a large pile of arbitrary implementation decisions not driven by a coherent rationale.

For example,

Adding a dependency on a node that's already been returned by get_ready() is a no-op. (It doesn't matter whether or not that node was subsequently returned with done().)

seems not only arbitrary, but wrong-headed. Like "open the garage door" was already handed out but not yet done. Later "drive to the store" is added, depending on "open the garage door". If it was added before "open the garage door" was handed out, presumably the implementation would respect that we don't want to drive the car through the garage door. But, nope, if we happened to add it after, we'll merrily crash through the door because, above, it's arbitrarily defined "a no-op".

I don't want to "fix that", either - topsort is a very widely known and well-understood algorithm, and extending it to dynamically mutating graphs holds no appeal for me. If you want some kind of real-time dynamic task management system, build one directly? Raw dependency is just one thing such systems need. They also need, e.g., facilities to specify priorities, associated costs, assignment of tasks to work units, notions of deadlines, possible time delay after a task completes before a given dependent can start, critical path analysis and reports, on and on.

Yup, they need an internal topsort too, but that's just a tiny part of what they need. Sure, you could graft all that on top of a class named TopologicalSorter - and we could graft a tax preparation system on top of math.fsum() with enough new 100% backward-compatible optional arguments :wink:.

BTW, the closest thing now to your "iterators" appears to be the new-ish "view objects" returned by dict.keys() and friends.

larryhastings commented 2 years ago

I am indeed building a simple dynamic task management system, so I have a use case for this. I can just implement it in my own copy, and I probably will anyway. But it seemed like it might be useful for other people. So I proposed modifying the API and implementation as you see above. Comedy jokes aside, I don't think this is analogous to grafting a tax preparation system on top of math.fsum(); this is still a topological sorter, and existing users of it won't have to change a line of code to continue using it as they have done. The difference is, now the object permits mutating the graph after we've started traversing it.

Also, you're right, I got the semantics wrong when I wrote them out above, sorry about that. Adding a dependency on a node that has been marked "done" is a no-op. Adding a dependency to a node returned by get_ready() behaves the same as adding a dependency on a node when initially building the graph.

vdwees commented 2 years ago

@larryhastings I also made one of these to replace airflow, we needed fast, event-driven (asyncio) scheduling of tasks in a DAG. Perhaps this is something we can collaborate on?

larryhastings commented 2 years ago

(A quick note before we get started. What graphlib calls get_ready, I just call ready in my version. But I also alias get_ready to ready for compatibility reasons. Anyway I mention it in case I wrote ready somewhere below--assume I meant get_ready. I'll try to get it right though.)

I've coded mine up. It's too early to propose it as a PR, so I've simply attached it in a zip file. (GitHub won't let me attach the .py files directly to the issue.) It seems to work fine, though I might have missed something. But, not only did I write my own tests, I also reuse the existing graphlib test suite, and it passes everything. (Though I had to elide two tests that don't make sense anymore given the API changes.)

My implementation is here: larryhastings.graph.rewrite.zip

First, I've indeed named my "iterator" object a "view"--that's definitely the right API metaphor. I appreciate the suggestion.

Having now re-thought the API from the perspective of separating the "view" from the "graph", the overall API metaphor became clear:

You can add nodes and edges to the graph at any time. Adding nodes and edges always succeeds.
You can now also remove nodes from graph g with the new method g.remove(node). Again, you can do this at any time. Removing a node from the graph always succeeds, assuming the node is in the graph.
You can iterate over the graph at any time using a "view". View objects implement the get_ready, done, and __bool__ methods. There's a default view built in to the graph object, which passes its get_ready and done and __bool__ methods through to the default view. You can also create a new view at any time by calling the view method.
If you attempt to use a view while the graph has a cycle, it raises a CycleError.
If you're using a view to iterate over the graph, and you modify the graph, and the view now represents a state that isn't coherent with the graph, attempting to use that view raises a RuntimeError. (I don't have a strong opinion about what exception it should raise; I picked that one because that's what a dict iterator raises if you modify the dict during iteration.) More on what I mean by "coherence" in a minute.

In addition to the following new features:

prepare is now optional, though it still performs a cycle check.
You can modify the graph at any time, even while iterating over it.

This approach also fixes some minor warts with the existing API:

static_order and get_ready/done are mutually exclusive. If you ever call get_ready on a graph, you can never call static_order, and vice-versa.
You can only iterate over the graph once, or call static_order once.

Happily, the second wart would be easily fixed in the existing TopologicalSort by adding a simple reset method. This would reset the state of the iterator, back to how it was immediately after calling prepare. If I don't get this change accepted, I'd be happy to contribute this feature separately. And yes I've already added this reset method to my view object.

The second wart is harder to fix; that requires splitting out the "view" logic into its own separate object, in much the same way I did here. If I don't get this change accepted, I'd be happy to contribute that change too, if we want it. And yes, in my code, static_order creates its own view, consumes it, and discards it. This means static_order doesn't interfere with the built-in view, which I consider a bugfix.

So what does it mean for a view to no longer be coherent with the graph? Let's say we have a graph g. We add an edge:

g.add('B', 'A')

'B' now depends on 'A' in the graph.

Next let's consider a view v. What if v has already been iterating over the graph, and in view v, B has already been returned by get_ready, but A hasn't yet been returned by get_ready? That doesn't make sense. B can't have become ready before A. So view v is no longer coherent, and any attempt to interact with v raises an exception.

To state it more precisely: if view v is a view on graph g, and you call g.add('B', 'A'), and neither of these statements is true in view v:

'A' has been marked as done.
'B' has not yet been yielded by get_ready.

then v is no longer "coherent".

(If 'A' has been marked as done, then it's okay to make 'B' dependent on 'A' regardless of what state 'B' is in. Likewise, if 'B' hasn't been yielded by get_ready yet, then it's okay to make 'B' dependent on 'A' regardless of what state 'A' is in.)

Note that you can restore a view to coherence. In this case, removing either A or B from g would resolve the incoherence between v and g, and v would start working again.

Also note that you can have multiple views, in various states of iteration, and by modifying the graph you may cause some to become incoherent but not others. The views are completely independent of each other.

p.s. maybe there's a better word than "coherent"? I don't normally see it used in this way, with something being described as "coherent to" or "coherent with" something else.

vdwees commented 2 years ago

I'm just a Labrador retriever on the internet, my opinions are not necessarily representative.

easily fixed in the existing TopologicalSorter by adding a simple reset method. This would reset the state of the iterator, back to how it was immediately after calling prepare

I had not encountered any other "one-time-use-only" classes in the standard library before, and was surprised when I realized the API did not support this. If you need a reusable graph data structure, the workaround seems to be storing the graph externally to TopologicalSorter and pass it in as the constructor's optional graph arg every time you want to iterate. In a sense, the TopologicalSorter already is the View/Iterator of the graph (albeit a static one), and the real graph data structure is actually just some dict. (Maybe there is a case for adding a generic stateless graph data structure to graphlib?)

...in my code, static_order creates its own view, consumes it, and discards it. This means static_order doesn't interfere with the built-in view, which I consider a bugfix.

Agree that the existing behavior could be considered a bug. Certainly I found it rather confusing/surprising anyway. The only way I was able to get TopologicalSorter working for me was by reading the source code (luckily highly readable).

The coherence concept is pretty interesting. Personally I don't have a use case for it yet, my use case is more along the lines of simply adding new downstream nodes, or replacing a downstream node with a subgraph of nodes.

larryhastings commented 2 years ago

I don't have a "use" per se for views becoming incoherent either. But it's a situation that could arise, and the library needs a sane way to deal with it.