trustyou / dagger

Dagger is a Python micro-framework for running tasks with parallel execution and dependency resolution.
MIT License

Support resuming failed execution of a task graph #6

Open sportsracer opened 7 years ago

sportsracer commented 7 years ago

Sometimes, running many tasks takes a long time. If the graph fails when it's almost done, you currently need to rerun everything.

Solution: When execution of a graph fails, serialize the state and data of the task graph. Then, resume execution from that point. Note: You should be able to change the code of tasks between failure and retry, since bugs are often the cause of task failure.

This needs to be well thought through wrt multiprocessing and sharing of data.
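One way the resume flow could look, as a minimal Python sketch (not dagger's actual API — `run_with_checkpoint`, `resume`, and the checkpoint filename are all hypothetical): on failure, pickle the results of the tasks that completed; on the next run, skip any task whose result is already in the checkpoint. Only the *data* is serialized, not the task code, so task implementations can be fixed between failure and retry.

```python
import pickle

CHECKPOINT = "dagger_checkpoint.pkl"  # hypothetical checkpoint location

def run_with_checkpoint(tasks, results=None):
    """Run tasks in dependency order, checkpointing completed results.

    `tasks` is an ordered list of (name, function) pairs; each function
    receives the dict of results computed so far.
    """
    results = dict(results or {})
    try:
        for name, func in tasks:
            if name in results:  # already done in a previous run
                continue
            results[name] = func(results)
    except Exception:
        # Persist only the task results, not the task code, so the
        # implementation can change between failure and retry.
        with open(CHECKPOINT, "wb") as f:
            pickle.dump(results, f)
        raise
    return results

def resume(tasks):
    """Re-run the graph, skipping tasks completed before the failure."""
    with open(CHECKPOINT, "rb") as f:
        return run_with_checkpoint(tasks, pickle.load(f))
```

With multiprocessing this gets harder, as noted: the checkpoint writes would need to be coordinated across workers rather than done in a single `except` block.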

nicfix commented 7 years ago

I propose a Docker-image-like approach: versioned state plus step identification.

After dependency calculation, we arrive at a certain execution order of the tasks. Let's consider it to be the following, identifying every step with this notation:

While running this sequence, I could store the ids (or hashes of them) of all executed steps. Beyond that, I could also store the internal state of the graph at every step by serializing it with pickle and keying it by the id.
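A sketch of how those ids and per-step states could be produced (hypothetical names, not dagger's API): chaining each step's id to its parent's, the way Docker does per layer, means that changing any earlier task automatically invalidates all later ids.

```python
import hashlib
import pickle

def step_id(parent_id: str, task_name: str) -> str:
    """Chain ids Docker-style: each step's id depends on its parent's,
    so changing any earlier task invalidates every later id."""
    return hashlib.sha256((parent_id + ":" + task_name).encode()).hexdigest()

def record_run(order):
    """Given task names in execution order, produce the ordered history
    of (step_id, pickled_state) pairs. The state dict here is a
    placeholder for the real pickled graph."""
    history = []
    parent = ""
    state = {}
    for name in order:
        parent = step_id(parent, name)
        state = dict(state, **{name: "done"})  # stand-in for real results
        history.append((parent, pickle.dumps(state)))
    return history
```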

So let's assume that task C is badly implemented.

  1. We run the tree.
  2. At every successfully run task, I store "somewhere", in an ordered list, the id mentioned above together with the related state (a pickle of the tree), and I persist this data incrementally.
  3. When the run fails, the last successful state and data have already been stored (versioned).
  4. I fix my code.
  5. I load the previously stored data.
  6. I run the script again, but at every task I check whether its id is in the "done" list; if so, I skip the "calculation" part of the run, until I reach the first "undone" task, from which point I continue running normally.

OBS: to always replay the same chain, the "calculated" order of tasks has to be deterministic, or forced in some way using the stored history.
OBS2: this is a Docker-like approach: Docker images generate a hash for every build step, so storing successive builds only requires the binary delta from the last unmodified step.
OBS3: obviously, when I find an "undone" task, I create a new history by removing all the steps that come after the first undone one in the old history (loaded in memory).
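Putting steps 1–6 and the OBS notes together, the skip-and-resume loop could look like this (a hedged sketch; `resume_run` and the data shapes are assumptions, not dagger's API). Once the first undone step is reached, everything after it is re-run and the stale tail of the old history is dropped, as in OBS3:

```python
import pickle

def resume_run(tasks, old_history):
    """Replay a run against a stored history.

    `tasks`: ordered (step_id, func) pairs; each func maps state -> state.
    `old_history`: ordered (step_id, pickled_state) pairs from the last run.
    Returns the final state and the new history (stale tail dropped).
    """
    done = dict(old_history)
    new_history = []
    state = {}
    skipping = True
    for sid, func in tasks:
        if skipping and sid in done:
            state = pickle.loads(done[sid])        # restore, skip calculation
            new_history.append((sid, done[sid]))
        else:
            skipping = False                        # first undone: run from here
            state = func(state)
            new_history.append((sid, pickle.dumps(state)))
    return state, new_history
```

Note that once `skipping` is cleared, later steps are re-executed even if their old ids appear in `done` — that is what discards results downstream of the fixed task.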

Pros: it allows restarting from any point, and data is stored incrementally.
Cons: it consumes a lot of space for versioning the graph.