trustyou / dagger

Dagger is a Python micro-framework for running tasks with parallel execution and dependency resolution.
MIT License

Support resuming failed execution of a task graph #6

Open sportsracer opened 7 years ago

sportsracer commented 7 years ago

Sometimes, running many tasks takes a long time. If the graph fails when it's almost done, you currently need to rerun everything.

Solution: When execution of a graph fails, serialize the state and data of the task graph. Then, resume execution from that point. Note: You should be able to change the code of tasks between failure and retry, since bugs are often the cause of task failure.

This needs to be well thought through wrt multiprocessing and sharing of data.
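One way the resume flow could look, as a minimal Python sketch (not dagger's actual API — `run_with_checkpoint`, `resume`, and the checkpoint filename are all hypothetical): on failure, pickle the results of the tasks that completed; on the next run, skip any task whose result is already in the checkpoint. Only the *data* is serialized, not the task code, so task implementations can be fixed between failure and retry.

```python
import pickle

CHECKPOINT = "dagger_checkpoint.pkl"  # hypothetical checkpoint location

def run_with_checkpoint(tasks, results=None):
    """Run tasks in dependency order, checkpointing completed results.

    `tasks` is an ordered list of (name, function) pairs; each function
    receives the dict of results computed so far.
    """
    results = dict(results or {})
    try:
        for name, func in tasks:
            if name in results:  # already done in a previous run
                continue
            results[name] = func(results)
    except Exception:
        # Persist only the task results, not the task code, so the
        # implementation can change between failure and retry.
        with open(CHECKPOINT, "wb") as f:
            pickle.dump(results, f)
        raise
    return results

def resume(tasks):
    """Re-run the graph, skipping tasks completed before the failure."""
    with open(CHECKPOINT, "rb") as f:
        return run_with_checkpoint(tasks, pickle.load(f))
```

With multiprocessing this gets harder, as noted: the checkpoint writes would need to be coordinated across workers rather than done in a single `except` block.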

nicfix commented 7 years ago

I propose a Docker-image-like approach: versioned state plus step identification.

After dependency calculation, we arrive at a certain execution order of the tasks. Let's consider it to be the following, identifying every step with this notation:

While running this sequence, I could store the ids (or hashes of them) of all executed steps. Beyond that, I could also store the internal state of the graph at every step by serializing it with pickle and keying it by the id.
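A sketch of how those ids and per-step states could be produced (hypothetical names, not dagger's API): chaining each step's id to its parent's, the way Docker does per layer, means that changing any earlier task automatically invalidates all later ids.

```python
import hashlib
import pickle

def step_id(parent_id: str, task_name: str) -> str:
    """Chain ids Docker-style: each step's id depends on its parent's,
    so changing any earlier task invalidates every later id."""
    return hashlib.sha256((parent_id + ":" + task_name).encode()).hexdigest()

def record_run(order):
    """Given task names in execution order, produce the ordered history
    of (step_id, pickled_state) pairs. The state dict here is a
    placeholder for the real pickled graph."""
    history = []
    parent = ""
    state = {}
    for name in order:
        parent = step_id(parent, name)
        state = dict(state, **{name: "done"})  # stand-in for real results
        history.append((parent, pickle.dumps(state)))
    return history
```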

So let's assume that task C is badly implemented.

  1. We run the tree.
  2. At every successfully run task, I store "somewhere", in an ordered list, the id mentioned above together with the related state (a pickle of the tree), and I persist this data incrementally.
  3. When the run fails, the last successful state and data have already been stored (versioned).
  4. I fix my code.
  5. I load the previously stored data.
  6. I run the script again, but at every task I check whether its id is in the "done" list; if so, I skip the "calculation" part of the run, until I reach the first "undone" task, from which point I continue running normally.

OBS: to always replay the same chain, the "calculated" order of tasks has to be deterministic, or forced in some way using the stored history.
OBS2: this is a Docker-like approach: Docker images generate a hash for every build step, so storing successive builds only requires the binary delta from the last unmodified step.
OBS3: obviously, when I find an "undone" task, I create a new history by removing all the steps that come after the first undone one in the old history (loaded in memory).
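Putting steps 1–6 and the OBS notes together, the skip-and-resume loop could look like this (a hedged sketch; `resume_run` and the data shapes are assumptions, not dagger's API). Once the first undone step is reached, everything after it is re-run and the stale tail of the old history is dropped, as in OBS3:

```python
import pickle

def resume_run(tasks, old_history):
    """Replay a run against a stored history.

    `tasks`: ordered (step_id, func) pairs; each func maps state -> state.
    `old_history`: ordered (step_id, pickled_state) pairs from the last run.
    Returns the final state and the new history (stale tail dropped).
    """
    done = dict(old_history)
    new_history = []
    state = {}
    skipping = True
    for sid, func in tasks:
        if skipping and sid in done:
            state = pickle.loads(done[sid])        # restore, skip calculation
            new_history.append((sid, done[sid]))
        else:
            skipping = False                        # first undone: run from here
            state = func(state)
            new_history.append((sid, pickle.dumps(state)))
    return state, new_history
```

Note that once `skipping` is cleared, later steps are re-executed even if their old ids appear in `done` — that is what discards results downstream of the fixed task.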

Pros: it allows restarting from any point, and data is stored incrementally.
Cons: it consumes a lot of space for versioning the graph.