Closed (hchauvin closed this issue 6 years ago)
OK, so I now think this is a non-problem.
My goal in dissociating the run from the invocation of reflow was to remove what I considered a single point of failure: too bad if, after two days of computation, your job dies because the controller died! However, the job itself is another single point of failure. So the best way to solve this might be to run two reflow instances on two different machines, so that two jobs are spun off on two different allocations. I don't know the details of how one job "wins" over the other if both succeed, but it seems to work.
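For the record, here is a minimal sketch of that redundant setup, assuming both machines share the same reflow configuration and cache; the hostnames, program name, and log paths are made up:

```sh
# Launch the exact same run independently on two machines. Each invocation
# gets its own alloc and job; the shared cache is presumably what lets the
# results of whichever run finishes first be picked up.
ssh machine-a 'nohup reflow run myworkflow.rf > run-a.log 2>&1 < /dev/null &'
ssh machine-b 'nohup reflow run myworkflow.rf > run-b.log 2>&1 < /dev/null &'
```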
Thank you for open sourcing this.
I have been playing with reflow a bit, and I want to give it a go. I understand that, to keep things simple, the reflow command-line tool is only an interpreter running in the foreground. I am sure it is possible with what you have open sourced, but I have trouble figuring out how to keep long-running execs alive, partly because I am not clear on what allocs, runs, and execs exactly are and how they are persisted across sessions.
I have jobs that might take two days to run. To allow that, the first thing I did with what you open sourced was to run reflow as a systemd unit (see the sketch below). This way, reflow is kept alive unless 1) there is an error in the workflow (if I understand correctly, there is no "--keep_going" option as of today), 2) reflow itself panics, or 3) there are not enough resources for the reflow tool to run properly. The problem I have is that when reflow restarts, the ongoing execs are replaced with new execs and collected after a while.
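For reference, a minimal sketch of the kind of unit I use; the unit name, paths, and user are hypothetical:

```sh
# Install a unit that restarts `reflow run` whenever it dies.
sudo tee /etc/systemd/system/reflow-myworkflow.service <<'EOF' >/dev/null
[Unit]
Description=Long-running reflow workflow

[Service]
ExecStart=/usr/local/bin/reflow run /opt/workflows/myworkflow.rf
Restart=on-failure
User=reflow

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now reflow-myworkflow.service
```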
Given the system in place, I'm sure a very simple workaround exists. It must revolve around what allocs, runs, and execs exactly are and how they are persisted across sessions. However, the solution eludes me. I have tried, to no avail: 1) reusing the allocation with "reflow run --alloc"; 2) hacking a --run-id argument into "reflow run" to reuse the run ID.
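Concretely, the two attempts looked roughly like this; the identifiers are placeholders, and --run-id is my local hack, not an upstream flag:

```sh
# Attempt 1: point a fresh invocation at the alloc that survived the
# restart, assuming --alloc takes the alloc identifier reported by the
# previous session.
ALLOC_ID="<alloc identifier from the previous session>"
reflow run --alloc "$ALLOC_ID" myworkflow.rf

# Attempt 2: with a locally patched reflow, pass the previous run ID back
# in so the restarted run adopts the still-live execs.
RUN_ID="<run ID from the previous session>"
reflow run --run-id "$RUN_ID" myworkflow.rf
```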
Thank you again for this work.