Closed (hchauvin closed this issue 6 years ago)
OK, so I now think this is a non-problem.
My goal in dissociating the run from the invocation of reflow was to remove what I considered a single point of failure: too bad if, after two days of computation, your job dies because the controller died! However, the job itself is another single point of failure. So the best way to solve this might be to run two reflow instances on two different machines, so that two jobs are spun off on two different allocations. I don't know the details of how one job "wins" over the other if both succeed, but it seems to work.
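For the record, here is a minimal sketch of that redundant setup, assuming both machines share the same reflow configuration and cache; the hostnames, program name, and log paths are made up:

```sh
# Launch the exact same run independently on two machines. Each invocation
# gets its own alloc and job; the shared cache is presumably what lets the
# results of whichever run finishes first be picked up.
ssh machine-a 'nohup reflow run myworkflow.rf > run-a.log 2>&1 < /dev/null &'
ssh machine-b 'nohup reflow run myworkflow.rf > run-b.log 2>&1 < /dev/null &'
```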
Thank you for open sourcing this.
I have been playing with reflow a bit, and I want to give it a go. I understand that, to keep things simple, the reflow command-line tool is only an interpreter running in the foreground. I am sure it is possible with what you have open sourced, but I have trouble figuring out how to keep long-running execs alive, partly because I am not clear on what allocs, runs, and execs exactly are and how they are persisted across sessions.
I have jobs that might take two days to run. To allow that, the first thing I did with what you open sourced was to run reflow as a systemd unit (see the sketch below). This way, reflow is kept alive unless 1) there is an error in the workflow (if I understand correctly, there is no "--keep_going" option as of today), 2) reflow itself panics, or 3) there are not enough resources for the reflow tool to run properly. The problem I have is that when reflow restarts, the ongoing execs are replaced with new execs and collected after a while.
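For reference, a minimal sketch of the kind of unit I use; the unit name, paths, and user are hypothetical:

```sh
# Install a unit that restarts `reflow run` whenever it dies.
sudo tee /etc/systemd/system/reflow-myworkflow.service <<'EOF' >/dev/null
[Unit]
Description=Long-running reflow workflow

[Service]
ExecStart=/usr/local/bin/reflow run /opt/workflows/myworkflow.rf
Restart=on-failure
User=reflow

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now reflow-myworkflow.service
```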
Given the system in place, I'm sure a very simple workaround exists. It must revolve around what allocs, runs, and execs exactly are and how they are persisted across sessions. However, the solution eludes me. I have tried, to no avail: 1) reusing the allocation with "reflow run --alloc"; 2) hacking a --run-id argument into "reflow run" to reuse the run ID.
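Concretely, the two attempts looked roughly like this; the identifiers are placeholders, and --run-id is my local hack, not an upstream flag:

```sh
# Attempt 1: point a fresh invocation at the alloc that survived the
# restart, assuming --alloc takes the alloc identifier reported by the
# previous session.
ALLOC_ID="<alloc identifier from the previous session>"
reflow run --alloc "$ALLOC_ID" myworkflow.rf

# Attempt 2: with a locally patched reflow, pass the previous run ID back
# in so the restarted run adopts the still-live execs.
RUN_ID="<run ID from the previous session>"
reflow run --run-id "$RUN_ID" myworkflow.rf
```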
Thank you again for this work.