radiasoft / sirepo

Sirepo is a framework for scientific cloud computing. Try it out!
https://sirepo.com
Apache License 2.0

Reconnect agents instead of killing them? #3402

Open robnagler opened 3 years ago

robnagler commented 3 years ago

At BNL, they often run long simulations. Given that the supervisor/agent protocol is stable, I think we could have the agents reconnect after a release instead of killing them. All agents could be asked to stop, but only agents that don't have jobs would stop. A busy agent would keep a reminder to stop once it has quiesced, so that agents get the latest software when they are ready. This of course would be really useful for NERSC, and this is basically why we designed the agent protocol the way we did (websocket in vs ssh out).
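
A minimal sketch of the "reminder to stop" idea, to make the intended behavior concrete. The `Agent` class and its method names are hypothetical and are not taken from the Sirepo code base. The sketch assumes the agent tracks its running jobs in memory.

```python
class Agent:
    """Hypothetical agent that defers a stop request until it is idle."""

    def __init__(self):
        self.running_jobs = set()
        self.stop_pending = False  # the "reminder to stop"
        self.stopped = False

    def start_job(self, job_id):
        if self.stopped:
            raise RuntimeError("agent has already stopped")
        self.running_jobs.add(job_id)

    def finish_job(self, job_id):
        self.running_jobs.discard(job_id)
        # The agent may have just quiesced, so honor any pending stop.
        self._maybe_stop()

    def request_stop(self):
        # Sent by the supervisor to every agent after a release.
        self.stop_pending = True
        self._maybe_stop()

    def _maybe_stop(self):
        # Only stop when a stop was requested and no jobs remain.
        if self.stop_pending and not self.running_jobs:
            self.stopped = True
```

An idle agent stops as soon as `request_stop` is called. A busy one keeps running and stops when its last job finishes, at which point a fresh agent would start with the new software.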

From the supervisor's perspective, there would be more management, because it's a tricky problem to know if an agent is stuck vs busy (halting problem). We would need a keep-alive protocol to ensure the agent is busy and not stuck.
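
One way the supervisor side of a keep-alive could look, as a sketch only. `KeepAliveTracker`, its methods, and the timeout value are assumptions made for illustration, not existing Sirepo names. Note that a keep-alive only shows the agent process is responsive; it does not prove the job itself is making progress.

```python
import time


class KeepAliveTracker:
    """Hypothetical supervisor-side record of the last keep-alive per agent."""

    def __init__(self, timeout_secs, clock=time.monotonic):
        self.timeout_secs = timeout_secs
        self._clock = clock  # injectable so the logic can be tested
        self._last_seen = {}

    def record(self, agent_id):
        # Call whenever a keep-alive message arrives from an agent.
        self._last_seen[agent_id] = self._clock()

    def stuck_agents(self):
        # Agents whose last keep-alive is older than the timeout.
        now = self._clock()
        return sorted(
            agent_id
            for agent_id, seen in self._last_seen.items()
            if now - seen > self.timeout_secs
        )
```

The supervisor would call `record` on each keep-alive message and poll `stuck_agents` periodically to decide which agents to kill rather than wait for.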

The agent would need work, too.

robnagler commented 2 years ago

Note that if there is an agent configuration change (e.g. adding more memory), this has to be detected and the agents restarted when they are done with what they are doing.
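
Detecting a configuration change could be as simple as comparing a fingerprint of the configuration the agent was started with against the current one. This is a sketch under assumptions: the function names and config keys such as `memory_gb` are hypothetical, and the configuration is assumed to be JSON-serializable.

```python
import hashlib
import json


def config_fingerprint(config):
    """Stable hash of a JSON-serializable config dict (key order ignored)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_restart(agent_fingerprint, current_config):
    # True when the agent was started with a different configuration.
    return agent_fingerprint != config_fingerprint(current_config)


# The agent reports the fingerprint it was started with when it reconnects.
started_with = config_fingerprint({"memory_gb": 8, "cores": 4})
restart = needs_restart(started_with, {"memory_gb": 16, "cores": 4})
```

When `needs_restart` is true, the supervisor would ask the agent to stop once it is idle rather than killing it immediately.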