At BNL, they often run long simulations. Given that the supervisor/agent protocol is stable, I think we could have agents reconnect after a release instead of killing them. All agents could be asked to stop, but only agents without running jobs would stop right away. A busy agent would keep a reminder to stop once it has quiesced, so that agents get the latest software when they are ready. This would of course be really useful for NERSC, and it is basically why we designed the agent protocol the way we did (websocket in vs ssh out).
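A minimal sketch of the "stop when quiesced" idea, in Python. The class and method names here (`Agent`, `request_stop`, `finish_job`) are hypothetical and not taken from the existing code. It also assumes that an agent with a pending stop refuses new jobs, which is one possible policy and not something decided above.

```python
class Agent:
    """Hypothetical agent that defers a stop request until it has no jobs."""

    def __init__(self):
        self.running_jobs = set()
        self.stop_pending = False  # the "reminder" to stop once quiesced
        self.stopped = False

    def start_job(self, job_id):
        # Assumption: a draining agent takes no new work, so it can quiesce.
        if self.stop_pending or self.stopped:
            raise RuntimeError("agent is draining; not accepting new jobs")
        self.running_jobs.add(job_id)

    def finish_job(self, job_id):
        self.running_jobs.discard(job_id)
        self._maybe_stop()

    def request_stop(self):
        # Sent to all agents after a release; only idle agents stop right away.
        self.stop_pending = True
        self._maybe_stop()

    def _maybe_stop(self):
        if self.stop_pending and not self.running_jobs:
            self.stopped = True
```

With this shape, an idle agent stops as soon as `request_stop()` is called, while an agent running a long simulation stops only after its last `finish_job()`.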
From the supervisor's perspective, there would be more management, because it's a tricky problem to know whether an agent is stuck or busy (halting problem). Since we can't decide that in general, we would need a keep-alive protocol so the supervisor can tell that an agent is busy and not stuck.
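As a rough illustration of the supervisor side of a keep-alive, assuming each agent sends a periodic heartbeat over its existing connection. `KeepAliveMonitor`, the 30 second default, and the status strings are all made up for this sketch. A heartbeat only shows that the agent process is responsive, so a real protocol might also need to carry job progress.

```python
import time


class KeepAliveMonitor:
    """Hypothetical supervisor-side tracker of agent heartbeats."""

    def __init__(self, timeout_secs=30.0, clock=time.monotonic):
        self.timeout_secs = timeout_secs
        self.clock = clock  # injectable so the timeout can be tested
        self.last_seen = {}

    def heartbeat(self, agent_id):
        # Called whenever a keep-alive message arrives from an agent.
        self.last_seen[agent_id] = self.clock()

    def status(self, agent_id):
        seen = self.last_seen.get(agent_id)
        if seen is None:
            return "unknown"
        if self.clock() - seen > self.timeout_secs:
            return "stuck"  # no heartbeat within the timeout
        return "alive"
```

The supervisor would treat a "stuck" agent as a candidate for being killed, and leave an "alive" agent to finish its jobs even if a stop is pending.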
Note that an agent configuration change (e.g. adding more memory) has to be detected, and the affected agents restarted once they are done with what they are doing.
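One way the configuration change could be detected is to fingerprint the configuration an agent was started with and compare it to the current one. This is only a sketch: `config_fingerprint`, `needs_restart`, and the `memory_gb` key are hypothetical, and it assumes the configuration is JSON-serializable.

```python
import hashlib
import json


def config_fingerprint(config):
    """Return a stable hash of a JSON-serializable agent configuration."""
    # sort_keys makes the hash independent of key order.
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_restart(agent_fingerprint, current_config):
    """True if the agent was started with a different configuration."""
    return agent_fingerprint != config_fingerprint(current_config)
```

An agent flagged by `needs_restart` would get the same deferred stop as after a release, so it restarts only after its jobs finish.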
The agent would need work, too.