lachlan2k closed 1 year ago
Reasons that an agent could be sad, and our general approach:
Agent states:
Actions that should be taken upon agent states:
When the API program starts, but before the API starts accepting connections:
The rationale for the above is that if the API process is stopped and restarted quickly, we want to resume as if nothing ever happened. However, if the API is taken offline for an extended period, then everything is dead, and we'll want to mark it as such.
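As a rough illustration of that rationale, here is a minimal sketch of a startup check. It assumes each agent's last heartbeat time was persisted before the API went down. The grace period value, the state names `healthy` and `dead`, and the function name `startup_agent_states` are all hypothetical and not taken from the actual implementation.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical grace period: how long the API may be offline before
# agents are presumed dead. The value here is only a placeholder.
GRACE_PERIOD = timedelta(seconds=60)


def startup_agent_states(last_heartbeats, now, grace=GRACE_PERIOD):
    """Decide each agent's state before the API accepts connections.

    last_heartbeats maps agent ID -> time of the last persisted heartbeat.
    Agents seen within the grace period carry on as if nothing happened;
    the rest are marked dead. State names are illustrative assumptions.
    """
    states = {}
    for agent_id, seen_at in last_heartbeats.items():
        states[agent_id] = "healthy" if now - seen_at <= grace else "dead"
    return states


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    heartbeats = {
        "agent-1": now - timedelta(seconds=5),  # quick restart
        "agent-2": now - timedelta(hours=2),    # extended outage
    }
    print(startup_agent_states(heartbeats, now))
```

Comparing against persisted heartbeat times, rather than the API's own shutdown time, means the check still works if the API crashed without recording when it stopped.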
When a job is sent to the agent:
When an agent sends a heartbeat:
When an agent connects:
State reconciliation loop - runs periodically, or can be externally triggered on an event that might be important:
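One way a loop like this could be structured is sketched below: it runs a pass on a fixed interval, and an external caller can wake it early. The class name `ReconcileLoop`, its methods, and the `reconcile` callback are assumptions for illustration only; what a reconciliation pass actually does is left to the callback.

```python
import threading


class ReconcileLoop:
    """Runs reconcile() every `interval` seconds, or sooner when triggered."""

    def __init__(self, reconcile, interval):
        self._reconcile = reconcile
        self._interval = interval
        self._wake = threading.Event()
        self._stop = threading.Event()

    def trigger(self):
        # Called from elsewhere on an important event to run a pass right away.
        self._wake.set()

    def stop(self):
        self._stop.set()
        self._wake.set()  # unblock the wait so the loop exits promptly

    def run(self):
        while not self._stop.is_set():
            self._reconcile()
            # Sleep until the interval elapses or trigger()/stop() wakes us.
            self._wake.wait(timeout=self._interval)
            self._wake.clear()
```

For example, `threading.Thread(target=loop.run, daemon=True).start()` would run the loop in the background, and a connection or heartbeat handler could call `loop.trigger()` when something notable happens.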
Closing as sufficient for now. Things like re-scheduling can be future work.
Currently, there is:
However, the following problems currently exist:
We need to avoid a split-brain problem, so I suggest something like the following:
Nice to haves:
`.restore` files, given that cracking order is somewhat non-deterministic. Can we replicate this? So if a job dies 12 hours in, we can resume it on another node from the last known point, instead of starting from scratch.