Failure Condition Handling

lachlan2k commented 1 year ago

Currently, there is:

Rudimentary auto-reconnect in the agent
Agent health checks before assigning jobs

However, the following problems currently exist:

If a disconnect briefly happens and reconnects, the agent will try and send messages to the OLD websocket handle, and those won't get delivered.
There is no handling of the condition when an agent fully dies and jobs die with it.

We need to avoid a split-brain problem, so I suggest something like the following:

If an agent disconnects for more than, say, 5 minutes, then BOTH the agent and API should consider it "dead". The agent should stop all of its currently running jobs, and the API should re-assign these to another node.
If an agent has disconnected for shorter than 5 minutes, it should be considered "unhealthy", and won't have any new jobs assigned, but not "dead", so we give it a chance to recover from its hiccup.
Any disconnects shorter than 5 minutes should be considered "fine", but generate some warnings if possible. If the agent is trying to send messages whilst disconnected, it should internally buffer those, and send them once the websocket re-connects.
If the agent is intentionally killed (due to reboot or whatever), a graceful shutdown should happen instead. The agent should say "lol bye" to the server, which should immediately consider the agent unhealthy/dead.
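
A rough sketch of how the agent-side buffering could work. `SendFunc`, `BufferedConn`, `Attach` and `Detach` are hypothetical names for illustration, not the agent's actual API. Routing every message through one wrapper also avoids writing to a stale websocket handle after a reconnect.

```go
package main

import "sync"

// SendFunc is a hypothetical stand-in for writing one message to the
// currently open websocket.
type SendFunc func(msg []byte) error

// BufferedConn queues outgoing messages while the agent is disconnected and
// flushes them, oldest first, once a new connection is attached.
type BufferedConn struct {
	mu      sync.Mutex
	send    SendFunc // nil while disconnected
	pending [][]byte // queued messages, oldest first
}

// Send queues msg, then delivers as much of the queue as the connection allows.
func (b *BufferedConn) Send(msg []byte) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending = append(b.pending, msg)
	b.flushLocked()
}

// Attach installs the send function of a fresh connection and flushes the queue.
func (b *BufferedConn) Attach(send SendFunc) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.send = send
	b.flushLocked()
}

// Detach marks the connection as gone; later messages are only queued.
func (b *BufferedConn) Detach() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.send = nil
}

// flushLocked must be called with b.mu held.
func (b *BufferedConn) flushLocked() {
	for b.send != nil && len(b.pending) > 0 {
		if err := b.send(b.pending[0]); err != nil {
			b.send = nil // treat a failed write as a disconnect; keep the message
			return
		}
		b.pending = b.pending[1:]
	}
}
```

The queue here is unbounded. Given how chatty hashcat output can be, a real agent would probably want to cap it or drop low-value log lines first.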

Nice to haves:

Session resumption/restore points? We need to investigate how hashcat does it internally with .restore files, given that cracking order is somewhat non-deterministic. Can we replicate this? So if a job dies 12 hours in, we can resume it on another node from the last known point, instead of starting from scratch

lachlan2k commented 1 year ago

Reasons that an agent could be sad, and our general approach:

Connection was momentarily dropped: That's fine, continue as is.
Connection was congested with other messages, i.e. maybe hashcat was spamming log lines and our heartbeat couldn't make it through: That's fine, we can accept some delays in our heartbeat, just not too long.
Agent restarted due to a panic/crash: Ideally, we want the agent to tell us when it panics (probably by having a recover, and have it yolo a crash report over the websocket if it can before it dies).
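
For the panic case, something like the following might be enough. It is only a sketch: `RunWithCrashReport` and the `report` hook are made-up names, and `report` is assumed to make a best-effort write to the websocket.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// RunWithCrashReport runs the agent's main loop. If it panics, the panic
// value and stack trace are handed to report before the panic continues.
// report is a hypothetical hook that would try to send a crash message.
func RunWithCrashReport(run func(), report func(reason string, stack []byte)) {
	defer func() {
		if r := recover(); r != nil {
			report(fmt.Sprint(r), debug.Stack())
			panic(r) // re-panic so the process still dies as it would have
		}
	}()
	run()
}
```

Note that `recover` only catches panics raised in the goroutine where the deferred function runs, so each long-lived goroutine in the agent would need the same wrapper.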

Agent states:

[x] Healthy: Agent is connected, and heartbeats are actively received.
[x] Unhealthy & connected: Connected, but no heartbeat for over 60 seconds. This state is also entered when the agent first connects (awaiting first heartbeat). Could be because collecting data from heartbeat is taking a hot minute, websockets are congested from hashcat spam, etc
[x] Unhealthy & disconnected: Disconnected for less than 60 seconds. Could be due to network failure, agent crashed, etc.
[x] Dead: Agent has been unhealthy for over 60 seconds. I.e. either 120 seconds since last heartbeat, or 60 seconds since last connection.

Actions that should be taken upon agent states:

[x] Healthy: Available for new jobs to be scheduled.
[x] Unhealthy & connected: Do not schedule new jobs to the agent. Hold steady, assume it will come back shortly.
[x] Unhealthy & disconnected: Do not schedule new jobs to the agent. Hold steady, assume it will come back shortly.
[x] Dead: Disconnect the agent. Mark as dead in database. Re-schedule all jobs to other agents (probs not for MVP). If there is still a connection (but no heartbeat), attempt a hail-mary message to tell the agent to die.

When the API program starts, but before the API starts accepting connections:

[x] Mark all non-dead agents as "unhealthy & disconnected" in the database
[x] Perform a state re-conciliation (BEFORE we start the API)

The rationale for the above is that if the API process is stopped and restarted quickly, we want to resume like nothing ever happened. However, if the API is taken offline for an extended period, then everything is dead, and we'll want to mark it as such.

When a job is sent to the agent:

[x] We say to the agent "hey start this job". The job is marked as "pending" in the database, with a timestamp.
[x] We expect that within 5 seconds the agent will reply with an acknowledgement that the job has started. We will then mark the job as started.

When an agent sends a heartbeat:

[x] The heartbeat should contain a list of jobs the agent is running.
[x] Trigger a state re-conciliation

When an agent connects:

[x] Mark as "unhealthy & connected"

State re-conciliation loop - runs periodically, or can be externally triggered on an event that might be important:

[x] Look at all agents. Evaluate their health, per the list of states.
[x] Look at all unhealthy agents. If they have been unhealthy for over 60 seconds, mark them as dead:
- [x] Mark all jobs they were running as failed.
[x] For each healthy agent:
- [x] Compare the jobs the agent says it is running to what the database says the agent should be running.
- [x] If there is a job the agent "should" be running, but isn't, then mark that job as failed.
[x] Look at all jobs in the database marked as "pending":
- [x] Look to see when we requested it to start.
- [x] If we requested to start it > 5 seconds ago, mark it as "failed" because it should have started by now.
[x] For each job we have marked as "failed":
- [x] Mark its state as failed in the database.
- [x] Remove it from agent's list of scheduled jobs.
- [ ] Clone the job and assign it to another agent (not for MVP - instead, we can let users see their jobs failed and they can restart it themselves)

lachlan2k commented 1 year ago

Closing as sufficient for now. Things like re-scheduling can be future work.

lachlan2k / phatcrack

Failure Condition Handling #7