openHPI / poseidon

Scalable task execution orchestrator for CodeOcean
MIT License

Nomad Agent Simultaneous Restart Behavior #673

Open mpass99 opened 1 month ago

mpass99 commented 1 month ago

In #612 we noticed that on a simultaneous restart of all Nomad agents, some Jobs are completely removed and disappear, others are dead but still listed, and some are being restarted.

mpass99 commented 1 month ago

Create Reproduction Steps for this behavior

Minimal Job Configuration

```json
{
  "ID": "1",
  "Name": "1",
  "Type": "batch",
  "TaskGroups": [
    {
      "Name": "default-group",
      "Count": 1,
      "RestartPolicy": {
        "Attempts": 3,
        "Interval": 3600000000000,
        "Delay": 15000000000,
        "Mode": "fail",
        "RenderTemplates": false
      },
      "ReschedulePolicy": {
        "Attempts": 3,
        "Interval": 21600000000000,
        "Delay": 60000000000,
        "DelayFunction": "exponential",
        "MaxDelay": 240000000000,
        "Unlimited": false
      },
      "Tasks": [
        {
          "Name": "default-task",
          "Driver": "docker",
          "Config": {
            "command": "sleep",
            "force_pull": true,
            "image": "openhpi/co_execenv_python:3.8",
            "network_mode": "none",
            "args": [
              "infinity"
            ]
          }
        }
      ]
    }
  ]
}
```
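Note that Nomad's JSON API expresses all durations in nanoseconds. A quick sanity check (a sketch, not part of the reproduction) that the policy values above mean what we intend:

```python
# Nomad API durations are nanoseconds; convert the job's policy values
# to seconds to confirm the intended configuration.
NS_PER_SECOND = 1_000_000_000

restart_interval_s = 3600000000000 / NS_PER_SECOND      # 3600 s = 1 hour
restart_delay_s = 15000000000 / NS_PER_SECOND           # 15 s
reschedule_interval_s = 21600000000000 / NS_PER_SECOND  # 21600 s = 6 hours
reschedule_max_delay_s = 240000000000 / NS_PER_SECOND   # 240 s = 4 minutes
```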

We conducted 5 repetitions with each having 5 jobs running while restarting the agents.
Interestingly, the recreation counts neither as a Restart nor as a Rescheduling, although the events state that the Allocation is being migrated.

| Repetition | Running Jobs after Restart |
| --- | --- |
| 1 | 5/5 |
| 2 | 4/5 |
| 3 | 5/5 |
| 4 | 0/5 |
| 5 | 5/5 |
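Aggregating the table above, most but not all jobs survived the simultaneous restart:

```python
# Counts taken from the repetition table above: running jobs out of 5
# after restarting all agents, for each of the 5 repetitions.
running_after_restart = [5, 4, 5, 0, 5]

total_jobs = 5 * len(running_after_restart)  # 25 jobs overall
survived = sum(running_after_restart)        # 19 jobs still running
print(f"{survived}/{total_jobs} jobs survived the restart")
```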

Questions

@MrSerth shall I ask these questions in an upstream issue?!

Why are some Jobs completely removed while others are still listed as complete/dead?

Because Poseidon purges runner jobs that are being stopped. This seems not only superfluous but also likely aggravates the scenario described above: when a job is about to be recreated, we purge it in the middle of the recreation process.
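One way to mitigate this would be to skip the purge while a job's allocation is still being migrated. The following is a hypothetical sketch of such a guard, not Poseidon's actual code; the dictionary field names are simplified assumptions rather than the real Nomad API shape:

```python
# Hypothetical guard (illustration only): purge a job only when it is
# dead AND none of its allocations is currently being migrated, so a
# job mid-recreation is not deleted.
def should_purge(job: dict) -> bool:
    """Return True if the job is safe to purge."""
    if job.get("Status") != "dead":
        return False
    # Assumed, simplified allocation shape: a list of dicts with a
    # "DesiredDescription" field describing migration state.
    return not any(
        "migrated" in alloc.get("DesiredDescription", "")
        for alloc in job.get("Allocations", [])
    )
```

Usage: `should_purge({"Status": "dead", "Allocations": []})` would allow the purge, while a job whose allocation description mentions a migration would be left alone until the recreation settles.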

mpass99 commented 1 month ago

We need to specify the observed behavior more precisely. It differs depending on how recently the job was created.

If we (stop and) create the job anew before each restart, the recreation mostly succeeds:

| Repetition | Running Jobs after Restart |
| --- | --- |
| 1 | 5/5 |
| 2 | 5/5 |
| 3 | 5/5 |
| 4 | 4/5 |

If we restart the agents multiple times for the same job, the recreation fails after the second restart:

| Repetition | Running Jobs after first Restart (timestamp) | after second Restart (timestamp) | Time waited |
| --- | --- | --- | --- |
| 1 | 5/5 | 0/5 | 10 minutes |
| 2 | 5/5 (1725533193) | 0/5 (1725533276) | 8 minutes |
| 3 | 5/5 (1725533749) | 0/5 (1725533794) | 2 minutes |
| 4 | 5/5 (1725543688) | 0/5 (1725544945) | 30 minutes |
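The timestamps in the table are Unix epoch seconds. For readability, a small helper (illustration only) converts them to UTC wall-clock times, using repetition 2 as an example:

```python
from datetime import datetime, timezone

def to_utc(ts: int) -> str:
    """Format a Unix timestamp (seconds) as a UTC wall-clock time."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# Repetition 2 from the table above:
first_restart = to_utc(1725533193)   # 2024-09-05 10:46:33 UTC
second_restart = to_utc(1725533276)  # 83 seconds after the first restart
```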

This behavior occurs only with the drain-on-shutdown configuration:

```hcl
leave_on_interrupt = true
leave_on_terminate = true

client {
  drain_on_shutdown {
    deadline = "15s"
  }
}
```
MrSerth commented 1 month ago

Thanks for investigating here. I am currently a bit unsure how to interpret these results.

mpass99 commented 1 month ago

> - why it affects the second restart only
> - why it makes the situation worse (and not better); shouldn't this help to avoid losing jobs?

Good open questions, I will forward them.

> Why is it important how recently the job has been deployed?

That's due to my wording. How recently the job has been deployed does not actually seem to matter; instead, the number of agent restarts appears to be the decisive factor.

Since I don't have any answers to your three questions, you may proceed to ask them upstream.

See hashicorp/nomad#23937

MrSerth commented 2 weeks ago

We are currently blocked by the upstream issue and are waiting for a response.