mpass99 opened 1 month ago
Create Reproduction Steps for this behavior
```json
{
  "ID": "1",
  "Name": "1",
  "Type": "batch",
  "TaskGroups": [
    {
      "Name": "default-group",
      "Count": 1,
      "RestartPolicy": {
        "Attempts": 3,
        "Interval": 3600000000000,
        "Delay": 15000000000,
        "Mode": "fail",
        "RenderTemplates": false
      },
      "ReschedulePolicy": {
        "Attempts": 3,
        "Interval": 21600000000000,
        "Delay": 60000000000,
        "DelayFunction": "exponential",
        "MaxDelay": 240000000000,
        "Unlimited": false
      },
      "Tasks": [
        {
          "Name": "default-task",
          "Driver": "docker",
          "Config": {
            "command": "sleep",
            "force_pull": true,
            "image": "openhpi/co_execenv_python:3.8",
            "network_mode": "none",
            "args": [
              "infinity"
            ]
          }
        }
      ]
    }
  ]
}
```
We conducted 5 repetitions, each with 5 jobs running while restarting the agents.
Interestingly, the recreation counts neither as a Restart nor as a Rescheduling, although the events state that the Allocation is being migrated.
| Repetition | Running Jobs after Restart |
|---|---|
| 1 | 5/5 |
| 2 | 4/5 |
| 3 | 5/5 |
| 4 | 0/5 |
| 5 | 5/5 |
> Where are the Job-JobDeregistered events?

@MrSerth I would ask these questions in an upstream issue?!

> Why are some Jobs completely removed while others are still listed as complete/dead?

Because Poseidon purges runner jobs that are being stopped. This seems not only superfluous, but it likely aggravates the scenario described above: when a job might be recreated, we purge it in the middle of the recreation process.
We have to specify the observed behavior: it differs depending on how recently the job has been created.
In case we (stop and) create the job each time right before the restart, the recreation mostly succeeds:
| Repetition | Running Jobs after Restart |
|---|---|
| 1 | 5/5 |
| 2 | 5/5 |
| 3 | 5/5 |
| 4 | 4/5 |
If we restart the agents multiple times for the same job, the recreation fails:
| Repetition | Running Jobs after first Restart (timestamp) | after second Restart (timestamp) | Time waited |
|---|---|---|---|
| 1 | 5/5 | 0/5 | 10 minutes |
| 2 | 5/5 (1725533193) | 0/5 (1725533276) | 8 minutes |
| 3 | 5/5 (1725533749) | 0/5 (1725533794) | 2 minutes |
| 4 | 5/5 (1725543688) | 0/5 (1725544945) | 30 minutes |
The behavior happens only with the drain-on-shutdown configuration:

```hcl
leave_on_interrupt = true
leave_on_terminate = true

client {
  drain_on_shutdown {
    deadline = "15s"
  }
}
```
Thanks for investigating here. I am currently a bit unsure how to interpret these results. `drain_on_shutdown` has an influence, but I currently don't understand why it affects only the second restart and why it makes the situation worse (and not better); shouldn't it help to avoid losing jobs?
Good open questions, I will forward them.
> Why is it important how recently the job has been deployed?
That's down to my wording. How recently the job has been deployed does not actually seem to have an influence. Instead, the number of agent restarts seems to be what matters.
Since I don't have any answers to your three questions, you may proceed to ask them upstream.
See hashicorp/nomad#23937
We are currently blocked by the upstream issue and are waiting for a response.
In #612 we noticed that on a simultaneous restart of all Nomad agents, some Jobs are completely removed and disappear, others are dead but still listed, and some are being restarted.
Why are some Jobs completely removed while others are still listed as complete/dead? Where are the Job-JobDeregistered events?