openHPI / poseidon

Scalable task execution orchestrator for CodeOcean
MIT License

Nomad Agent Simultaneous Restart Behavior #673

Open mpass99 opened 1 month ago

mpass99 commented 1 month ago

In #612 we noticed that on a simultaneous restart of all Nomad agents, some Jobs are completely removed and disappear, others are dead but still listed, and some are being restarted.

mpass99 commented 1 month ago

Create Reproduction Steps for this behavior

Minimal Job Configuration

```json
{
  "ID": "1",
  "Name": "1",
  "Type": "batch",
  "TaskGroups": [
    {
      "Name": "default-group",
      "Count": 1,
      "RestartPolicy": {
        "Attempts": 3,
        "Interval": 3600000000000,
        "Delay": 15000000000,
        "Mode": "fail",
        "RenderTemplates": false
      },
      "ReschedulePolicy": {
        "Attempts": 3,
        "Interval": 21600000000000,
        "Delay": 60000000000,
        "DelayFunction": "exponential",
        "MaxDelay": 240000000000,
        "Unlimited": false
      },
      "Tasks": [
        {
          "Name": "default-task",
          "Driver": "docker",
          "Config": {
            "command": "sleep",
            "force_pull": true,
            "image": "openhpi/co_execenv_python:3.8",
            "network_mode": "none",
            "args": [
              "infinity"
            ]
          }
        }
      ]
    }
  ]
}
```
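Note that Nomad's JSON API expresses all durations in nanoseconds. A quick sanity check (a sketch, not part of the reproduction) that the policy values above mean what we intend:

```python
# Nomad API durations are nanoseconds; convert the job's policy values
# to seconds to confirm the intended configuration.
NS_PER_SECOND = 1_000_000_000

restart_interval_s = 3600000000000 / NS_PER_SECOND      # 3600 s = 1 hour
restart_delay_s = 15000000000 / NS_PER_SECOND           # 15 s
reschedule_interval_s = 21600000000000 / NS_PER_SECOND  # 21600 s = 6 hours
reschedule_max_delay_s = 240000000000 / NS_PER_SECOND   # 240 s = 4 minutes
```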

We conducted 5 repetitions with each having 5 jobs running while restarting the agents.
Interestingly, the recreation counts neither as a Restart nor as a Rescheduling, although the events state that the Allocation is being migrated.

| Repetition | Running Jobs after Restart |
| --- | --- |
| 1 | 5/5 |
| 2 | 4/5 |
| 3 | 5/5 |
| 4 | 0/5 |
| 5 | 5/5 |
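Aggregating the table above, most but not all jobs survived the simultaneous restart:

```python
# Counts taken from the repetition table above: running jobs out of 5
# after restarting all agents, for each of the 5 repetitions.
running_after_restart = [5, 4, 5, 0, 5]

total_jobs = 5 * len(running_after_restart)  # 25 jobs overall
survived = sum(running_after_restart)        # 19 jobs still running
print(f"{survived}/{total_jobs} jobs survived the restart")
```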

Questions

@MrSerth shall I ask these questions in an upstream issue?!

Why are some Jobs completely removed while others are still listed as complete/dead?

Because Poseidon purges runner jobs that are being stopped. This seems not only superfluous but also likely aggravates the scenario described above: when a job is about to be recreated, we purge it in the middle of the recreation process.
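One way to mitigate this would be to skip the purge while a job's allocation is still being migrated. The following is a hypothetical sketch of such a guard, not Poseidon's actual code; the dictionary field names are simplified assumptions rather than the real Nomad API shape:

```python
# Hypothetical guard (illustration only): purge a job only when it is
# dead AND none of its allocations is currently being migrated, so a
# job mid-recreation is not deleted.
def should_purge(job: dict) -> bool:
    """Return True if the job is safe to purge."""
    if job.get("Status") != "dead":
        return False
    # Assumed, simplified allocation shape: a list of dicts with a
    # "DesiredDescription" field describing migration state.
    return not any(
        "migrated" in alloc.get("DesiredDescription", "")
        for alloc in job.get("Allocations", [])
    )
```

Usage: `should_purge({"Status": "dead", "Allocations": []})` would allow the purge, while a job whose allocation description mentions a migration would be left alone until the recreation settles.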

mpass99 commented 1 month ago

We need to specify the observed behavior more precisely. It differs depending on how recently the job was created.

If we (stop and) create the job anew before each restart, the recreation mostly succeeds:

| Repetition | Running Jobs after Restart |
| --- | --- |
| 1 | 5/5 |
| 2 | 5/5 |
| 3 | 5/5 |
| 4 | 4/5 |

If we restart the agents multiple times for the same job, the recreation fails after the second restart:

| Repetition | Running Jobs after first Restart (timestamp) | after second Restart (timestamp) | Time waited |
| --- | --- | --- | --- |
| 1 | 5/5 | 0/5 | 10 minutes |
| 2 | 5/5 (1725533193) | 0/5 (1725533276) | 8 minutes |
| 3 | 5/5 (1725533749) | 0/5 (1725533794) | 2 minutes |
| 4 | 5/5 (1725543688) | 0/5 (1725544945) | 30 minutes |
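The timestamps in the table are Unix epoch seconds. For readability, a small helper (illustration only) converts them to UTC wall-clock times, using repetition 2 as an example:

```python
from datetime import datetime, timezone

def to_utc(ts: int) -> str:
    """Format a Unix timestamp (seconds) as a UTC wall-clock time."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# Repetition 2 from the table above:
first_restart = to_utc(1725533193)   # 2024-09-05 10:46:33 UTC
second_restart = to_utc(1725533276)  # 83 seconds after the first restart
```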

This behavior occurs only with the drain-on-shutdown configuration:

```hcl
leave_on_interrupt = true
leave_on_terminate = true

client {
  drain_on_shutdown {
    deadline = "15s"
  }
}
```
MrSerth commented 1 month ago

Thanks for investigating here. I am currently a bit unsure how to interpret these results.

mpass99 commented 1 month ago

> - why it affects the second restart only
> - why it makes the situation worse (and not better); shouldn't this help to avoid losing jobs?

Good open questions, I will forward them.

> Why is it important how recently the job has been deployed?

That's due to my wording. How recently the job has been deployed does not actually seem to matter; instead, the number of agent restarts appears to be the decisive factor.

Since I don't have any answers to your three questions, you may proceed to ask them upstream.

See hashicorp/nomad#23937

MrSerth commented 2 weeks ago

We are currently blocked by the upstream issue and are waiting for a response.