openHPI / poseidon

Scalable task execution orchestrator for CodeOcean
MIT License

Nomad issues after changing its config #107

Closed: mpass99 closed this issue 2 years ago

mpass99 commented 2 years ago

Expected Behavior

Nomad jobs start nearly instantly, and memory oversubscription is enabled when it is set in the config.

Current Behavior

curl http://localhost:4646/v1/operator/scheduler/configuration
{"SchedulerConfig":{"SchedulerAlgorithm":"binpack","PreemptionConfig":{"SystemSchedulerEnabled":false,"SysBatchSchedulerEnabled":false,"BatchSchedulerEnabled":false,"ServiceSchedulerEnabled":false},"MemoryOversubscriptionEnabled":true,"RejectJobRegistration":false,"CreateIndex":5,"ModifyIndex":5},"Index":5,"LastContact":0,"KnownLeader":true,"NextToken":""}

Possible Solution

(At this point Nomad still takes 4.6 minutes to start an allocation)

How can this be reproduced?

Now you can check the scheduler configuration (via curl) and see that it has not been updated. Nomad also behaves incorrectly (a very long pending state, or sometimes a short pending state but no notification via the event stream, ...).

Context (Environment)

mpass99 commented 2 years ago

When automating the bug reproduction (with playbook.yml, repair.yml, and break.yml), we tried different configurations. We noticed that changes to server.bootstrap are picked up on a systemd restart, while changes to server.default_scheduler_config.scheduler_algorithm and server.default_scheduler_config.memory_oversubscription_enabled are not (the Nomad data directory has to be deleted for them to take effect).

The automated reproduction does not cover the long scheduling delay.
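For illustration, a minimal shell sketch of this kind of check; it assumes a systemd-managed Nomad agent on localhost:4646 without TLS and /opt/nomad/data as the data directory (both are assumptions, adjust to your setup):

# Adjust server.default_scheduler_config in the Nomad HCL, then restart the agent.
sudo systemctl restart nomad

# Read back the effective scheduler configuration. On an already bootstrapped
# cluster, the default_scheduler_config changes will not show up here.
curl -s http://localhost:4646/v1/operator/scheduler/configuration

# Only after wiping the data directory (destructive!) and starting fresh does
# Nomad apply the new default_scheduler_config values.
sudo systemctl stop nomad
sudo rm -rf /opt/nomad/data   # assumed data_dir, check your nomad.hcl
sudo systemctl start nomad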

mpass99 commented 2 years ago

@MrSerth How should we deal with it? Accept it and empty the Nomad Data folder once?

MrSerth commented 2 years ago

Thanks for having a detailed look at this issue! I still have a few minor questions about your results:

"Nomad behaves just like it has not read the config"

You configured that through the default_scheduler_config stanza, right? According to this documentation, a change in this stanza is not supported for a cluster that is already bootstrapped. Instead, the authors recommend using the API to update the configuration (which you confirmed to work fine, right?). Hence, I would think that we should enable it by making that API request.
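For reference, a sketch of such an update request against a local, non-TLS development agent (the production command with TLS certificates is shown further down in this thread):

curl -X POST -d '{"MemoryOversubscriptionEnabled": true}' http://localhost:4646/v1/operator/scheduler/configuration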

Runners need about 4.6 minutes to start (not only the first runner of each client)

Is this also true if we just enable the memory oversubscription / change the scheduling algorithm through the API? Or only when we tamper with the HCL that should not be edited?

mpass99 commented 2 years ago

Awesome that you found the warning in the documentation! This solves the problem (including the scheduling problem). In the production deployment, we then have to activate the oversubscription via the API.

MrSerth commented 2 years ago

Okay, great! Let's schedule that for tomorrow ;)

MrSerth commented 2 years ago

We enabled the memory oversubscription on our servers with the following request:

curl --cert cli.crt --key cli-key.pem --cacert ca.crt -X POST -d '{"SchedulerAlgorithm": "spread", "MemoryOversubscriptionEnabled": true}' https://localhost:4646/v1/operator/scheduler/configuration
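
To confirm that the change took effect, the same endpoint can be read back; a sketch assuming the same TLS material as above:

curl --cert cli.crt --key cli-key.pem --cacert ca.crt https://localhost:4646/v1/operator/scheduler/configuration

The response should now report "SchedulerAlgorithm":"spread" and "MemoryOversubscriptionEnabled":true.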