openHPI / poseidon

Scalable task execution orchestrator for CodeOcean
MIT License

Nomad issues after changing its config #107

Closed: mpass99 closed this issue 2 years ago

mpass99 commented 2 years ago

Expected Behavior

Nomad jobs start nearly instantly, and memory oversubscription is enabled when it is set in the config.

Current Behavior

curl http://localhost:4646/v1/operator/scheduler/configuration
{"SchedulerConfig":{"SchedulerAlgorithm":"binpack","PreemptionConfig":{"SystemSchedulerEnabled":false,"SysBatchSchedulerEnabled":false,"BatchSchedulerEnabled":false,"ServiceSchedulerEnabled":false},"MemoryOversubscriptionEnabled":true,"RejectJobRegistration":false,"CreateIndex":5,"ModifyIndex":5},"Index":5,"LastContact":0,"KnownLeader":true,"NextToken":""}

Possible Solution

(At this point Nomad still takes 4.6 minutes to start an allocation)

How can this be reproduced?

Now you can check the scheduler configuration (via curl) and see that it has not been updated. Nomad also behaves incorrectly (a very long pending state, or sometimes a short pending state but no notification via the event stream, ...).

Context (Environment)

mpass99 commented 2 years ago

When automating the bug reproduction (with playbook.yml, repair.yml, and break.yml), we tried different configurations. We noticed that changes to server.bootstrap are picked up on a systemd restart, while changes to server.default_scheduler_config.scheduler_algorithm and server.default_scheduler_config.memory_oversubscription_enabled are not (the Nomad data directory has to be deleted for them to take effect).

The automated reproduction does not cover the long scheduling delay.
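For illustration, a minimal shell sketch of this kind of check; it assumes a systemd-managed Nomad agent on localhost:4646 without TLS and /opt/nomad/data as the data directory (both are assumptions, adjust to your setup):

# Adjust server.default_scheduler_config in the Nomad HCL, then restart the agent.
sudo systemctl restart nomad

# Read back the effective scheduler configuration. On an already bootstrapped
# cluster, the default_scheduler_config changes will not show up here.
curl -s http://localhost:4646/v1/operator/scheduler/configuration

# Only after wiping the data directory (destructive!) and starting fresh does
# Nomad apply the new default_scheduler_config values.
sudo systemctl stop nomad
sudo rm -rf /opt/nomad/data   # assumed data_dir, check your nomad.hcl
sudo systemctl start nomad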

mpass99 commented 2 years ago

@MrSerth How should we deal with it? Accept it and empty the Nomad Data folder once?

MrSerth commented 2 years ago

Thanks for having a detailed look at this issue! I still have a few minor questions about your results:

"Nomad behaves just like it has not read the config"

You configured that through the default_scheduler_config stanza, right? According to this documentation, a change in this stanza is not supported for a cluster that is already bootstrapped. Instead, the authors recommend using the API to update the configuration (which you confirmed to work fine, right?). Hence, I would think that we should enable it by making that API request.
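For reference, a sketch of such an update request against a local, non-TLS development agent (the production command with TLS certificates is shown further down in this thread):

curl -X POST -d '{"MemoryOversubscriptionEnabled": true}' http://localhost:4646/v1/operator/scheduler/configuration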

Runners need about 4.6 minutes to start (not only the first runner of each client)

Is this also true if we just enable the memory oversubscription / change the scheduling algorithm through the API? Or only when we tamper with the HCL that should not be edited?

mpass99 commented 2 years ago

Awesome that you found the warning in the documentation! This solves the problem (including the scheduling problem). In the production deployment, we then have to activate the oversubscription via the API.

MrSerth commented 2 years ago

Okay, great! Let's schedule that for tomorrow ;)

MrSerth commented 2 years ago

We enabled the memory oversubscription on our servers with the following request:

curl --cert cli.crt --key cli-key.pem --cacert ca.crt -X POST -d '{"SchedulerAlgorithm": "spread", "MemoryOversubscriptionEnabled": true}' https://localhost:4646/v1/operator/scheduler/configuration
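
To confirm that the change took effect, the same endpoint can be read back; a sketch assuming the same TLS material as above:

curl --cert cli.crt --key cli-key.pem --cacert ca.crt https://localhost:4646/v1/operator/scheduler/configuration

The response should now report "SchedulerAlgorithm":"spread" and "MemoryOversubscriptionEnabled":true.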