Nomad fails to clean up reserved cores

hxt365 commented 2 months ago

Nomad version

Output from nomad version

Nomad v1.8.1
BuildDate 2024-06-19T06:43:57Z
Revision 5022543e4b7b8dcec9df123f86630ae3fdcffbe6

Operating system and Environment details

NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"

Issue

I run hundreds of jobs a day on a machine and always set resources.cores for my jobs. Occasionally I get the error below:

2024-08-14 08:59:15.919991105 +0800 +08: Received - Task received by client
2024-08-14 08:59:15.923095086 +0800 +08: Setup Failure - failed to setup alloc: pre-run hook "cpuparts_hook" failed: write /sys/fs/cgroup/nomad.slice/reserve.slice/cpuset.cpus: device or resource busy

Once this happens, all jobs on the machine fail to run including Docker and raw_exec jobs. This issue persists until I manually remove /sys/fs/cgroup/nomad.slice/reserve.slice/. I suspect that Nomad fails to clean up reserved cores under some unexpected failure circumstances. I tried removing resources.cores config and it's been working just fine.

Reproduction steps

Not sure how to reproduce this.

Expected Result

No error when scheduling jobs.

Actual Result

2024-08-14 08:59:15.919991105 +0800 +08: Received - Task received by client
2024-08-14 08:59:15.923095086 +0800 +08: Setup Failure - failed to setup alloc: pre-run hook "cpuparts_hook" failed: write /sys/fs/cgroup/nomad.slice/reserve.slice/cpuset.cpus: device or resource busy

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

jrasell commented 2 months ago

Hi @hxt365 and thanks for raising this issue.

Tracing through the code, this error occurs after the Nomad client has identified free cores to allocate to the task, when it tries to write the selection to the file on the host at "/sys/fs/cgroup/nomad.slice/reserve.slice/cpuset.cpus". The error as detailed shows the file is already in use, which suggests contention on your host is preventing Nomad from writing to it.

Once this error occurs, does it ever recover? If this happens again, it would be useful to get a list of processes that have the file open via lsof.
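If lsof is not available on the host, a rough equivalent is to scan the file descriptor links under /proc. The following is a minimal sketch, not part of Nomad; the function name `pids_with_file_open` and its `proc_root` parameter are hypothetical, and it needs to run as root to see descriptors owned by other users.

```python
import os


def pids_with_file_open(target, proc_root="/proc"):
    """Return the sorted PIDs that hold an open file descriptor on `target`."""
    target = os.path.realpath(target)
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we may not inspect it
        for fd in fds:
            try:
                link = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed while we were scanning
            if link == target:
                pids.append(int(entry))
                break
    return sorted(pids)
```

Calling `pids_with_file_open("/sys/fs/cgroup/nomad.slice/reserve.slice/cpuset.cpus")` right after the failure would return the list of PIDs to report here; an empty list would suggest no process is holding the file.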

hxt365 commented 2 months ago

Thanks for your quick reply @jrasell! The error never recovers on its own, and we don't have any other processes that reserve cores. Even if there are processes that reserve cores, I suppose Nomad should find available cores instead or evaluation should get blocked?

hxt365 commented 2 months ago

I remember that a couple of times I restarted Nomad and then got this issue.

jrasell commented 2 months ago

Even if there are processes that reserve cores, I suppose Nomad should find available cores instead or evaluation should get blocked?

The problem doesn't seem to be that Nomad can't find available cores; it's that it is unable to write the configuration update to the file. It seems like something is holding the file after a reboot which might be causing the issue. If you do encounter this again, could you please get a list of processes that have the file open and let me know how you performed the reboot or the process that caused the error?

hgminh95 commented 2 months ago

It seems like something is holding the file after a reboot which might be causing the issue

I've encountered this on my side too. I don't think it is because the file is held by another process. What happens is something like the following, iirc:

Nomad starts a job and reserves CPUs, creating a subfolder inside reserve.slice/ with its own cpuset.cpus, which is a subset of reserve.slice/cpuset.cpus.
For whatever reason, Nomad fails to clean up that reservation (not sure why, but I imagine it is hard or impossible to do this reliably every time) and no longer keeps track of that subfolder.
When Nomad tries to change the root cpuset, it cannot, because the root cpuset must cover all of its child cpusets. The error message "device or resource busy" is a bit misleading here.

From the cpuset man page:

       EBUSY  Attempted to remove, using [rmdir](https://manpages.ubuntu.com/manpages/focal/en/man2/rmdir.2.html)(2), a cpuset with attached processes.

       EBUSY  Attempted to remove, using [rmdir](https://manpages.ubuntu.com/manpages/focal/en/man2/rmdir.2.html)(2), a cpuset with child cpusets.

       EBUSY  Attempted  to  remove a CPU or memory node from a cpuset that is also in a child of
              that cpuset.
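If that explanation holds, the leftover child should be visible by comparing each child's cpuset.cpus with the set Nomad is trying to write to the parent. Here is a small read-only sketch of that check; the helper names `parse_cpu_list` and `blocking_children` are hypothetical and this is not how Nomad itself does it.

```python
import os


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0-3,8" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means no CPUs are assigned
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def blocking_children(parent_dir, new_parent_cpus):
    """Map each child cgroup name to the CPUs it holds that the new parent set would drop."""
    wanted = set(new_parent_cpus)
    blockers = {}
    for name in sorted(os.listdir(parent_dir)):
        cpus_file = os.path.join(parent_dir, name, "cpuset.cpus")
        if not os.path.isfile(cpus_file):
            continue  # plain files in the parent, or children without a cpuset
        with open(cpus_file) as f:
            dropped = parse_cpu_list(f.read()) - wanted
        if dropped:
            blockers[name] = dropped
    return blockers
```

Running `blocking_children("/sys/fs/cgroup/nomad.slice/reserve.slice", new_cpus)` with the CPU set Nomad was about to write would list any child whose reservation falls outside it, which under this theory is the child that triggers EBUSY.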

hgminh95 commented 2 months ago

I think maybe the solution here is for Nomad, when it restarts, to remove all the subfolders inside nomad.slice/reserve.slice that it does not know about?

It will not help if Nomad gets into an inconsistent cpuset state while running, but I don't know how likely that is or which scenarios could lead to it.
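To make the idea concrete, here is a rough sketch of such a cleanup pass. It is only an illustration of the proposal, not Nomad's code: `find_orphan_cgroups`, `remove_orphans`, and the `known_names` set (the child cgroups the client still tracks) are all hypothetical, and it assumes a child with an empty cgroup.procs is safe to remove.

```python
import os


def find_orphan_cgroups(reserve_dir, known_names):
    """Return child cgroup dirs that are not in known_names and contain no processes."""
    orphans = []
    for name in sorted(os.listdir(reserve_dir)):
        path = os.path.join(reserve_dir, name)
        if not os.path.isdir(path) or name in known_names:
            continue
        try:
            with open(os.path.join(path, "cgroup.procs")) as f:
                has_procs = bool(f.read().strip())
        except OSError:
            continue  # cannot tell whether it is in use, so leave it alone
        if not has_procs:
            orphans.append(path)
    return orphans


def remove_orphans(paths, remove=os.rmdir):
    """Try to remove each path; return the ones that were actually removed."""
    removed = []
    for path in paths:
        try:
            remove(path)  # on cgroupfs, rmdir deletes an empty cgroup
        except OSError:
            continue  # still busy or already gone; skip it
        removed.append(path)
    return removed
```

Keeping the discovery and removal steps separate means the list of orphans can be logged or reviewed before anything is deleted, and a child that still has processes attached is never touched.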

hxt365 commented 3 weeks ago

Hi. Are there any updates on this, @jrasell?

hxt365 commented 3 weeks ago

Btw the tag should be theme/platform-linux instead
