Open hxt365 opened 2 months ago
Hi @hxt365 and thanks for raising this issue.
Tracing the code through, this error occurs after the Nomad client has identified free cores to allocate to the task, when it tries to write the selection to the file on the host at "/sys/fs/cgroup/nomad.slice/reserve.slice/cpuset.cpus". The error as detailed shows the file is already in use, which potentially indicates contention on your host that prevents Nomad from writing to it.
Once this error occurs, does it ever recover? If this happens again, it would be useful to get a list of processes that have the file open via lsof.
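For anyone who wants to script the check requested above, here is a minimal sketch of what an lsof-style lookup could look like. It assumes a Linux host where `/proc/<pid>/fd` is readable (typically as root); the function name `pids_holding` and the `proc_root` parameter are hypothetical and not part of Nomad.

```python
import os

def pids_holding(path, proc_root="/proc"):
    """Return sorted PIDs whose open file descriptors point at `path`."""
    target = os.path.realpath(path)
    holders = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        for fd in fds:
            try:
                # Each fd entry is a symlink to the file the process has open.
                if os.readlink(os.path.join(fd_dir, fd)) == target:
                    holders.add(int(entry))
            except OSError:
                continue  # descriptor was closed while we were scanning
    return sorted(holders)

if __name__ == "__main__":
    print(pids_holding("/sys/fs/cgroup/nomad.slice/reserve.slice/cpuset.cpus"))
```

An empty list would suggest that no process is holding the file open, which would point away from file contention as the cause.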
Thanks for your quick reply @jrasell! The error never recovers on its own, and we don't have any other processes that reserve cores. Even if there are processes that reserve cores, I suppose Nomad should find available cores instead or evaluation should get blocked?
I remember that a couple of times I restarted Nomad and then got this issue.
> Even if there are processes that reserve cores, I suppose Nomad should find available cores instead or evaluation should get blocked?
The problem doesn't seem to be that Nomad can't find available cores; it's that it is unable to write the configuration update to the file. It seems like something is holding the file after a reboot which might be causing the issue. If you do encounter this again, could you please get a list of processes that have the file open and let me know how you performed the reboot or the process that caused the error?
> It seems like something is holding the file after a reboot which might be causing the issue
I encountered this on my side too. I don't think it is because the file is held by another process. What happens is like below, IIRC:
reserve.slice/ with another cpuset.cpu, which is a subset of reserve.slice/cpuset.cpu
From cpuset manpage
EBUSY Attempted to remove, using [rmdir](https://manpages.ubuntu.com/manpages/focal/en/man2/rmdir.2.html)(2), a cpuset with attached processes.
EBUSY Attempted to remove, using [rmdir](https://manpages.ubuntu.com/manpages/focal/en/man2/rmdir.2.html)(2), a cpuset with child cpusets.
EBUSY Attempted to remove a CPU or memory node from a cpuset that is also in a child of that cpuset.
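The last EBUSY case can be checked before writing to the parent. Below is a rough sketch, not Nomad's actual logic: it parses the kernel's CPU list format (for example "0-3,8") and reports child cgroups that still hold CPUs a proposed parent value would drop. The helper names are hypothetical, and the `cpuset.cpus` file name follows the path mentioned earlier in this thread.

```python
import os

def parse_cpu_list(text):
    """Parse the kernel's CPU list format, e.g. "0-3,8", into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means no CPUs are assigned
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def conflicting_children(parent_dir, proposed):
    """Map child cgroup name -> CPUs it holds that `proposed` would drop."""
    wanted = parse_cpu_list(proposed)
    conflicts = {}
    for name in sorted(os.listdir(parent_dir)):
        cpus_file = os.path.join(parent_dir, name, "cpuset.cpus")
        if not os.path.isfile(cpus_file):
            continue  # not a child cgroup directory
        with open(cpus_file) as f:
            held = parse_cpu_list(f.read())
        extra = held - wanted
        if extra:
            conflicts[name] = sorted(extra)
    return conflicts

if __name__ == "__main__":
    parent = "/sys/fs/cgroup/nomad.slice/reserve.slice"
    print(conflicting_children(parent, "0-1"))
```

A non-empty result would be consistent with the scenario described here: the write to the parent is rejected because a child still references a CPU that is being removed.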
I think maybe the solution here is to remove all the subfolders inside nomad.slice/reserve.slice that Nomad does not know of when it restarts?
It will not help if Nomad gets into an inconsistent cpuset state while running, but I don't know how likely that is or which scenario could lead to it.
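To illustrate the cleanup idea, here is a sketch that only reports candidates and deletes nothing. It assumes cgroup v2, where each cgroup directory exposes a `cgroup.procs` file listing attached PIDs; `unknown_children` and `known_names` are hypothetical names, and how Nomad would obtain the set of cgroups it knows about is not specified in this thread.

```python
import os

def unknown_children(parent_dir, known_names):
    """Return (name, has_processes) for child cgroups not in `known_names`.

    has_processes is None when cgroup.procs cannot be read.
    """
    result = []
    for name in sorted(os.listdir(parent_dir)):
        child = os.path.join(parent_dir, name)
        if not os.path.isdir(child) or name in known_names:
            continue  # skip control files and cgroups the caller tracks
        try:
            with open(os.path.join(child, "cgroup.procs")) as f:
                has_procs = bool(f.read().strip())
        except OSError:
            has_procs = None
        result.append((name, has_procs))
    return result

if __name__ == "__main__":
    parent = "/sys/fs/cgroup/nomad.slice/reserve.slice"
    # An empty known set treats every child as unknown; adjust as needed.
    for name, has_procs in unknown_children(parent, set()):
        print(name, "has processes" if has_procs else "no processes found")
```

Per the manpage excerpt above, removing a cpuset with attached processes fails with EBUSY, so only children reported without processes would be candidates for `rmdir`.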
Hi. Do we have any new updates on this, @jrasell?
Btw the tag should be theme/platform-linux instead.
Nomad version
Output from nomad version
Operating system and Environment details
Issue
I run hundreds of jobs a day on a machine and always set resources.cores for my jobs. Occasionally I get the issue below.
Once this happens, all jobs on the machine fail to run, including Docker and raw_exec jobs. This issue persists until I manually remove /sys/fs/cgroup/nomad.slice/reserve.slice/. I suspect that Nomad fails to clean up reserved cores under some unexpected failure circumstances. I tried removing the resources.cores config and it's been working just fine.
Reproduction steps
Not sure how to reproduce this.
Expected Result
No error when scheduling jobs.
Actual Result
Job file (if appropriate)
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)