hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

migrate cpuset `reserved` partition when upgrading to 1.7+ #19847

Open drofloh opened 7 months ago

drofloh commented 7 months ago

Nomad version

$ nomad version
Nomad v1.7.3
BuildDate 2024-01-15T16:55:40Z
Revision 60ee328f97d19d2d2d9761251b895b06d82eb1a1

Operating system and Environment details

CentOS 7

Issue

When upgrading clients from 1.6.1 to 1.7.3 we get the error below:

Jan 30, '24 13:35:00 +0000  Setup Failure  failed to setup alloc: pre-run hook "cpuparts_hook" failed: open /sys/fs/cgroup/cpuset/nomad/reserve/cpuset.cpus: no such file or directory

The file referenced in the error does not exist, but it does exist at /sys/fs/cgroup/cpuset/nomad/reserved/cpuset.cpus

Nomad Client logs (if appropriate)

{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2024-01-30T13:35:00.979723Z","alloc_id":"6652d206-ea14-a6be-f0f2-dbf21db54424","failed":false,"msg":"Task received by client","task":"web-server","type":"Received"}
{"@level":"error","@message":"prerun failed","@module":"client.alloc_runner","@timestamp":"2024-01-30T13:35:00.984050Z","alloc_id":"6652d206-ea14-a6be-f0f2-dbf21db54424","error":"pre-run hook \"cpuparts_hook\" failed: open /sys/fs/cgroup/cpuset/nomad/reserve/cpuset.cpus: no such file or directory"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2024-01-30T13:35:00.984099Z","alloc_id":"6652d206-ea14-a6be-f0f2-dbf21db54424","failed":true,"msg":"failed to setup alloc: pre-run hook \"cpuparts_hook\" failed: open /sys/fs/cgroup/cpuset/nomad/reserve/cpuset.cpus: no such file or directory","task":"web-server","type":"Setup Failure"}
{"@level":"error","@message":"postrun failed","@module":"client.alloc_runner","@timestamp":"2024-01-30T13:35:00.991486Z","alloc_id":"6652d206-ea14-a6be-f0f2-dbf21db54424","error":"hook \"cpuparts_hook\" failed: open /sys/fs/cgroup/cpuset/nomad/reserve/cpuset.cpus: no such file or directory"}
{"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2024-01-30T13:35:00.991493Z","alloc_id":"6652d206-ea14-a6be-f0f2-dbf21db54424"}

Reproduction steps

This happened on a client that we updated from 1.6.1 -> 1.7.3; the servers had previously been updated to 1.7.3 with no issues.

Expected Result

Job runs as expected

Actual Result

Job fails to run on clients updated to 1.7.3

Job file (if appropriate)

job "nginx1" {
  namespace = "platforms"
  node_pool = "platforms"
  group "nginx" {
    count = 3
    spread {
      attribute = "${node.unique.name}"
    }
    task "web-server" {
      driver = "docker"
      config {
        image = "nginx:latest"
      }
      resources {
        cpu    = 1000
        memory = 1000
      }
    }
  }
}
drofloh commented 7 months ago

If I create the directory on the host myself and then restart the client, all is fine and I see files appear in the "reserve" directory (a minimal sketch of this workaround follows the listing below):

$ ls -lrt /sys/fs/cgroup/cpuset/nomad/reserve
total 0
-rw-r--r--. 1 root root 0 Jan 30 14:07 tasks
-rw-r--r--. 1 root root 0 Jan 30 14:07 cgroup.procs
-rw-r--r--. 1 root root 0 Jan 30 14:07 notify_on_release
-rw-r--r--. 1 root root 0 Jan 30 14:07 cpuset.sched_relax_domain_level
-rw-r--r--. 1 root root 0 Jan 30 14:07 cpuset.sched_load_balance
-rw-r--r--. 1 root root 0 Jan 30 14:07 cpuset.mems
-rw-r--r--. 1 root root 0 Jan 30 14:07 cpuset.memory_spread_slab
-rw-r--r--. 1 root root 0 Jan 30 14:07 cpuset.memory_spread_page
-r--r--r--. 1 root root 0 Jan 30 14:07 cpuset.memory_pressure
-rw-r--r--. 1 root root 0 Jan 30 14:07 cpuset.memory_migrate
-rw-r--r--. 1 root root 0 Jan 30 14:07 cpuset.mem_hardwall
-rw-r--r--. 1 root root 0 Jan 30 14:07 cpuset.mem_exclusive
-r--r--r--. 1 root root 0 Jan 30 14:07 cpuset.effective_mems
-r--r--r--. 1 root root 0 Jan 30 14:07 cpuset.effective_cpus
-rw-r--r--. 1 root root 0 Jan 30 14:07 cpuset.cpu_exclusive
--w--w--w-. 1 root root 0 Jan 30 14:07 cgroup.event_control
-rw-r--r--. 1 root root 0 Jan 30 14:07 cgroup.clone_children
-rw-r--r--. 1 root root 0 Jan 30 14:10 cpuset.cpus
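
A minimal sketch of that workaround, assuming the client runs as a systemd unit named nomad (the unit name is an assumption, not something stated above):

# Pre-create the cgroup v1 cpuset partition that 1.7.x expects, then restart
# the Nomad client so it repopulates the cpuset files itself.
sudo mkdir -p /sys/fs/cgroup/cpuset/nomad/reserve
sudo systemctl restart nomad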
eduardolmedeiros commented 7 months ago

This might be useful: I'm facing the same issue on 1.7.2 and Rocky 8. I've managed to fix it either by rebooting the host (somehow the folder is created automatically after reboot) or by manually creating the folder /sys/fs/cgroup/cpuset/nomad/reserve.

cesan3 commented 6 months ago

It happens to us as well, migrating from 1.6.1 -> 1.7.3.

On the node still running 1.6.1, the Nomad cpuset subsystem cgroup reservation is created in /sys/fs/cgroup/cpuset/nomad/reserved:

 sudo ls -l /sys/fs/cgroup/cpuset/nomad/reserved/
total 0
-rw-r--r--. 1 root root 0 Feb  6 15:03 cgroup.clone_children
-rw-r--r--. 1 root root 0 Feb  6 15:03 cgroup.procs
-rw-r--r--. 1 root root 0 Feb  6 15:03 cpuset.cpu_exclusive
-rw-r--r--. 1 root root 0 Feb  3 09:13 cpuset.cpus
-r--r--r--. 1 root root 0 Feb  6 15:03 cpuset.effective_cpus
-r--r--r--. 1 root root 0 Feb  6 15:03 cpuset.effective_mems
-rw-r--r--. 1 root root 0 Feb  6 15:03 cpuset.mem_exclusive
-rw-r--r--. 1 root root 0 Feb  6 15:03 cpuset.mem_hardwall
-rw-r--r--. 1 root root 0 Feb  6 15:03 cpuset.memory_migrate
-r--r--r--. 1 root root 0 Feb  6 15:03 cpuset.memory_pressure
-rw-r--r--. 1 root root 0 Feb  6 15:03 cpuset.memory_spread_page
-rw-r--r--. 1 root root 0 Feb  6 15:03 cpuset.memory_spread_slab
-rw-r--r--. 1 root root 0 Feb  3 08:51 cpuset.mems
-rw-r--r--. 1 root root 0 Feb  6 15:03 cpuset.sched_load_balance
-rw-r--r--. 1 root root 0 Feb  6 15:03 cpuset.sched_relax_domain_level
-rw-r--r--. 1 root root 0 Feb  6 15:03 notify_on_release
-rw-r--r--. 1 root root 0 Feb  6 15:03 tasks

When upgrading to 1.7.3, we get the same reported error for the allocations:

Recent Events:
Time                  Type           Description
2024-02-06T13:29:42Z  Setup Failure  failed to setup alloc: pre-run hook "cpuparts_hook" failed: open /sys/fs/cgroup/cpuset/nomad/reserve/cpuset.cpus: no such file or directory
2024-02-06T13:29:42Z  Received       Task received by client

If we restart the nodes, the cpuset subsystem reservation directory is created at the expected /sys/fs/cgroup/cpuset/nomad/reserve/ path and the job deployment then succeeds.

lgfa29 commented 6 months ago

Hi everyone 👋

I'm still trying to reproduce this issue, but in the meantime would you be able to check your Nomad client logs for a message such as failed to create reserve cpuset partition? Nomad should be creating this path automatically on start, so not having that path means something went wrong there.
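
For example, on a systemd-managed host the check might look something like this (the unit name nomad is an assumption):

# Search the client journal for the partition-creation failure mentioned above.
journalctl -u nomad | grep -i "failed to create reserve cpuset partition"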

Thanks!

cesan3 commented 6 months ago

Hi @lgfa29

Hi everyone 👋

I'm still trying to reproduce this issue, but in the meantime would you be able to check your Nomad client logs for a message such as failed to create reserve cpuset partition? Nomad should be creating this path automatically on start, so not having that path means something went wrong there.

Thanks!

I checked the logs after the upgrade and couldn't find failed to create reserve cpuset partition, but I did find these errors right after Nomad started following the upgrade:

2024-02-07T18:47:11.756Z [INFO]  client.fingerprint_mgr.vault: Vault is available: cluster=default
2024-02-07T18:47:11.777Z [INFO]  client.proclib.cg1: initializing nomad cgroups: cores="0,2-7"
2024-02-07T18:47:11.777Z [ERROR] client.proclib.cg1: failed to write cores to nomad cpuset cgroup: error="write /sys/fs/cgroup/cpuset/nomad/cpuset.cpus: device or resource busy"
2024-02-07T18:47:11.777Z [INFO]  client.plugin: starting plugin manager: plugin-type=csi
2024-02-07T18:47:11.777Z [INFO]  client.plugin: starting plugin manager: plugin-type=driver
2024-02-07T18:47:11.778Z [INFO]  client.plugin: starting plugin manager: plugin-type=device
lgfa29 commented 6 months ago

Thanks for the extra info @cesan3!

Yeah, I just noticed that there are several code paths where an error can happen, each with a different error message.

Unfortunately there's not much we can do in this case, as there are multiple reasons why creating those paths may fail. But the agent shouldn't start in a state where it can't run tasks, so I opened #19915 to handle this.

cesan3 commented 6 months ago

So, quick question @lgfa29: do we have another ticket to fix the original problem regarding the migration path from Nomad 1.6.1 -> 1.7.x? Now with the fix, my migration stops earlier, when the Nomad agent starts:

2024-02-13T20:35:56.089Z [INFO]  client.fingerprint_mgr.vault: Vault is available: cluster=default
2024-02-13T20:35:56.100Z [INFO]  client.proclib.cg1: initializing nomad cgroups: cores="0,2-7"
2024-02-13T20:35:56.101Z [ERROR] agent: error starting agent: error="client setup failed: failed to initialize process manager: failed to write cores to nomad cpuset cgroup: write /sys/fs/cgroup/cpuset/nomad/cpuset.cpus: device or resource busy"

Are there any plans to fix the migration?

lgfa29 commented 6 months ago

Could you check which process is keeping that path busy using something like fuser or lsof?

I will reopen this issue until we better understand the problem.
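
For reference, using the path from the error above, that check could look like:

# Show any processes holding the nomad cpuset hierarchy open.
sudo fuser -v /sys/fs/cgroup/cpuset/nomad/cpuset.cpus
sudo lsof +D /sys/fs/cgroup/cpuset/nomad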

cesan3 commented 6 months ago

Hey @lgfa29, lsof doesn't show anything when querying /sys/fs/cgroup/cpuset/nomad/cpuset.cpus.

I presume that the 2 running allocations are keeping it busy?

tmpfs               1024        4      1020   1% /nomad/data/alloc/.../traefik/mnt1
tmpfs               1024        4      1020   1% /nomad/data/alloc/.../traefik/mnt2

Maybe some active cgroup children?

I checked

 mount -t cgroup | cut -f 3 -d ' '
/sys/fs/cgroup/systemd
/sys/fs/cgroup/net_cls,net_prio
/sys/fs/cgroup/cpuset
/sys/fs/cgroup/cpu,cpuacct
/sys/fs/cgroup/perf_event
/sys/fs/cgroup/pids
/sys/fs/cgroup/rdma
/sys/fs/cgroup/blkio
/sys/fs/cgroup/devices
/sys/fs/cgroup/memory
/sys/fs/cgroup/freezer
/sys/fs/cgroup/hugetlb

and

find /sys/fs/cgroup -maxdepth 1 -type l -exec ls {} \;
nomad  system.slice
nomad  system.slice

But the only way of fixing it this time was rebooting the server.
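
One more check that might narrow it down (this is a guess at the cause, not something confirmed here): on cgroup v1, writing a parent's cpuset.cpus can fail with "device or resource busy" if child cpusets still have member tasks or still use the CPUs being removed, so listing non-empty children may help.

# List any child cpusets under the nomad hierarchy that still contain tasks.
for t in /sys/fs/cgroup/cpuset/nomad/*/tasks; do
  [ -s "$t" ] && { echo "busy child: $(dirname "$t")"; cat "$t"; }
done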

drofloh commented 6 months ago

As part of the upgrade from 1.6.1 -> 1.7.3 we now create the /sys/fs/cgroup/cpuset/nomad/reserve directory ahead of the client restart, which resolved the issue on the majority of nodes. However, some then exhibited a similar issue with the /sys/fs/cgroup/cpuset/nomad/share dir not being present, which was /sys/fs/cgroup/cpuset/nomad/shared in 1.6.1 it seems. Creating this dir ahead of the client restart also helps, as does a full system reboot (a sketch of the step we now run is at the end of this comment).

failed to setup alloc: pre-run hook "cpuparts_hook" failed: open /sys/fs/cgroup/cpuset/nomad/share/cpuset.cpus: no such file or directory

Is there a reason these directories seem to have changed from 1.6.1 -> 1.7.3, from reserved and shared to reserve and share?
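
A minimal sketch of the pre-upgrade step described above, run before restarting the upgraded client (Nomad fills in cpuset.cpus itself once it starts):

# Create both cgroup v1 cpuset partitions that 1.7.x expects.
sudo mkdir -p /sys/fs/cgroup/cpuset/nomad/reserve \
              /sys/fs/cgroup/cpuset/nomad/share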

liukch commented 2 months ago

Same issue while upgrading from 1.6.6 -> 1.7.7.

This issue does not reproduce reliably: I started a new instance, started the Nomad client process, and then exited it normally by sending SIGINT. On restart, there is a chance of this error: agent: error starting agent: error="client setup failed: failed to initialize process manager: failed to write root partition cpuset: write /sys/fs/cgroup/nomad.slice/cpuset.cpus: device or resource busy"

cesan3 commented 2 months ago

Unfortunately, in our case, to migrate from 1.6.x to 1.7.x we had to automate both the creation of the expected directories and the removal of the Nomad cgroup controller with cgdelete -g cpuset:/nomad, to avoid rebooting the nodes (roughly as sketched below).

But once you're on 1.7.x, you can upgrade normally.
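
A rough sketch of that automation as described (the exact ordering is an assumption; cgdelete comes from the libcgroup/cgroup-tools package):

# Remove the stale Nomad cpuset hierarchy, then create the directories the
# 1.7.x client expects, so the node does not need a reboot.
sudo cgdelete -g cpuset:/nomad
sudo mkdir -p /sys/fs/cgroup/cpuset/nomad/reserve \
              /sys/fs/cgroup/cpuset/nomad/share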

tgross commented 2 months ago

Doing a little bit of issue cleanup. There's a workaround for the original issue here, but the upgrade path is still not very nice. I'm going to re-title this and mark it for roadmapping.

The underlying issue is that in 1.7.x and beyond the name of the partition is reserve (ref partition.go#L33-L42), whereas originally it was reserved (with a "d") (ref cpuset_manager_v1.go#L31), and there's no migration in the client.
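
A quick way to see which layout a given client host currently has (directory names per the refs above and the earlier comments):

# 1.6.x names: reserved, shared; 1.7.x names: reserve, share.
ls -d /sys/fs/cgroup/cpuset/nomad/{reserved,shared,reserve,share} 2>/dev/null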