nestybox / sysbox

An open-source, next-generation "runc" that empowers rootless containers to run workloads such as Systemd, Docker, Kubernetes, just like VMs.
Apache License 2.0

sysbox-fs high cpu usage, infinite unmount calls on shared device #808

Open sibidharan opened 3 months ago

sibidharan commented 3 months ago

When I launch a KASM container with Sysbox and share a GPU via --device=/dev/dri/renderD128, sysbox-fs logs go crazy and CPU usage shoots up. I enabled debug logs and I see this:

time="2024-06-01 03:05:54" level=debug msg="Received umount syscall from pid 1098145"
time="2024-06-01 03:05:54" level=debug msg="target: /run/systemd/mount-rootfs/sys/devices/virtual, flags: 0x8, root: /, cwd: /"
time="2024-06-01 03:05:54" level=debug msg="Ignoring unmount of sysbox-fs managed submount at /run/systemd/mount-rootfs/sys/devices/virtual"
time="2024-06-01 03:05:54" level=debug msg="Received umount syscall from pid 1098145"
time="2024-06-01 03:05:54" level=debug msg="target: /run/systemd/mount-rootfs/sys, flags: 0x8, root: /, cwd: /"
time="2024-06-01 03:05:54" level=debug msg="Received umount syscall from pid 1098145"
time="2024-06-01 03:05:54" level=debug msg="target: /run/systemd/mount-rootfs/sys/devices/virtual, flags: 0x8, root: /, cwd: /"
time="2024-06-01 03:05:54" level=debug msg="Ignoring unmount of sysbox-fs managed submount at /run/systemd/mount-rootfs/sys/devices/virtual"

If I restart the sysbox-fs service, the issue goes away temporarily for already-deployed containers (though I can no longer docker exec into the running containers afterwards), but if I deploy a new container, the issue starts again when sharing devices (or possibly something else?).

Any idea what causes this infinite loop of umount calls on /run/systemd/mount-rootfs/sys/devices/virtual, and why it goes away when sysbox-fs is restarted?
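For reference, the container is launched with a command roughly like the sketch below (the KASM image name and port are placeholders, not my exact command):

$ docker run -d --runtime=sysbox-runc --device=/dev/dri/renderD128 -p 6901:6901 kasmweb/desktop:latest    # image/port are illustrative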

Log File: sysbox-fs.log

(After some research...)

I can see a lot of "Received umount syscall from pid 1092497" messages for different targets, and those seem to complete just fine. I searched for the first occurrence of umount in the log file and traced every umount call from there.

time="2024-06-01 02:57:40" level=debug msg="Received umount syscall from pid 1092497"
time="2024-06-01 02:57:40" level=debug msg="target: /sys/fs/cgroup/unified, flags: 0x8, root: /, cwd: /var/labsdata"
time="2024-06-01 02:57:40" level=debug msg="Received mount syscall from pid 1092497"
time="2024-06-01 02:57:40" level=debug msg="source: cgroup2, target: /sys/fs/cgroup/unified, fstype: cgroup2, flags: 0xe, data: , root: /, cwd: /var/labsdata"
time="2024-06-01 02:57:40" level=debug msg="Received umount syscall from pid 1092497"
time="2024-06-01 02:57:40" level=debug msg="target: /sys/fs/cgroup/unified, flags: 0x8, root: /, cwd: /var/labsdata"

From line 7379 of the log file we can see the first occurrence of the umount call to /run/systemd/mount-rootfs/sys/devices/virtual that gets ignored, and from then on it is just an infinite loop. For every container I deploy with a device this adds up, and the log file fills with these messages; I had to turn off debug logging because it was consuming a lot of storage. This never stops, but it only happens when I pass --device=/dev/dri/renderD128, so with the little knowledge I have, my understanding is that the infinite umount calls are related to the device I passed, which is somehow causing an infinite loop.

time="2024-06-01 02:58:30" level=debug msg="Received umount syscall from pid 1098145"
time="2024-06-01 02:58:30" level=debug msg="target: /run/systemd/mount-rootfs/sys/devices/virtual, flags: 0x8, root: /, cwd: /"
time="2024-06-01 02:58:30" level=debug msg="Requested ReadDirAll() on directory /sys/kernel/mm/hugepages (req ID=0x1454)"
time="2024-06-01 02:58:30" level=debug msg="Executing ReadDirAll() for req-id: 0x1454, handler: SysKernel, resource: hugepages"
time="2024-06-01 02:58:30" level=debug msg="Ignoring unmount of sysbox-fs managed submount at /run/systemd/mount-rootfs/sys/devices/virtual"

I went through the code at https://github.com/nestybox/sysbox-fs/blob/master/nsenter/utils.go - this file could potentially enter a cleanup loop that repeatedly sends unmount calls, which are then ignored by the seccomp handler as shown in the log, here: https://github.com/nestybox/sysbox-fs/blob/4c2bc153f33af1bd30a227a14ecfc8174ff280d5/seccomp/umount.go#L128

Can we skip unmounting these mounts that are guaranteed to be ignored by the seccomp handler anyway, and thereby save a lot of CPU? Is my understanding of what's going on correct? If so, how can this issue be solved?
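For anyone trying to quantify the impact, here is a rough sketch of how I sampled the sysbox-fs CPU usage (assumes the sysstat package for pidstat):

$ pidstat -p "$(pidof sysbox-fs)" 1 10    # report sysbox-fs CPU usage once per second, 10 samples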

ctalledo commented 3 months ago

Hi @sibidharan, thanks for trying Sysbox and reporting the issue.

When I launch a KASM container with Sysbox and share a GPU via --device=/dev/dri/renderD128

I don't know whether sharing the device will work; it will probably show up as nobody:nogroup inside the rootless Sysbox container. There's a long-standing GitHub issue about adding support for GPUs inside Sysbox containers, but that work is not yet completed.

Regarding the continuous unmounts: maybe there's a bug in Sysbox, but let me provide a bit of context so you can better understand what's going on (and hopefully help us root-cause the problem).

Sysbox intercepts the mount and umount syscalls inside the container. When intercepting the umount syscall, one of the things Sysbox does is check whether the unmount is targeting one of the sysbox-fs mounts inside the container (e.g., /sys/kernel, /sys/devices/virtual, and a few others). If so, it ignores the unmount for security reasons, as otherwise a process inside the container could tear down the sysbox-fs mounts and likely bypass the kernel-resource namespacing that these mounts provide.

My guess is that what's happening is that systemd inside the Sysbox container is unmounting /run/systemd/mount-rootfs/sys/devices/virtual, which must be a sysbox-fs mount too (i.e., it must be a bind-mount of /sys/devices/virtual into a different mountpoint). You can confirm by doing findmnt inside the Sysbox container. Nevertheless, Sysbox should allow that mount to be unmounted because it's not the original /sys/devices/virtual mount, but rather a bind-mount of it. My guess is that due to some bug in Sysbox, it's getting confused and ignoring the unmount, and that in turn is causing systemd to re-issue the unmount call (ad infinitum).

Let's start by having you post the output of findmnt inside the Sysbox container to figure out what's going on ...

Thanks!

sibidharan commented 3 months ago

I don't know whether sharing the device will work; it will probably show up as nobody:nogroup inside the rootless Sysbox container. There's a long-standing GitHub issue about adding support for GPUs inside Sysbox containers, but that work is not yet completed.

Yes, for this I created a udev rule that chmods /dev/dri/renderD128 to 666, and I am not sharing the actual card0 node, just renderD128, which does work inside Sysbox: I am able to use GPU acceleration very well by sharing the renderD128 node. It does show up as nobody:nogroup, but with the help of the udev rule I can use the character device without any issue.
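For completeness, the rule is roughly like the sketch below (the rule file name is arbitrary and the match keys are the standard drm ones; adjust for your setup):

# /etc/udev/rules.d/99-render-node.rules (illustrative path)
SUBSYSTEM=="drm", KERNEL=="renderD128", MODE="0666"

$ udevadm control --reload-rules && udevadm trigger --subsystem-match=drm    # apply without a reboot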

As for findmnt, here is the output from inside the Sysbox container. This is not a rootless setup, and it uses /sbin/init as PID 1.

TARGET               SOURCE                                         FSTYPE   OPTIONS
/                    .                                              shiftfs  rw,relatime
├─/sys               sysfs                                          sysfs    rw,nosuid,nodev,noexec,relatime
│ ├─/sys/firmware    tmpfs                                          tmpfs    ro,relatime,uid=165536,gid=165536,inode64
│ ├─/sys/fs/cgroup   tmpfs                                          tmpfs    ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755,u
│ │ ├─/sys/fs/cgroup/systemd
│ │ │                systemd                                        cgroup   rw,nosuid,nodev,noexec,relatime,xattr,name=systemd
│ │ ├─/sys/fs/cgroup/net_cls,net_prio
│ │ │                cgroup                                         cgroup   rw,nosuid,nodev,noexec,relatime,net_cls,net_prio
│ │ ├─/sys/fs/cgroup/cpuset
│ │ │                cgroup                                         cgroup   rw,nosuid,nodev,noexec,relatime,cpuset
│ │ ├─/sys/fs/cgroup/freezer
│ │ │                cgroup                                         cgroup   rw,nosuid,nodev,noexec,relatime,freezer
│ │ ├─/sys/fs/cgroup/devices
│ │ │                cgroup                                         cgroup   rw,nosuid,nodev,noexec,relatime,devices
│ │ ├─/sys/fs/cgroup/hugetlb
│ │ │                cgroup                                         cgroup   rw,nosuid,nodev,noexec,relatime,hugetlb
│ │ ├─/sys/fs/cgroup/memory
│ │ │                cgroup                                         cgroup   rw,nosuid,nodev,noexec,relatime,memory
│ │ ├─/sys/fs/cgroup/cpu,cpuacct
│ │ │                cgroup                                         cgroup   rw,nosuid,nodev,noexec,relatime,cpu,cpuacct
│ │ ├─/sys/fs/cgroup/rdma
│ │ │                cgroup                                         cgroup   rw,nosuid,nodev,noexec,relatime,rdma
│ │ ├─/sys/fs/cgroup/misc
│ │ │                cgroup                                         cgroup   rw,nosuid,nodev,noexec,relatime,misc
│ │ ├─/sys/fs/cgroup/perf_event
│ │ │                cgroup                                         cgroup   rw,nosuid,nodev,noexec,relatime,perf_event
│ │ ├─/sys/fs/cgroup/blkio
│ │ │                cgroup                                         cgroup   rw,nosuid,nodev,noexec,relatime,blkio
│ │ └─/sys/fs/cgroup/pids
│ │                  cgroup                                         cgroup   rw,nosuid,nodev,noexec,relatime,pids
│ ├─/sys/devices/virtual
│ │ │                sysboxfs[/sys/devices/virtual]                 fuse     rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permi
│ │ └─/sys/devices/virtual/powercap
│ │                  tmpfs                                          tmpfs    ro,relatime,uid=165536,gid=165536,inode64
│ ├─/sys/kernel      sysboxfs[/sys/kernel]                          fuse     rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permi
│ └─/sys/module/nf_conntrack/parameters
│                    sysboxfs[/sys/module/nf_conntrack/parameters]  fuse     rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permi
├─/proc              proc                                           proc     rw,nosuid,nodev,noexec,relatime
│ ├─/proc/bus        proc[/bus]                                     proc     ro,nosuid,nodev,noexec,relatime
│ ├─/proc/fs         proc[/fs]                                      proc     ro,nosuid,nodev,noexec,relatime
│ ├─/proc/irq        proc[/irq]                                     proc     ro,nosuid,nodev,noexec,relatime
│ ├─/proc/sysrq-trigger
│ │                  proc[/sysrq-trigger]                           proc     ro,nosuid,nodev,noexec,relatime
│ ├─/proc/acpi       tmpfs                                          tmpfs    ro,relatime,uid=165536,gid=165536,inode64
│ ├─/proc/keys       udev[/null]                                    devtmpfs rw,nosuid,relatime,size=40632724k,nr_inodes=10158181,mode=7
│ ├─/proc/timer_list udev[/null]                                    devtmpfs rw,nosuid,relatime,size=40632724k,nr_inodes=10158181,mode=7
│ ├─/proc/scsi       tmpfs                                          tmpfs    ro,relatime,uid=165536,gid=165536,inode64
│ ├─/proc/swaps      sysboxfs[/proc/swaps]                          fuse     rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permi
│ ├─/proc/sys        sysboxfs[/proc/sys]                            fuse     rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permi
│ └─/proc/uptime     sysboxfs[/proc/uptime]                         fuse     rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permi
├─/dev               tmpfs                                          tmpfs    rw,nosuid,size=65536k,mode=755,uid=165536,gid=165536,inode6
│ ├─/dev/mqueue      mqueue                                         mqueue   rw,nosuid,nodev,noexec,relatime
│ ├─/dev/pts         devpts                                         devpts   rw,nosuid,noexec,relatime,gid=165541,mode=620,ptmxmode=666
│ ├─/dev/shm         shm                                            tmpfs    rw,nosuid,nodev,noexec,relatime,size=65536k,uid=165536,gid=
│ ├─/dev/null        udev[/null]                                    devtmpfs rw,nosuid,relatime,size=40632724k,nr_inodes=10158181,mode=7
│ ├─/dev/random      udev[/random]                                  devtmpfs rw,nosuid,relatime,size=40632724k,nr_inodes=10158181,mode=7
│ ├─/dev/kmsg        udev[/null]                                    devtmpfs rw,nosuid,relatime,size=40632724k,nr_inodes=10158181,mode=7
│ ├─/dev/full        udev[/full]                                    devtmpfs rw,nosuid,relatime,size=40632724k,nr_inodes=10158181,mode=7
│ ├─/dev/tty         udev[/tty]                                     devtmpfs rw,nosuid,relatime,size=40632724k,nr_inodes=10158181,mode=7
│ ├─/dev/zero        udev[/zero]                                    devtmpfs rw,nosuid,relatime,size=40632724k,nr_inodes=10158181,mode=7
│ ├─/dev/urandom     udev[/urandom]                                 devtmpfs rw,nosuid,relatime,size=40632724k,nr_inodes=10158181,mode=7
│ └─/dev/dri/renderD128
│                    udev[/dri/renderD128]                          devtmpfs rw,nosuid,relatime,size=40632724k,nr_inodes=10158181,mode=7
├─/run               tmpfs                                          tmpfs    rw,nosuid,nodev,relatime,size=65536k,mode=755,uid=165536,gi
│ ├─/run/user/1000   tmpfs                                          tmpfs    rw,nosuid,nodev,relatime,size=419428k,nr_inodes=104857,mode
│ └─/run/lock        tmpfs                                          tmpfs    rw,nosuid,nodev,noexec,relatime,size=4096k,uid=165536,gid=1
├─/var/lib/buildkit  /dev/mapper/ubuntu--vg-ubuntu--lv[/var/lib/sysbox/buildkit/0c944f10badddd149a0c05eee3fdf96a2ac36258e51362040cca226301595fae]
│                                                                   ext4     rw,relatime
├─/var/labsstorage   /dev/vdb[/storage/bc8f9027a554affb095f58827a850269]
│                                                                   xfs      rw,relatime,idmapped,attr2,inode64,logbufs=8,logbsize=32k,n
├─/etc/resolv.conf   /dev/mapper/ubuntu--vg-ubuntu--lv[/var/lib/docker/containers/0c944f10badddd149a0c05eee3fdf96a2ac36258e51362040cca226301595fae/resolv.conf]
│                                                                   ext4     rw,relatime,idmapped
├─/etc/hostname      /dev/mapper/ubuntu--vg-ubuntu--lv[/var/lib/docker/containers/0c944f10badddd149a0c05eee3fdf96a2ac36258e51362040cca226301595fae/hostname]
│                                                                   ext4     rw,relatime,idmapped
├─/etc/hosts         /dev/mapper/ubuntu--vg-ubuntu--lv[/var/lib/docker/containers/0c944f10badddd149a0c05eee3fdf96a2ac36258e51362040cca226301595fae/hosts]
│                                                                   ext4     rw,relatime,idmapped
├─/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs
│                    /dev/mapper/ubuntu--vg-ubuntu--lv[/var/lib/sysbox/containerd/0c944f10badddd149a0c05eee3fdf96a2ac36258e51362040cca226301595fae]
│                                                                   ext4     rw,relatime
├─/var/lib/docker    /dev/mapper/ubuntu--vg-ubuntu--lv[/var/lib/sysbox/docker/0c944f10badddd149a0c05eee3fdf96a2ac36258e51362040cca226301595fae]
│                                                                   ext4     rw,relatime
├─/var/lib/rancher/k3s
│                    /dev/mapper/ubuntu--vg-ubuntu--lv[/var/lib/sysbox/rancher-k3s/0c944f10badddd149a0c05eee3fdf96a2ac36258e51362040cca226301595fae]
│                                                                   ext4     rw,relatime
├─/var/lib/rancher/rke2
│                    /dev/mapper/ubuntu--vg-ubuntu--lv[/var/lib/sysbox/rancher-rke2/0c944f10badddd149a0c05eee3fdf96a2ac36258e51362040cca226301595fae]
│                                                                   ext4     rw,relatime
├─/var/lib/kubelet   /dev/mapper/ubuntu--vg-ubuntu--lv[/var/lib/sysbox/kubelet/0c944f10badddd149a0c05eee3fdf96a2ac36258e51362040cca226301595fae]
│                                                                   ext4     rw,relatime
├─/var/lib/k0s       /dev/mapper/ubuntu--vg-ubuntu--lv[/var/lib/sysbox/k0s/0c944f10badddd149a0c05eee3fdf96a2ac36258e51362040cca226301595fae]
│                                                                   ext4     rw,relatime
├─/usr/src/linux-headers-5.15.0-107
│                    /dev/mapper/ubuntu--vg-ubuntu--lv[/usr/src/linux-headers-5.15.0-107]
│                                                                   ext4     ro,relatime,idmapped
├─/usr/src/linux-headers-5.15.0-107-generic
│                    /dev/mapper/ubuntu--vg-ubuntu--lv[/usr/src/linux-headers-5.15.0-107-generic]
│                                                                   ext4     ro,relatime,idmapped
└─/usr/lib/modules/5.15.0-107-generic
                     /dev/mapper/ubuntu--vg-ubuntu--lv[/usr/lib/modules/5.15.0-107-generic]
                                                                    ext4     ro,relatime,idmapped
ctalledo commented 3 months ago

Hi @sibidharan, thanks for the findmnt output, but I don't see the mount on /run/systemd/mount-rootfs/sys/devices/virtual in it. I think you need to take the findmnt output when the problem actually occurs (?).

Looking again at this, I think systemd is actually recursively bind-mounting the container's rootfs (/) to /run/systemd/mount-rootfs/, and then unmounting /run/systemd/mount-rootfs/sys/devices/virtual from it, but Sysbox is ignoring the last unmount. Not sure though (this is why the findmnt at the time the problem occurs would help).

Assuming I am right, you can try the following to manually reproduce (inside the Sysbox container):

$ mkdir -p /run/test/mount-rootfs
$ mount --rbind / /run/test/mount-rootfs    # recursively bind mount the container's root (/) to /run/test/mount-rootfs
$ umount /run/test/mount-rootfs/sys/devices/virtual

After the last umount, check if the mount at /run/test/mount-rootfs/sys/devices/virtual was removed or if it remains in place.
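For example, a quick check (an empty result would mean the unmount actually took effect):

$ findmnt /run/test/mount-rootfs/sys/devices/virtual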

sibidharan commented 3 months ago

My guess is that what's happening is that systemd inside the Sysbox container is unmounting /run/systemd/mount-rootfs/sys/devices/virtual, which must be a sysbox-fs mount too (i.e., it must be a bind-mount of /sys/devices/virtual into a different mountpoint).

Thank you for this valuable input. I was able to boil it down to the process that's issuing the infinite calls under /sbin/init, called (crub_all) - it looks non-standard and I am unable to find any info about (crub_all) on the internet, but killing this process stops the infinite unmount calls. To temporarily address the issue, I am issuing a kill command for this process shortly after the container is up, but I am not sure if that is the right approach, since it feels a bit hacky.

I think you need to take the findmnt output when the problem actually occurs (?).

I indeed took this findmnt from inside the sysbox container. Do you want to see the host's findmnt as well?

Assuming I am right, you can try the following to manually reproduce (inside the Sysbox container):

$ mkdir -p /run/test/mount-rootfs
$ mount --rbind / /run/test/mount-rootfs    # recursively bind mount the container's root (/) to /run/test/mount-rootfs
$ umount /run/test/mount-rootfs/sys/devices/virtual

After the last umount, check if the mount at /run/test/mount-rootfs/sys/devices/virtual was removed or if it remains in place.

Here is the output:

root@kali:/var/labsdata# mkdir -p /run/test/mount-rootfs
root@kali:/var/labsdata# mount --rbind / /run/test/mount-rootfs
mount: /run/test/mount-rootfs: permission denied.
       dmesg(1) may have more information after failed mount system call.
root@kali:/var/labsdata# umount /run/test/mount-rootfs/sys/devices/virtual
umount: /run/test/mount-rootfs/sys/devices/virtual: no mount point specified.
root@kali:/var/labsdata#
ctalledo commented 3 months ago

Thank you for this valuable input. I was able to boil it down to the process that's issuing the infinite calls under /sbin/init, called (crub_all) - it looks non-standard and I am unable to find any info about (crub_all) on the internet, but killing this process stops the infinite unmount calls.

I see ... interesting; it must be related to systemd though, since it's trying to unmount under a systemd path (/run/systemd/mount-rootfs/sys/devices/virtual). Maybe looking at the systemd source code will yield more info about it.

I indeed took this findmnt from inside the sysbox container. Do you want to see the host's findmnt as well?

Sorry I meant to take the findmnt when the crub_all process is issuing the infinite umount syscalls, so I can see the mountpoint it's trying to unmount.

It may be that you need to issue findmnt in a loop too until the findmnt output reports a mount under /run/systemd/mount-rootfs/sys/devices/virtual (?)

sibidharan commented 3 months ago

It may be that you need to issue findmnt in a loop too until the findmnt output reports a mount under /run/systemd/mount-rootfs/sys/devices/virtual (?)

Yes, I run findmnt with watch -n 1 "findmnt | grep virtual" while (crub_all) is sending infinite umount calls but nothing much is showing up in the findmnt. Should I increase the watch frequency? I will try this and update you.

ctalledo commented 3 months ago

Yes, I run findmnt with watch -n 1 "findmnt | grep virtual" while (crub_all) is sending infinite umount calls but nothing much is showing up in the findmnt. Should I increase the watch frequency? I will try this and update you.

Got it ... yes increase it to 0.5 secs (the minimum supported I believe): watch -n 0.5 ...

Hopefully that works; if it doesn't then we need to think of another approach to debug it ...

ctalledo commented 3 months ago

If that doesn't work, maybe the unmount is happening in a dedicated mount namespace. In that case, from inside the sysbox container, find the PID of the crub_all process that performs that unmount, and then try:

$ cat /proc/<pid>/mountinfo | grep "/run/systemd/mount-rootfs/sys/devices/virtual"

Or maybe:

$ nsenter -a -t <pid>
$ findmnt
$ exit
sibidharan commented 3 months ago

Hi @ctalledo, sorry for the delayed response.

I found more info about the (crub_all) process - it belongs to e2scrub_all.timer and e2scrub_reap.service

root@kali:/var/labsdata# systemctl status e2scrub_reap
● e2scrub_reap.service - Remove Stale Online ext4 Metadata Check Snapshots
     Loaded: loaded (/usr/lib/systemd/system/e2scrub_reap.service; enabled; preset: disabled)
     Active: activating (start) since Sat 2024-06-08 11:31:03 UTC; 26min ago
       Docs: man:e2scrub_all(8)
   Main PID: 45746 ((crub_all))
      Tasks: 1 (limit: 95232)
     Memory: 236.0K ()
     CGroup: /system.slice/e2scrub_reap.service
             └─45746 "(crub_all)"

Jun 08 11:31:03 kali.selfmade.ninja systemd[1]: Starting e2scrub_reap.service - Remove Stale Online ext4 Metadata Check Snapshots...

This timer starts the service automatically every Sunday, and I guess it is not needed inside the container - am I correct? Is disabling these services enough, or does this also have to be fixed in Sysbox?

root@kali:/var/labsdata# systemctl status e2scrub_all.timer
● e2scrub_all.timer - Periodic ext4 Online Metadata Check for All Filesystems
     Loaded: loaded (/usr/lib/systemd/system/e2scrub_all.timer; enabled; preset: enabled)
     Active: active (waiting) since Sat 2024-06-08 11:15:58 UTC; 48min ago
    Trigger: Sun 2024-06-09 03:10:26 UTC; 15h left
   Triggers: ● e2scrub_all.service

Jun 08 11:15:58 kali.selfmade.ninja systemd[1]: Started e2scrub_all.timer - Periodic ext4 Online Metadata Check for All Filesystems.

Got it ... yes increase it to 0.5 secs (the minimum supported I believe): watch -n 0.5 ...

Hopefully that works; if it doesn't then we need to think of another approach to debug it ...

However, nothing showed up under watch, so I tried the following:

$ cat /proc/<pid>/mountinfo | grep "/run/systemd/mount-rootfs/sys/devices/virtual"

root@kali:/usr/lib/systemd/system# cat /proc/87684/mountinfo | grep "/run/systemd/mount-rootfs/sys/devices/virtual"
45771 45596 0:6467 /sys/devices/virtual /run/systemd/mount-rootfs/sys/devices/virtual rw,nosuid,nodev,relatime - fuse sysboxfs rw,user_id=0,group_id=0,default_permissions,allow_other
45772 45771 0:6576 / /run/systemd/mount-rootfs/sys/devices/virtual/powercap ro,relatime - tmpfs tmpfs ro,uid=165536,gid=165536,inode64

nsenter didn't work, by the way. Hope this is enough info for you; let me know if you want more.

ctalledo commented 2 months ago

Hi @sibidharan,

Thanks for the extra info.

Is disabling these services enough, or does this also have to be fixed in Sysbox?

Since we are having trouble understanding the issue fully, I can't yet say if it's a Sysbox bug or not.

I think the easiest path would be to disable the e2scrub services inside the container (they are probably running at host level anyway, so having them inside the container is likely redundant).
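For example, inside the container (as root), something along these lines should do it; the unit names are the ones from your earlier systemctl output:

$ systemctl disable --now e2scrub_reap.service e2scrub_all.timer    # stop them now and keep them from starting again
$ systemctl mask e2scrub_all.service                                # optional: block it from being triggered at all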

sibidharan commented 2 months ago

I can confirm that disabling all e2scrub-related services fixes the issue for now.

bushev commented 4 weeks ago

Hey @ctalledo, thank you for looking into this. I'm experiencing the same problem, but with a slight difference: in my case, e2scrub is not running inside the container at all, and disabling the service on the Sysbox host doesn't resolve the issue either. The CPU is still at 100% usage due to Sysbox FS. The logs are pretty much the same. Is there any solution for this?

bushev commented 4 weeks ago

The issue appeared out of nowhere after rebooting the host. I didn’t perform any updates or make any changes. I simply rebooted the hosts, and the problem appeared on three out of five of them. Furthermore, when I try to set up new hosts, they also exhibit this issue right from the start. This is very strange, and there seems to be some dependency on something that I can’t figure out.

ctalledo commented 4 weeks ago

Hi @bushev,

The CPU is still at 100% usage due to Sysbox FS

Mmm ... strange. Must be the workload inside the sysbox container triggering the problem.

Is there an image I can use to reproduce? Any info that allows me to reproduce this locally would help.

For context, one area that could trigger high CPU usage by sysbox-fs is when the Sysbox container is executing lots of mount or umount system calls, as those are trapped by Sysbox. Another possible area would be if the Sysbox container is executing lots of accesses to /proc/sys/..., as that also is partially emulated by Sysbox.
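A rough way to check whether trapped mount/umount syscalls are the culprit is to count them in the sysbox-fs log (this assumes debug logging is enabled and the default log location /var/log/sysbox-fs.log):

$ grep -c "Received umount syscall" /var/log/sysbox-fs.log               # total trapped umounts so far
$ watch -n 2 'grep -c "Received umount syscall" /var/log/sysbox-fs.log'  # watch how fast they accumulate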

bushev commented 4 weeks ago

We are not making many requests to /proc/sys, and we are also not performing many mount or umount operations from inside the container. However, on the host machine, there is constant mounting and unmounting happening, with an average of around a thousand active mounts. These mounts are created and removed as users log in and out of the system.

Our image is over 20 GB, so it’s not easy to share it directly. I’ll need to slim it down and make sure the issue still reproduces, which will take some time. If the problem persists, I’ll get back to you.

bushev commented 3 weeks ago

Since this issue affected our production environment, I had to find a quick workaround to keep the service running. The only solution I’ve found so far is to restart the Sysbox service on the host immediately after starting each new container. This drops the CPU load from 80% down to just a few percent. So far, I haven’t noticed any disruptions in the operation of the containers or the applications running inside them.

I wanted to share this workaround here in case anyone else encounters the same problem and needs a temporary solution.
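Concretely, the workaround amounts to something like the sketch below (my-image is a placeholder; on some installs the unit may be sysbox-fs.service rather than sysbox):

$ docker run -d --runtime=sysbox-runc --name my-container my-image:latest
$ sudo systemctl restart sysbox    # restart Sysbox right after the container comes up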

Kl0ven commented 3 weeks ago

Hi, I can reproduce a similar issue using the nestybox docker image. Hope that helps :)

services:
  dind-sandbox:
    image: nestybox/ubuntu-noble-systemd-docker:latest
    container_name: dind_sandbox
    restart: unless-stopped
    runtime: sysbox-runc


Note: Enabling debug logs on sysbox-fs slows it down enough that it no longer uses all the CPU.

sysbox-runc
         edition:         Community Edition (CE)
         version:         0.6.4
         commit:         085502643ea5281652c6984eed9797872f22698a
         built at:         Sat Apr  6 16:43:31 UTC 2024
         built by:         Cesar Talledo
         oci-specs:         1.1.0+dev
sysbox-mgr
        edition:         Community Edition (CE)
        version:         0.6.4
        commit:         03f5d7bc584fdcb2319b2c1831bd58581185fc1c
        built at:         Sat Apr  6 16:43:43 UTC 2024
        built by:         Cesar Talledo
sysbox-fs
        edition:         Community Edition (CE)
        version:         0.6.4
        commit:         1a678b72ac430009739fa6596b824f29b1f7fe2e
        built at:         Sat Apr  6 16:43:40 UTC 2024
        built by:         Cesar Talledo

[EDIT]: Added sysbox version info

ctalledo commented 1 day ago

Hi @Kl0ven, thanks for reporting.

Yes, I am able to reproduce with the nestybox/ubuntu-noble-systemd image. There must be something in that image that is causing the high CPU usage; I'll investigate.

Thanks.

ctalledo commented 1 day ago

For the nestybox/ubuntu-noble-systemd image, I confirmed that it's the e2scrub_reap.service that is causing the high CPU usage inside the Sysbox container.

I need to understand what that service is doing and why it's consuming so many CPU cycles when running inside a Sysbox container. I suspect it's walking the container filesystem and causing high CPU utilization once it hits the files under /proc and /sys emulated by Sysbox, or issuing many mount/umount syscalls that are trapped by Sysbox.

I disabled the service and CPU utilization goes back to normal.

sudo systemctl stop e2scrub_reap.service
ctalledo commented 1 day ago

I opted for a quick solution of disabling the e2scrub_reap.service in the nestybox/ubuntu-noble-systemd image. The rationale is here: https://github.com/nestybox/dockerfiles/pull/38

I've already pushed the new image to Docker Hub, so you should not see the problem using that image.

A "proper fix" will require digging down a bit more to figure out what that service is doing when running inside a Sysbox container that is causing CPU utilization to be so high.

sibidharan commented 1 day ago

Should we also disable the related timer?

This timer starts the service automatically every Sunday, and I guess it is not needed inside the container - am I correct? Is disabling these services enough, or does this also have to be fixed in Sysbox?

root@kali:/var/labsdata# systemctl status e2scrub_all.timer
● e2scrub_all.timer - Periodic ext4 Online Metadata Check for All Filesystems
     Loaded: loaded (/usr/lib/systemd/system/e2scrub_all.timer; enabled; preset: enabled)
     Active: active (waiting) since Sat 2024-06-08 11:15:58 UTC; 48min ago
    Trigger: Sun 2024-06-09 03:10:26 UTC; 15h left
   Triggers: ● e2scrub_all.service
ctalledo commented 23 hours ago

Hi @sibidharan,

Thanks; yes, we should disable that e2scrub_all.timer; I'll do that in the nestybox/ubuntu-noble-systemd image (and derivatives of it).