nestybox / sysbox

An open-source, next-generation "runc" that empowers rootless containers to run workloads such as Systemd, Docker, Kubernetes, just like VMs.
Apache License 2.0

"no space left on device" due to high inode usage #570

Closed ScottG489 closed 1 year ago

ScottG489 commented 2 years ago

On versions after v0.4.1 (currently v0.5.0 and v0.5.2), using sysbox seems to consume a larger number of inodes, which can pretty easily use up the system's entire allotment so that no more containers can start.

Unfortunately I wasn't able to reproduce this locally, but here's a small Terraform config to stand up a machine that reproduces the issue. Just be sure to supply a public key you can use to SSH in.

resource "aws_instance" "instance" {
  ami           = "ami-09dd2e08d601bff67"
  instance_type = "t3.small"
  vpc_security_group_ids = [aws_security_group.security_group.id]
  key_name = aws_key_pair.key_pair.key_name
}

resource "aws_security_group" "security_group" {
  name = "sg_foo"
  ingress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_key_pair" "key_pair" {
  key_name   = "kp_foo"
  public_key = "<your key here>"
}

Then SSH into the machine and run the following:

export SYSBOX_VERSION=0.5.2 ; \
sudo apt-get update \
  && sudo apt-get -y install ca-certificates curl gnupg lsb-release \
  && sudo mkdir -p /etc/apt/keyrings \
  && curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg \
  && echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null \
  && sudo apt-get update \
  && sudo apt-get -y install docker-ce docker-ce-cli containerd.io docker-compose-plugin \
  && wget https://downloads.nestybox.com/sysbox/releases/v${SYSBOX_VERSION}/sysbox-ce_${SYSBOX_VERSION}-0.linux_amd64.deb \
  && sudo apt-get install -y jq \
  && sudo apt-get install -y ./sysbox-ce_${SYSBOX_VERSION}-0.linux_amd64.deb \
  && sudo apt-get install -y linux-headers-$(uname -r)

To reproduce the actual error, you'll have to run a few containers without removing them:

for i in $(seq 1 50); do sudo docker run -it --runtime=sysbox-runc node:latest echo $i || break; done

If the issue is reproducing, inode usage should spike right away, which can be seen with df -i. It usually runs out of inodes and fails before iteration 30 (at iteration 27 in my experience) with an error like this:

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: container_linux.go:292: failed to chown rootfs clone caused: failed to invoke ChownClonedRootfs via grpc: rpc error: code = Unknown desc = failed to chown cloned rootfs bottom mount at /var/lib/sysbox/rootfs/194c0a6559cab0de061561c4275408fb7331a5855b619f025f7d5cc6ccb99c65/bottom/merged by offset 165536, 165536: chown /var/lib/sysbox/rootfs/194c0a6559cab0de061561c4275408fb7331a5855b619f025f7d5cc6ccb99c65/bottom/merged/usr/share/ca-certificates/mozilla/CA_Disig_Root_R2.crt to 165536:165536 failed: no space left on device: unknown.
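
To watch the spike iteration by iteration, a small wrapper around the loop above can print the IUsed count after each run. This is only a sketch: the function names are made up for illustration, and it assumes the filesystem of interest is the one holding /var/lib/sysbox (the path that appears in the error above).

```shell
#!/bin/sh
# Print the IUsed column from `df -i`-style output read on stdin.
iused_from_df() {
    awk 'NR == 2 { print $3 }'
}

# Current inode usage of the filesystem that holds the given path.
iused() {
    df -Pi "$1" | iused_from_df
}

# Run N sysbox containers, printing inode usage after each one.
# Hypothetical helper: it needs Docker and Sysbox, so it is only defined here.
watch_inodes() {
    path=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        sudo docker run --runtime=sysbox-runc node:latest echo "$i" || return 1
        echo "iteration $i: IUsed=$(iused "$path")"
        i=$((i + 1))
    done
}
```

On an affected host it could be invoked as: watch_inodes /var/lib/sysbox 50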

A little more information about the system:

$ uname -a
Linux ip-<redacted> 5.4.0-1009-aws #9-Ubuntu SMP Sun Apr 12 19:46:01 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04 LTS
Release:        20.04
Codename:       focal

Installing shiftfs via the following seems to fix the issue:

sudo apt-get update \
  && sudo apt-get install -y make dkms git wget \
  && git clone -b k5.10 https://github.com/toby63/shiftfs-dkms.git shiftfs-k510 \
  && cd shiftfs-k510 \
  && git checkout k5.4 \
  && ./update1 \
  && sudo make -f Makefile.dkms \
  && modinfo shiftfs

The containers seem to start and exit much quicker as well.

I'd like to also test this on a newer kernel and will as soon as I'm able.

ctalledo commented 1 year ago

Another user reports hitting this same issue:

// runc test

$ df -i /data
Filesystem      Inodes  IUsed   IFree IUse% Mounted on
/dev/nvme1n1   6553600 209253 6344347    4% /data

$ sudo docker run --rm buddy/docker-cli:latest pwd
/
$ df -i /data
Filesystem      Inodes  IUsed   IFree IUse% Mounted on
/dev/nvme1n1   6553600 209253 6344347    4% /data                <<< No inode usage increase, as expected

// sysbox-runc test

$ df -i /data
Filesystem      Inodes  IUsed   IFree IUse% Mounted on
/dev/nvme1n1   6553600 209253 6344347    4% /data

$ sudo docker run --runtime=sysbox-runc --rm buddy/docker-cli:latest pwd
/
$ df -i /data
Filesystem      Inodes  IUsed   IFree IUse% Mounted on
/dev/nvme1n1   6553600 229167 6324433    4% /data            <<< Unexpected inode usage increase with Sysbox.

ctalledo commented 1 year ago

Update: issue occurs only when Sysbox (v0.5.2) runs on hosts without shiftfs. In this case Sysbox resorts to a technique called "rootfs cloning" and this is where the inode leakage occurs.
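
For a sense of scale, the df -i figures quoted earlier in this thread can be turned into a per-run cost. The arithmetic below uses only those numbers; it assumes every run of that image retains the same number of inodes, which the thread does not confirm.

```shell
#!/bin/sh
# Figures copied from the sysbox-runc `df -i /data` output quoted earlier.
before=209253        # IUsed before the run
after=229167         # IUsed after the container exited and was removed
ifree_after=6324433  # IFree after the run

# Inodes still consumed after a single `docker run --rm`.
leaked=$(( after - before ))

# Rough number of further runs before the filesystem has no free inodes.
runs_left=$(( ifree_after / leaked ))

echo "inodes retained per run: $leaked"
echo "approximate runs until exhaustion: $runs_left"
```

With these figures that is 19914 inodes per run and roughly 317 further runs on that filesystem.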

In the upcoming Sysbox release (release after v0.5.2), it's also possible to work around this problem without shiftfs, if the host machine has kernel 5.19+.

The 5.19+ kernel has a feature that allows overlayfs to be mounted on top of "ID-mapped" mounts. The Sysbox top-of-tree leverages this kernel feature to set up the container's root filesystem inside the Linux user-namespace more easily/quickly, which works around the inode leakage problem and makes Sysbox container startup/stop faster.
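
A quick way to see which side of that threshold a host falls on is to compare the output of uname -r against 5.19. The helper name below is made up for illustration, and it only compares the major and minor numbers.

```shell
#!/bin/sh
# Succeeds if kernel release $2 (e.g. "5.4.0-1009-aws") is at least
# version $1, given as "major.minor" (e.g. "5.19").
kernel_at_least() {
    want_major=${1%%.*}
    want_minor=${1#*.}
    ver=${2%%-*}            # drop a suffix such as "-1009-aws"
    major=${ver%%.*}
    rest=${ver#*.}
    minor=${rest%%.*}
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

if kernel_at_least 5.19 "$(uname -r)"; then
    echo "kernel is 5.19 or newer"
else
    echo "kernel is older than 5.19"
fi
```

The kernel from the original report (5.4.0-1009-aws) falls on the "older than 5.19" side.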

DekusDenial commented 1 year ago

@ctalledo so for machines (likely non-Debian) with kernel < 5.19, is the recommendation to install shiftfs via https://github.com/toby63/shiftfs-dkms and load the module?

ctalledo commented 1 year ago

Hi @DekusDenial, yes correct, although unfortunately shiftfs has been flaky in kernels 5.15 -> 5.19. If you have kernels < 5.15, it should work fine though.
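
To confirm that the module actually got loaded after installing it, one option is to look for shiftfs in /proc/filesystems. This is a sketch: the function name is made up, and it assumes the module registers its filesystem type under the name "shiftfs".

```shell
#!/bin/sh
# Succeeds if the named filesystem type appears in /proc/filesystems-style
# input read on stdin. Lines look like "nodev<TAB>sysfs" or "<TAB>ext4".
fs_registered() {
    awk -v fs="$1" '$NF == fs { found = 1 } END { exit !found }'
}

if [ -r /proc/filesystems ] && fs_registered shiftfs < /proc/filesystems; then
    echo "shiftfs is registered with the running kernel"
else
    echo "shiftfs is not registered; the module may need to be built and loaded"
fi
```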

DekusDenial commented 1 year ago

Thx @ctalledo! The fix commit for shiftfs (ref from your comment in https://github.com/nestybox/sysbox/issues/596#issuecomment-1437883902) seems to have been backported to older kernels like 5.15 according to the Ubuntu stream log, so following the shiftfs module installation instructions in https://github.com/toby63/shiftfs-dkms should fix the flakiness you mentioned, right?

ctalledo commented 1 year ago

The fix commit for shiftfs (ref from your comment in https://github.com/nestybox/sysbox/issues/596#issuecomment-1437883902) seems to have been backported to older kernels like 5.15 according to the Ubuntu stream log, so following the shiftfs module installation instructions in https://github.com/toby63/shiftfs-dkms should fix the flakiness you mentioned, right?

Yes I believe so; once that fix is backported then shiftfs should work fine on kernels 5.15 -> 5.19.

In general though, if your host can run with kernel 5.19+ and you install the upcoming Sysbox release, that's the best scenario: not only does it avoid the shiftfs issues and the inode problem described in this issue, it also makes Sysbox perform better (container start/stop time is reduced, particularly when running Docker Engine inside the container).

DekusDenial commented 1 year ago

Noted! That's my desired scenario as well. But my org is fixated on Amazon Linux, so getting a hassle-free kernel upgrade beyond 5.15 might take AWS a while.

ctalledo commented 1 year ago

In this case Sysbox resorts to a technique called "rootfs cloning" and this is where the inode leakage occurs.

I investigated a bit more and it appears that the inode leakage is coming from within the overlayfs kernel module, and it occurs when sysbox performs a chown of the overlayfs mount dir (e.g., the container's rootfs "merged" dir). This normally causes overlayfs to copy-up file metadata to the overlayfs "upper" dir which should consume inodes, but those inodes should be released when the container stops and the overlayfs mount and the upper dir are removed. For some reason, the removal does not appear to be reducing the inode consumption.

Will investigate more.

ctalledo commented 1 year ago

Update:

For some reason, the removal does not appear to be reducing the inode consumption.

I confirmed this; overlayfs will only reduce the inode consumption when the overlayfs mount on the container's rootfs is explicitly unmounted and the overlayfs upper dir is removed. The problem is that when a container stops, the overlayfs mount on the container's rootfs is not explicitly unmounted, but rather implicitly unmounted (i.e., as a result of the container init process dying and the associated mount namespace being destroyed). In this case, it appears that overlayfs does not release the inodes associated with the upper dir for some reason, even if that upper dir is removed.
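
For anyone who wants to poke at this by hand, here is a rough sketch of the explicit-unmount path only. It is not the code from Sysbox: the function names and scratch paths are hypothetical, it needs root, and it does not reproduce the implicit unmount that happens when a mount namespace is destroyed. The 165536 ID is the offset from the error message earlier in this issue.

```shell
#!/bin/sh
# Build the option string for an overlayfs mount from its three directories.
overlay_opts() {
    printf 'lowerdir=%s,upperdir=%s,workdir=%s' "$1" "$2" "$3"
}

# Hypothetical experiment (needs root): chown an overlayfs "merged" dir, then
# explicitly unmount it and remove the upper dir, printing IUsed at each step.
# $1 = scratch directory on the filesystem to observe, $2 = existing lower dir.
overlay_inode_experiment() {
    base=$1
    lower=$2
    mkdir -p "$base/upper" "$base/work" "$base/merged"
    df -Pi "$base" | awk 'NR == 2 { print "before mount:  " $3 }'
    mount -t overlay overlay \
        -o "$(overlay_opts "$lower" "$base/upper" "$base/work")" "$base/merged"
    chown -R 165536:165536 "$base/merged"   # triggers copy-up into the upper dir
    df -Pi "$base" | awk 'NR == 2 { print "after chown:   " $3 }'
    umount "$base/merged"                   # explicit unmount
    rm -rf "$base/upper" "$base/work" "$base/merged"
    df -Pi "$base" | awk 'NR == 2 { print "after cleanup: " $3 }'
}
```

The function is only defined above; a run would look like: overlay_inode_experiment /data/scratch /path/to/some/rootfs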

I am currently checking for a work-around, have not found one yet.

ctalledo commented 1 year ago

Found a work-around: this PR has the code and the explanation for it: https://github.com/nestybox/sysbox-mgr/pull/63

ScottG489 commented 1 year ago

Thanks @ctalledo!

I also thought I'd call out here that I think this is causing my computers that I develop on locally to spend upwards of 2-3 hours "clearing orphaned inode" on boot up. I've been seeing this for a while and couldn't figure out what it was and I just made the connection now while waiting for this to happen on my desktop :)

Thought I'd mention this here in case others were seeing the same thing.

ctalledo commented 1 year ago

Hi @ScottG489,

I also thought I'd call out here that I think this is causing my computers that I develop on locally to spend upwards of 2-3 hours "clearing orphaned inode" on boot up.

Yes I think you are correct. I hit this while rebooting my dev machine yesterday after reproducing this issue multiple times while investigating the fix.

Somehow the overlayfs inodes are becoming orphaned; as I mentioned above, I am pretty sure it's not a bug in Sysbox per se, but rather Sysbox triggering a kernel bug when stacking overlayfs mounts in an effort to work around the lack of shiftfs in hosts with kernel < 5.19. Fortunately, with kernel 5.19+ we don't need this stacking of overlayfs mounts and therefore the orphaned inode problem goes away.

ctalledo commented 1 year ago

Found a work-around: this PR has the code and the explanation for it: nestybox/sysbox-mgr#63

Unfortunately the work-around I found, while it fixes the inode leakage reported in this issue, has the undesired effect of breaking "docker build" and "docker commit" when the following conditions are met:

I tried hard to find a work-around without these downsides, but was not able to.

I've gone ahead and submitted the fix nonetheless, because having Sysbox leak inodes on "docker run" in hosts with kernel < 5.19 and no shiftfs is a more serious problem than breaking Docker commit/build with Sysbox in those hosts.

NOTE: the downside does not break docker commit / builds with the OCI runc (Docker's default runtime), and does not in any way affect the Docker builds when the Docker engine runs inside a Sysbox container. It only affects Docker builds/commits when the Docker engine at host level is configured with Sysbox as the default runtime (not a common setup).
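
To check whether a host is in that less common setup, one can look at the default runtime configured for the host's Docker engine. The sketch below assumes jq is installed and that the daemon config lives at /etc/docker/daemon.json; the function name is made up for illustration.

```shell
#!/bin/sh
# Print the default runtime from daemon.json-style JSON read on stdin.
# Docker falls back to "runc" when the key is absent.
default_runtime() {
    jq -r '."default-runtime" // "runc"'
}

config=/etc/docker/daemon.json   # assumed location of the daemon config
if command -v jq >/dev/null 2>&1 && [ -r "$config" ]; then
    echo "host default runtime: $(default_runtime < "$config")"
fi
```

If this prints sysbox-runc, host-level docker build and docker commit are the operations to watch after the fix.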

Fix will be present in the release after v0.5.2.

ctalledo commented 1 year ago

Closing since the fix is now in top-of-tree.

ScottG489 commented 1 year ago

Thanks! Looking forward to the next release!