I have replicated the Kata setup mentioned by @GabyCT and tried to set a value in /proc/sys/vm/nr_hugepages
inside a single container as follows:
sudo ctr t exec -d --exec-id "$(random_name)" test sh -c 'echo 100 | tee /proc/sys/vm/nr_hugepages'
tee: /proc/sys/vm/nr_hugepages: Read-only file system
100
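
For context, the read-only error can be confirmed by checking how proc is mounted inside the container. This is a sketch reusing the same exec invocation; random_name is assumed to be the helper from the Kata test scripts:

sudo ctr t exec --exec-id "$(random_name)" test sh -c 'mount | grep /proc/sys'

If the runtime remounts /proc/sys read-only, it shows up here as an ro proc mount. Note that with Kata the write would in any case target the guest kernel, not the host.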
In a second experiment I set kernel_params = "rw" in the Kata configuration.toml, then ran the container
with the --privileged flag and got this result:
sudo ctr run --privileged --runtime io.containerd.run.kata.v2 -t --rm docker.io/library/busybox:latest hello sh
ctr: failed to create shim: failed to hotplug block device &{File:/dev/md0 Format:raw ID:drive-7800d0615bee700c MmioAddr: SCSIAddr: NvdimmID: VirtPath:/dev/vdj DevNo: PCIPath: Index:9 ShareRW:false ReadOnly:false Pmem:false Swap:false} error: 500 reason: VmAddDisk(VmAddDisk(DeviceManager(DetectImageType(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })))): unknown
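
For reference, the kernel_params change amounts to something like this in configuration.toml (a sketch; the stanza name assumes the Cloud Hypervisor variant of the file, e.g. configuration-clh.toml):

[hypervisor.clh]
# append "rw" to the guest kernel command line so the guest rootfs mounts read-write
kernel_params = "rw"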
@likebreath do you think you could take a look?
Closing this issue as it has been solved.
For future reference:
The resolution was removing unused block devices under /dev
(e.g. snap and loop devices) from the host machine that was experiencing the issue (an AWS instance).
The failure is likely due to Cloud Hypervisor's limit of supporting up to 32 devices, since Kata's default behavior when launching a privileged container is to share all block devices under /dev
with the guest/container (at least for Kata 1.x with Docker).
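
A sketch of that cleanup, assuming the surplus entries are detached loop devices (device names here are illustrative):

$ lsblk                        # inspect block devices on the host
$ sudo losetup -a              # list active loop devices
$ sudo losetup -d /dev/loop5   # detach an unused loop device

Trimming /dev this way keeps the number of devices shared into a privileged container below Cloud Hypervisor's limit.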
I am using the script https://github.com/kata-containers/tests/blob/main/metrics/storage/webtooling.sh with CLH+containerd. I am enabling hugepages for CLH in the configuration.toml, and I am also allocating hugepages both on the host and inside the containers. On the host I am using:
$ echo 20480 | sudo tee /proc/sys/vm/nr_hugepages
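
The CLH hugepages setting mentioned above presumably maps to the hypervisor knob in configuration.toml, along these lines (a sketch; the stanza name assumes the CLH variant of the file):

[hypervisor.clh]
# back the guest VM memory with huge pages on the host
enable_hugepages = true

The host-side allocation can then be verified with:

$ grep Huge /proc/meminfo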
The way that I am allocating the hugepages inside the containers is by adding the allocation after line https://github.com/kata-containers/tests/blob/main/metrics/storage/webtooling.sh#L175; however, it seems that when running 20 or more containers we hit issues like the ones shown in the logs below.
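
For illustration only, since the exact command added after L175 is not shown here, the per-container allocation presumably looks something like this hypothetical sketch:

# hypothetical: set nr_hugepages inside each container started by webtooling.sh
sudo ctr t exec --exec-id hp-alloc "$container" sh -c 'echo 1024 | tee /proc/sys/vm/nr_hugepages'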
Here is the environment that I am using:
And here are the logs:
/cc @dborquez @likebreath