rodnymolina closed 3 years ago
After investigating further, we found that the problem was caused by an error in the test itself. The cgroup memory limit test causes the container to exceed its allocated cgroup memory limit, and on some machines (e.g., GCP Ubuntu) this causes any subsequent "docker exec" into the container to hang: docker exec places the exec'd process in the container's cgroup, and since the cgroup's memory limit has already been exceeded, the kernel pauses that process. The container is not killed because the test purposely launches it with the OOM killer disabled.
The fix is to modify the test so that, once the container exceeds its memory limit, the test no longer uses docker exec to enter it; instead, it uses nsenter to collect the data it needs from inside the container.
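One way to confirm that a container is in this state is to look at its memory cgroup's memory.oom_control file. The sketch below assumes cgroup v1, where that file reports oom_kill_disable and under_oom as 0 or 1. The helper name cgroup_stalled_under_oom is hypothetical, and the cgroup path passed to it depends on the host's cgroup driver.

```shell
# Hypothetical helper (not part of sysbox or its test suite).
# Succeeds if the cgroup v1 memory cgroup directory given as $1 has the
# OOM killer disabled and is currently under OOM, i.e. tasks in it that
# try to allocate memory are paused by the kernel instead of being killed.
# Returns 2 if memory.oom_control cannot be read.
cgroup_stalled_under_oom() {
  local oom_control="$1/memory.oom_control"

  [ -r "$oom_control" ] || return 2

  grep -q '^oom_kill_disable 1$' "$oom_control" &&
    grep -q '^under_oom 1$' "$oom_control"
}
```

With Docker's cgroupfs driver the directory is typically /sys/fs/cgroup/memory/docker/<container-id>; that location is an assumption and differs with the systemd driver.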
Fix is here: https://github.com/nestybox/sysbox/pull/235
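The actual change is in the PR linked above. As a rough illustration of the nsenter approach only, the sketch below looks up the container's init PID with docker inspect and runs a command in that process's namespaces. Because nsenter joins namespaces but does not move the new process into the container's cgroup, the command is not subject to the exceeded memory limit. The helper name container_nsenter and the choice of namespaces (mount, UTS, IPC, network, PID) are assumptions, and the command must run as root on the host.

```shell
# Hypothetical helper (not the code from the PR).
# Usage: container_nsenter <container> <command> [args...]
# Runs <command> inside the container's namespaces without "docker exec",
# so the new process stays in the caller's cgroup.
container_nsenter() {
  local container="$1"
  shift

  local pid
  # .State.Pid is the host PID of the container's init process
  # (0 if the container is not running).
  pid="$(docker inspect -f '{{.State.Pid}}' "$container")" || return 1
  [ "$pid" -gt 0 ] 2>/dev/null || return 1

  nsenter -t "$pid" -m -u -i -n -p -- "$@"
}
```

For example, container_nsenter syscont cat /proc/meminfo would read the file as seen from inside a container named syscont (a made-up name).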
Sysbox-runc is consistently getting stuck while running the tests/cgroup/cgroup.bats:test_cgroup_memory() testcase. The problem is reproduced in this setup:
The sys container is able to register with sysbox-fs, so the problem starts at a very late stage of the initialization cycle. I used the debugger to step through sysbox-runc's initialization logic for both the parent and its child processes and didn't notice anything abnormal. In fact, the hang is not observed until sysbox-runc's parent process is almost done with container initialization, and by then its grandchild process has already exec()ed to complete its initialization.
It looks like the problem is somewhat related to the lack of the swap-memory-limitation feature in this kernel (see the generated log further below):
The problem can be easily reproduced by spawning a sys container with this instruction:
If the problem turns out to be related to the absence of this kernel feature, sysbox-runc should detect this scenario and return a friendly error message to the user. In either case, we should always return the prompt to the user and avoid getting stuck.
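A minimal sketch of the kind of up-front check suggested here, written in shell rather than as sysbox-runc's Go code. It assumes cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory, where the memory.memsw.limit_in_bytes file is present only when the kernel has swap accounting enabled. The function name swap_limit_supported is hypothetical and not part of sysbox.

```shell
# Hypothetical helper (not part of sysbox-runc).
# Succeeds if the cgroup v1 memory controller exposes swap limiting.
# $1 optionally overrides the memory controller mount point.
swap_limit_supported() {
  local cgroup_mem_root="${1:-/sys/fs/cgroup/memory}"

  # The memsw files are absent when swap accounting is disabled
  # or not compiled into the kernel.
  [ -e "${cgroup_mem_root}/memory.memsw.limit_in_bytes" ]
}
```

A caller could run this before starting the container and, when it fails, print a clear error and exit instead of proceeding.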