opencontainers / runc

CLI tool for spawning and running containers according to the OCI specification
https://www.opencontainers.org/
Apache License 2.0

CI Flaky test: TestSkipDevicesTrue: mkdir /sys/fs/cgroup/hugetlb/system.slice/system-runc_test_pods.slice: no such file or directory #3743

Closed rata closed 1 year ago

rata commented 1 year ago

On centos-7, this failure occurs from time to time:

=== RUN   TestSkipDevicesTrue
    systemd_test.go:136: mkdir /sys/fs/cgroup/hugetlb/system.slice/system-runc_test_pods.slice: no such file or directory
--- FAIL: TestSkipDevicesTrue (0.07s)

rata commented 1 year ago

Here is another failure, probably related to the same underlying issue: https://cirrus-ci.com/task/6425762278932480?logs=unit_tests#L319

=== RUN   TestSkipDevicesTrue
    systemd_test.go:187: open /sys/fs/cgroup/cpuset/system.slice/system-runc_test_pods.slice/cpuset.mems: no such file or directory
--- FAIL: TestSkipDevicesTrue (0.25s)

rata commented 1 year ago

I'm not familiar with those parts of runc, so if anyone can help debug this, that would be great! :)

fahedouch commented 1 year ago

/assign

rata commented 1 year ago

There is no prow here :)

fahedouch commented 1 year ago

I got 0 failures in 300 runs on my local machine:

[root@11b444ba6274 runc]# go test -v -run TestSkipDevicesTrue  ./libcontainer/cgroups/devices -count 300

centos7

[root@11b444ba6274 runc]# cat /etc/os-release 
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Cgroupv1

[root@11b444ba6274 runc]# df -h /sys/fs/cgroup/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           616M     0  616M   0% /sys/fs/cgroup

It is hard to debug the root cause without reproducing the error locally.

rata commented 1 year ago

So maybe it is related to some dirty state that a test running earlier leaves behind? If running only that test in isolation doesn't reproduce the failure, that is the only idea that comes to mind right now.

Other possibilities are a difference between the systemd version used in CI and the one you are using, or some other system package or kernel difference.

fahedouch commented 1 year ago

Hi @rata,

Thanks for your insights. Indeed, re-running the test in a non-isolated environment produces some failures like the ones shown above. After some debugging, I think the issue comes from this line: the program starts cpusetCopyIfNeeded before ensuring that the contents of the current cgroup directory are fully present. We could wait a few seconds after os.Mkdir(current, 0o755) until the current directory has been populated. I cannot prove this at the moment.