Closed Levovar closed 4 years ago
@balintTobik
This is a known problem, and the wait for the cpuset was added to fix it. Is 10 seconds too short a time to wait?
No, the problem is that we are waiting for the wrong thing to arrive, because the environment variables themselves are not fully provisioned when the container starts up. If we also re-read the env variables periodically, together with checking the cpuset, it will be okay!
I.e. we have seen cases where EXCLUSIVE_CPUS was set but SHARED_CPUS was empty at start-up. If we never re-read, we will wait for a cpuset containing only EXCLUSIVE_CPUS.
Okay, but that is a bit inconvenient for the application. I mean in general, not only for CPU-Pooler users. It means that for any device allocation, the application has to wait for possible environment variables to be present. I assumed that all environment variables are ready when the container starts.
I totally assumed that as well, but it looks like it is somehow container-runtime specific.
BTW solved by https://github.com/nokia/CPU-Pooler/pull/37
Describe the bug As per title. When resources are requested from multiple pools, the Pod can go into an Error state after ten seconds with the error message: "Cgroup cpuset (25-26,36) does not match to expected cpuset (36)"
To Reproduce Steps to reproduce the behavior:
Expected behavior The container is running after ten seconds and is executing the command and arguments described in its Pod spec.
Additional context The error is thrown by process starter, which we mount into containers asking for exclusive cores, since the PR introduced support for multi-pool allocations. The root cause is very prosaic: the provisioning of the environment variables SHARED_POOL and EXCLUSIVE_POOL is also done asynchronously by the container runtime :) When process starter reads them at the beginning (https://github.com/nokia/CPU-Pooler/blob/master/cmd/process-starter/process_starter.go#L124), it can happen that only one of them exists. As we currently do not refresh the expected values, we end up in a state where the provisioned cpuset will never match the faulty expectation if any of the env vars were not yet available during initial startup. The solution is to periodically re-read the env variables as well, within https://github.com/nokia/CPU-Pooler/blob/master/cmd/process-starter/process_starter.go#L93