kernel: drive all CPU timers in CPU clock ticker
gVisor currently implements CPU clocks as follows:
A per-sentry "CPU clock ticker goroutine" (task_sched.go:Kernel.runCPUClockTicker()) periodically advances Kernel.cpuClock, causing it to serve as a very coarse but inexpensive monotonic wall clock (that happens to be suspended when no tasks are running).
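As a rough illustration of this scheme (the names and structure below are simplified assumptions, not the sentry's actual code), a dedicated goroutine advances an atomic tick counter once per period, so readers get a coarse clock for the cost of an atomic load:

```go
// Illustrative sketch only: a per-sentry ticker goroutine that advances a
// coarse clock by one tick per period. A real implementation would also
// suspend the ticker while no tasks are running.
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

type coarseClock struct {
	ticks atomic.Int64 // advanced only by the ticker goroutine
}

// runTicker plays the role of Kernel.runCPUClockTicker in this sketch.
func (c *coarseClock) runTicker(period time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(period)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			c.ticks.Add(1)
		case <-stop:
			return
		}
	}
}

// Now returns the most recently published tick count.
func (c *coarseClock) Now() int64 { return c.ticks.Load() }

func main() {
	var c coarseClock
	stop := make(chan struct{})
	go c.runTicker(10*time.Millisecond, stop)
	time.Sleep(55 * time.Millisecond)
	close(stop)
	fmt.Println("ticks elapsed:", c.Now()) // roughly 5
}
```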
Task goroutines observe the most recent value of Kernel.cpuClock when changing state (Task.gosched.Timestamp), and use it to compute the number of CPU clock ticks that have elapsed in a given state. Thus, task CPU clocks are approximately based on the wall time during which they were marked as running.
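A minimal sketch of this accounting, with hypothetical field names standing in for Task.gosched and its Timestamp field: a task samples the coarse clock when it changes state and charges the elapsed ticks to the state it is leaving, so the sum of its "running" intervals approximates its CPU clock.

```go
// Illustrative sketch only; types and fields are assumptions.
package main

import "fmt"

type taskState int

const (
	stateRunning taskState = iota
	stateBlocked
)

type task struct {
	state     taskState
	timestamp int64 // coarse clock value when the current state was entered
	cpuTicks  int64 // ticks accumulated while marked running
}

// transition charges now-timestamp to the state being left.
func (t *task) transition(now int64, next taskState) {
	if t.state == stateRunning {
		t.cpuTicks += now - t.timestamp
	}
	t.state = next
	t.timestamp = now
}

func main() {
	tk := &task{state: stateRunning, timestamp: 0}
	tk.transition(3, stateRunning) // ran for 3 ticks
	tk.transition(3, stateBlocked) // blocked
	tk.transition(7, stateRunning) // 4 blocked ticks, not charged
	tk.transition(9, stateBlocked) // ran for 2 more ticks
	fmt.Println("task CPU ticks:", tk.cpuTicks) // 5
}
```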
ITIMER_VIRTUAL, ITIMER_PROF, and RLIMIT_CPU are checked by the CPU clock ticker goroutine after advancing Kernel.cpuClock. POSIX interval timers and timerfds check CPU clocks (taskClock/tgClock) in ktime.SampledTimer goroutines.
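The SampledTimer arrangement amounts to polling: a goroutine that is never told when its clock advances can only wake up periodically and re-sample it. A simplified sketch of that pattern (assumed names, not ktime's actual API) is shown below; the problems that follow refer to it.

```go
// Illustrative sketch only: a CPU timer goroutine that polls its clock once
// per timer period and fires when the target is reached.
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// pollCPUTimer samples clock once per period until it reaches target.
func pollCPUTimer(clock *atomic.Int64, target int64, period time.Duration, fire func()) {
	t := time.NewTicker(period)
	defer t.Stop()
	for range t.C {
		if clock.Load() >= target {
			fire()
			return
		}
		// Otherwise this was a wasted wakeup; the clock may not have
		// advanced at all since the last sample.
	}
}

func main() {
	var cpuClock atomic.Int64
	done := make(chan struct{})
	go pollCPUTimer(&cpuClock, 3, 10*time.Millisecond, func() {
		fmt.Println("CPU timer fired")
		close(done)
	})
	// Simulate the CPU clock advancing while the task runs.
	for i := 0; i < 3; i++ {
		time.Sleep(25 * time.Millisecond)
		cpuClock.Add(1)
	}
	<-done
}
```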
This has three major problems:
ktime.SampledTimer goroutines for CPU clock timers run concurrently with the CPU clock ticker and are not informed when corresponding tasks start or stop running (doing so would add overhead to the task execution critical path), so they cannot determine when CPU clocks have advanced or will advance next; instead, they simply poll CPU clocks on a period equal to that of the represented timer, resulting in significant overhead for CPU-clock-based POSIX interval timers and timerfds.
For the same reason, CPU clock interval timers and timerfds may expire much later than when the CPU clock is actually incremented; in the interval timer case, this can result in notification signals being sent long after tasks have stopped running. (This is the same problem as in b/116538398, which motivated the special-casing of ITIMER_VIRTUAL and ITIMER_PROF described above, but applied to POSIX interval timers.)
The sentry does not impose a limit on the number of tasks that may be concurrently marked running, so if more tasks are marked running than the number of CPUs advertised to applications, application CPU utilization can appear to exceed 100%.
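For example (simplified numbers, for illustration only), charging one tick to every running task with no cap means that 8 tasks marked running on a sandbox advertising 4 CPUs appear as 200% utilization:

```go
// Illustrative arithmetic for the third problem.
package main

import "fmt"

func main() {
	const (
		advertisedCPUs = 4
		runningTasks   = 8  // more tasks marked running than advertised CPUs
		elapsedTicks   = 10 // wall-clock ticks observed by the ticker
	)
	chargedTicks := runningTasks * elapsedTicks // one tick per running task per period
	utilization := float64(chargedTicks) / float64(elapsedTicks*advertisedCPUs)
	fmt.Printf("apparent CPU utilization: %.0f%%\n", utilization*100) // 200%
}
```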
This CL fixes these problems by introducing explicit per-Task and per-ThreadGroup CPU clocks, advancing them directly in the CPU clock ticker (at most Kernel.applicationCores of them per tick), and expiring CPU timers directly when doing so. Itimers and RLIMIT_CPU lose their special-casing and instead behave like other CPU timers (see task_acct.go). Kernel.cpuClock is still required, but only for the sentry watchdog.
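The rough shape of the new arrangement (all names and structures below are illustrative assumptions, not the CL's actual types): the ticker charges ticks to explicit per-group clocks, capped by the advertised core count, and expires CPU timers inline rather than from separate polling goroutines.

```go
// Illustrative sketch only of tick-driven CPU timers.
package main

import "fmt"

type cpuTimer struct {
	target int64  // expiration point, in ticks of the owning clock
	fire   func() // expiration callback (e.g. queue a signal)
}

type threadGroup struct {
	cpuClock int64 // explicit per-ThreadGroup CPU clock, in ticks
	running  int64 // tasks currently marked running
	timers   []*cpuTimer
}

// tick is called from the CPU clock ticker once per period.
func tick(groups []*threadGroup, applicationCores int64) {
	budget := applicationCores // cap total charged ticks at the advertised core count
	for _, tg := range groups {
		charge := tg.running
		if charge > budget {
			charge = budget
		}
		budget -= charge
		tg.cpuClock += charge
		// Expire CPU timers directly, instead of polling from other goroutines.
		for _, t := range tg.timers {
			if t.target != 0 && tg.cpuClock >= t.target {
				t.fire()
				t.target = 0 // one-shot in this sketch; a real timer may rearm
			}
		}
	}
}

func main() {
	tg := &threadGroup{running: 2}
	tg.timers = append(tg.timers, &cpuTimer{
		target: 4,
		fire:   func() { fmt.Println("timer expired") },
	})
	for i := 0; i < 3; i++ {
		tick([]*threadGroup{tg}, 4)
	}
	fmt.Println("thread group CPU ticks:", tg.cpuClock) // 6
}
```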
Minor cleanup changes:
Gather all stateify hooks in kernel_state.go.
Replace kernel.randInt31n() with math/rand/v2, which fixes the same problem (https://go.dev/blog/randv2#problem.rand).
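For reference, the replacement direction looks like the following (kernel.randInt31n itself is not reproduced here):

```go
// math/rand/v2 provides bounded random integers directly, so the hand-rolled
// helper can be dropped.
package main

import (
	"fmt"
	"math/rand/v2"
)

func main() {
	// Uniformly distributed value in [0, 10).
	fmt.Println(rand.Int32N(10))
}
```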
Test workload:
Before this CL:
After this CL: