sched-ext / scx

sched_ext schedulers and tools
https://bit.ly/scx_slack
GNU General Public License v2.0
692 stars 48 forks source link

scx_rustland: introduce virtual time slice #333

Closed arighi closed 1 month ago

arighi commented 1 month ago

Overview

Currently, a task's time slice is determined based on the total number of tasks waiting to be scheduled: the more overloaded the system, the shorter the time slice.

This approach can help to reduce the average wait time of all tasks, allowing them to progress more slowly, but uniformly, thus providing a smoother overall system performance.

However, under heavy system load, this approach can lead to very short time slices distributed among all tasks, causing excessive context switches that can badly affect soft real-time workloads.

Moreover, the scheduler tends to operate in a bursty manner (tasks are queued and dispatched in bursts). This can also result in fluctuations of longer and shorter time slices, depending on the number of tasks still waiting in the scheduler's queue.

Such behavior can also negatively impact on soft real-time workloads, such as real-time audio processing.

Virtual time slice

To mitigate this problem, introduce the concept of virtual time slice: the idea is to evaluate the optimal time slice of a task, considering the vruntime as a deadline for the task to complete its work before releasing the CPU.

This is accomplished by calculating the difference between the task's vruntime and the global current vruntime and use this value as the task time slice:

task_slice = task_vruntime - min_vruntime

In this way, tasks that "promise" to release the CPU quickly (based on their previous work pattern) get a much higher priority (due to vruntime-based scheduling and the additional priority boost for being classified as interactive), but they are also given a shorter time slice to complete their work and fulfill their promise of rapidity.

At the same time tasks that are more CPU-intensive get de-prioritized, but they will tend to have a longer time slice available, reducing in this way the amount of context switches that can negatively affect their performance.

In conclusion, latency-sensitive tasks get a high priority and a short time slice (and they can preempt other tasks), CPU-intensive tasks get low priority and a long time slice.

Example

Let's consider the following theoretical scenario:

task | time -----+----- A | 1 B | 3 C | 6 D | 6

In this case task A represents a short interactive task, task C and D are CPU-intensive tasks and task B is mainly interactive, but it also requires some CPU time.

With a uniform time slice, scaled based on the amount of tasks, the scheduling looks like this (assuming the time slice is 2):

A B B C C D D A B C C D D C C D D | | | | | | | | | --------------------`----> 9 context switches

With the virtual time slice the scheduling changes to this:

A B B C C C D A B C C C D D D D D | | | | | | | ----------------`----------> 7 context switches

In the latter scenario, tasks do not receive the same time slice scaled by the total number of tasks waiting to be scheduled. Instead, their time slice is adjusted based on their previous CPU usage. Tasks that used more CPU time are given longer slices and their processing time tends to be packed together, reducing the amount of context switches.

Meanwhile, latency-sensitive tasks can still be processed as soon as they need to, because they get a higher priority and they can preempt other tasks. However, they will get a short time slice, so tasks that were incorrectly classified as interactive will still be forced to release the CPU quickly.

Experimental results

This patch has been tested on a on a 8-cores AMD Ryzen 7 5800X 8-Core Processor (16 threads with SMT), 16GB RAM, NVIDIA GeForce RTX 3070.

The test case involves the usual benchmark of playing a video game while simultaneously overloading the system with a parallel kernel build (make -j32).

The average frames per second (fps) reported by Steam is used as a metric for measuring system responsiveness (the higher the better):

Game | before | after | delta | ---------------------------+---------+---------+--------+ Baldur's Gate 3 | 40 fps | 48 fps | +20.0% | Counter-Strike 2 | 8 fps | 15 fps | +87.5% | Cyberpunk 2077 | 41 fps | 46 fps | +12.2% | Terraria | 98 fps | 108 fps | +10.2% | Team Fortress 2 | 81 fps | 92 fps | +13.6% | WebGL demo (firefox) [1] | 32 fps | 42 fps | +31.2% | ---------------------------+---------+---------+--------+

Apart from the massive boost with Counter-Strike 2 (that should be taken with a grain of salt, considering the overall poor performance in both cases), the virtual time slice seems to systematically provide a boost in responsiveness of around +10-20% fps.

It also seems to significantly prevent potential audio cracking issues when the system is massively overloaded: no audio cracking was detected during the entire run of these tests with the virtual deadline change applied.

[1] https://webglsamples.org/aquarium/aquarium.html