sched-ext / scx

sched_ext schedulers and tools
https://bit.ly/scx_slack
GNU General Public License v2.0

Large amount of redundant task migration #330

Closed vax-r closed 1 month ago

vax-r commented 1 month ago

Description

Task migration is performed widely throughout scheduling operations, and it is a rather costly operation. In EEVDF/CFS, softirqs are used to perform system-wide CPU load balancing, which may include task migration.

Task migration happens especially frequently when the system is heavily loaded or the load is unbalanced. Take the following Python code snippet, which generates a mixture of CPU-bound and I/O-bound workloads that can cause significant CPU load imbalance on the system.

import threading
import os
import time

# Generate I/O load: write size_mb MB of random data, read it back, then delete the file.
def io_bound_task(file_path, size_mb):
    with open(file_path, 'wb') as f:
        f.write(os.urandom(size_mb * 1024 * 1024))

    with open(file_path, 'rb') as f:
        data = f.read()

    os.remove(file_path)

# Spawn num_threads threads, each running io_bound_task on its own file.
def generate_io_load(num_threads, size_mb):
    threads = []
    for i in range(num_threads):
        file_path = f'test_file_{i}.dat'
        t = threading.Thread(target=io_bound_task, args=(file_path, size_mb))
        t.start()
        threads.append(t)

    for t in threads:
        t.join()

# Generate CPU load: busy-loop on integer arithmetic for duration seconds.
def cpu_bound_task(duration):
    end_time = time.time() + duration
    while time.time() < end_time:
        result = 0
        for i in range(10000):
            result += i ** 2

# Spawn num_threads threads, each running cpu_bound_task for duration seconds.
def generate_cpu_load(num_threads, duration):
    threads = []
    for i in range(num_threads):
        t = threading.Thread(target=cpu_bound_task, args=(duration,))
        t.start()
        threads.append(t)

    for t in threads:
        t.join()

if __name__ == "__main__":
    # One controller thread per load type: 1200 I/O workers writing/reading
    # 100 MB files, plus 12 CPU workers busy-looping for 300 seconds.
    io_threads = threading.Thread(target=generate_io_load, args=(1200, 100))
    cpu_threads = threading.Thread(target=generate_cpu_load, args=(12, 300))

    io_threads.start()
    cpu_threads.start()

    io_threads.join()
    cpu_threads.join()

Using the mpstat command, I monitor the per-CPU usage rate over 60 seconds and calculate the maximum load difference among all CPUs.

$ mpstat -P ALL 1 60
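
The imbalance figures below come from post-processing that mpstat output: for every 1-second sample I take the gap between the busiest and the idlest CPU and keep the maximum over the whole window. A minimal sketch of that post-processing (my own script, not part of mpstat; it assumes a 24-hour time locale so the CPU id is the second column and %idle is the last column) looks like this:

import subprocess

def max_cpu_imbalance(mpstat_output):
    worst, sample = 0.0, {}   # sample: cpu id -> busy% within one 1-second interval
    for line in mpstat_output.splitlines():
        if line.startswith("Average"):               # skip the trailing averages block
            break
        f = line.split()
        if len(f) >= 4 and f[1].isdigit():           # per-CPU data row
            sample[f[1]] = 100.0 - float(f[-1].replace(",", "."))  # busy = 100 - %idle
        elif sample:                                  # header or "all" row ends the interval
            worst = max(worst, max(sample.values()) - min(sample.values()))
            sample = {}
    if sample:
        worst = max(worst, max(sample.values()) - min(sample.values()))
    return worst

out = subprocess.run(["mpstat", "-P", "ALL", "1", "60"],
                     capture_output=True, text=True).stdout
print(f"Maximum CPU load imbalance over the period: {max_cpu_imbalance(out):.2f}%")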

Comparing the results for EEVDF and scx_lavd:

EEVDF Maximum CPU load imbalance over the period: 51.00%
LAVD Maximum CPU load imbalance over the period: 50.51%

Then we use perf stat to record the number of task migrations performed.

The result for EEVDF is

$ sudo perf stat -e sched:sched_migrate_task -a sleep 60

 Performance counter stats for 'system wide':

            25,733      sched:sched_migrate_task

      60.002513562 seconds time elapsed

The result for scx_lavd is

$ sudo perf stat -e sched:sched_migrate_task -a sleep 60

 Performance counter stats for 'system wide':

           128,343      sched:sched_migrate_task

      60.004100279 seconds time elapsed

While showing almost the same maximum CPU load imbalance ratio, scx_lavd performed far more task migrations than EEVDF. The situation doesn't just exist in scx_lavd, but also in scx_rusty and scx_central, as far as I tested.

Possible enhancement

In the Linux kernel, there is a helper function can_migrate_task() that determines whether a selected task should be migrated or not, based on its NUMA locality and cache affinity. We should have a similar mechanism in scx schedulers to determine whether a task migration should happen.

Since scx provides an excellent framework for implementing a scheduler or load balancer in user space, as scx_rusty and scx_rustland do, I think we can build this mechanism in user space as well.
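
As a rough illustration of what such a user-space gate could look like, here is a minimal sketch loosely mirroring the cache-hotness and NUMA checks of can_migrate_task(). All names, data structures and thresholds below are made up for illustration; none of this is code from an existing scx scheduler.

import time

CACHE_HOT_NS = 500_000    # assumed cache-hot window, in the spirit of sched_migration_cost_ns
MIN_LOAD_GAP = 25.0       # assumed minimum load gap (%) before a migration is worthwhile

def should_migrate(task, src_cpu, dst_cpu, topo, now_ns=None):
    """Return True only when moving the task is likely to pay for the cache/NUMA penalty."""
    now_ns = now_ns or time.monotonic_ns()

    # 1. Cache affinity: a task that ran very recently on src_cpu is likely still cache hot.
    if now_ns - task["last_ran_ns"] < CACHE_HOT_NS:
        return False

    # 2. NUMA locality: avoid crossing nodes unless we move toward the task's preferred node.
    src_node, dst_node = topo["cpu_node"][src_cpu], topo["cpu_node"][dst_cpu]
    if src_node != dst_node and dst_node != task.get("preferred_node", src_node):
        return False

    # 3. Load delta: only migrate if it meaningfully reduces the imbalance.
    return topo["cpu_load"][src_cpu] - topo["cpu_load"][dst_cpu] >= MIN_LOAD_GAP

topo = {"cpu_node": {0: 0, 6: 1}, "cpu_load": {0: 90.0, 6: 40.0}}
task = {"last_ran_ns": time.monotonic_ns() - 2_000_000, "preferred_node": 1}
print(should_migrate(task, 0, 6, topo))   # True under these made-up numbers

Hard-coding thresholds like these is exactly the part I would rather learn from data, which is what the idea below is about.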

I would like to try implementing a machine-learning-based method to solve this problem: train an ML model that imitates the behavior of can_migrate_task() in the Linux kernel, providing lightweight and fast inference to determine whether a task should be migrated or not. The idea comes from the reference paper linked below.

If it works out, we should be able to avoid redundant and even harmful task migration within scx schedulers.
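
To make the direction concrete, here is a toy sketch of the imitation-learning step with entirely synthetic features and labels; in a real experiment the rows would come from tracing can_migrate_task() decisions and the feature set would be whatever the user-space balancer can observe cheaply.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features per migration decision:
#   [ns since the task last ran, same NUMA node (0/1), load gap between CPUs (%)]
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 2_000_000, 10_000),
    rng.integers(0, 2, 10_000),
    rng.uniform(0.0, 100.0, 10_000),
])
# Synthetic stand-in for "what can_migrate_task() would have decided".
y = ((X[:, 0] > 500_000) & ((X[:, 1] == 1) | (X[:, 2] > 25))).astype(int)

# A shallow tree keeps inference cheap enough for a balancer hot path.
model = DecisionTreeClassifier(max_depth=4).fit(X, y)
print(model.predict([[1_000_000, 1, 40.0]]))   # expected: [1], i.e. allow the migration

If the accuracy holds up, a tree this small can be flattened into a handful of comparisons, so the decision stays cheap at runtime.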

What do you think? Please let me know everything that should be considered, and if it's feasible I would like to start the experiments and work on the implementation.

Reference Paper

Machine learning for load balancing in the Linux kernel

htejun commented 1 month ago

It of course depends on the workload, but migrations aren't that expensive. There are several factors to consider.

The production experience in the Meta fleet has been that there isn't a lot of value in sacrificing work conservation for L1/L2 locality (L3 is a different matter). It'd probably be a good idea to verify the ideas against some set of realistic workloads rather than trying to optimize a metric which may not matter.

vax-r commented 1 month ago

@htejun - Really grateful for your suggestions! So what we should really consider is something like runqueue latency, since the motivation for task migration is to let the task start running as soon as possible. The reason we do CPU load balancing is to avoid overloaded CPUs, which can lead to longer runqueue latency; that's my understanding. I wonder if that's correct?

If that's the case, task migration isn't that expensive and we do it in order to let tasks start running sooner, so we should expect to see shorter runqueue latency in scx schedulers such as scx_lavd, scx_central and scx_rusty than in EEVDF (since the number of task migrations is higher).

I did a further experiment to observe the system's runqueue latency under a heavily loaded scenario: using stress-ng --cpu 12 -l 100 to keep the CPUs heavily loaded while running the program mentioned above to generate a mixture of I/O-bound and CPU-bound workloads.

The result for EEVDF is

$ sudo runqlat 5 12
Tracing run queue latency... Hit Ctrl-C to end.

     usecs               : count    distribution
         0 -> 1          : 9585     |****************************************|
         2 -> 3          : 4559     |*******************                     |
         4 -> 7          : 6865     |****************************            |
         8 -> 15         : 7453     |*******************************         |
        16 -> 31         : 4279     |*****************                       |
        32 -> 63         : 1466     |******                                  |
        64 -> 127        : 426      |*                                       |
       128 -> 255        : 220      |                                        |
       256 -> 511        : 193      |                                        |
       512 -> 1023       : 381      |*                                       |
      1024 -> 2047       : 1069     |****                                    |
      2048 -> 4095       : 6936     |****************************            |
      4096 -> 8191       : 1987     |********                                |
      8192 -> 16383      : 348      |*                                       |
     16384 -> 32767      : 20       |                                        |
[...]

The result for scx_lavd is

$ sudo runqlat 5 12
Tracing run queue latency... Hit Ctrl-C to end.

     usecs               : count    distribution
         0 -> 1          : 2125     |************                            |
         2 -> 3          : 4143     |************************                |
         4 -> 7          : 2562     |**************                          |
         8 -> 15         : 2050     |***********                             |
        16 -> 31         : 1288     |*******                                 |
        32 -> 63         : 607      |***                                     |
        64 -> 127        : 369      |**                                      |
       128 -> 255        : 286      |*                                       |
       256 -> 511        : 384      |**                                      |
       512 -> 1023       : 4793     |****************************            |
      1024 -> 2047       : 3237     |******************                      |
      2048 -> 4095       : 6846     |****************************************|
      4096 -> 8191       : 438      |**                                      |
      8192 -> 16383      : 7        |                                        |
[...]

The result for scx_central is

$ sudo runqlat 5 12
Tracing run queue latency... Hit Ctrl-C to end.

     usecs               : count    distribution
         0 -> 1          : 28       |                                        |
         2 -> 3          : 151      |                                        |
         4 -> 7          : 27       |                                        |
         8 -> 15         : 47       |                                        |
        16 -> 31         : 656      |***                                     |
        32 -> 63         : 2365     |***********                             |
        64 -> 127        : 2817     |*************                           |
       128 -> 255        : 2204     |**********                              |
       256 -> 511        : 1702     |********                                |
       512 -> 1023       : 8124     |****************************************|
      1024 -> 2047       : 7656     |*************************************   |
      2048 -> 4095       : 6815     |*********************************       |
      4096 -> 8191       : 2084     |**********                              |
      8192 -> 16383      : 267      |*                                       |
     16384 -> 32767      : 15       |                                        |
[...]

We can see from the results that the distribution of runqueue latency is better under EEVDF, which isn't what I expected, because I assumed that since scx_central and scx_lavd perform more task migrations, they should show a better runqueue latency distribution.
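
To compare the three histograms with a single number, I summarize each one by a bucket-midpoint weighted mean (a rough approximation, since every event in a bucket is assumed to sit at its midpoint); the same few lines work for the scx_lavd and scx_central tables above.

# Approximate mean runqueue latency (usecs) from a runqlat-style histogram,
# given as (low, high, count) tuples copied from the output above.
def approx_mean_usecs(buckets):
    total = sum(c for _, _, c in buckets)
    return sum((lo + hi) / 2 * c for lo, hi, c in buckets) / total

eevdf = [(0, 1, 9585), (2, 3, 4559), (4, 7, 6865), (8, 15, 7453), (16, 31, 4279),
         (32, 63, 1466), (64, 127, 426), (128, 255, 220), (256, 511, 193),
         (512, 1023, 381), (1024, 2047, 1069), (2048, 4095, 6936),
         (4096, 8191, 1987), (8192, 16383, 348), (16384, 32767, 20)]
print(f"EEVDF: ~{approx_mean_usecs(eevdf):.0f} usecs mean runqueue latency")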

So maybe it is worth trying to determine whether a task should be migrated or not? About the realistic workload you mentioned, do you have any suggestions for scx_central in this kind of situation? I think scx_central is a rather simple case in which I can focus more on the mechanism for deciding task migration; other schedulers have many metrics and factors to consider, so the case won't be that pure.

htejun commented 1 month ago

I'd strongly recommend finding a somewhat realistic workload and working from there. Not that these intermediate metrics aren't important, but they can be really misleading in isolation. They can only be interpreted in the context of what the workload is actually doing and what the desirable outcome is. e.g. Imagine two schedulers: one schedules all threads in global FIFO order, running them in very short bursts; the other classifies tasks into interactive and batch and then applies different time slices and execution cadence. The former will result in better runqlat, but you probably don't want to use it as the scheduler on your machine.

vax-r commented 1 month ago

I'd strongly recommend finding a somewhat realistic workload and working from there. Not that these intermediate metrics aren't important, but they can be really misleading in isolation. They can only be interpreted in the context of what the workload is actually doing and what the desirable outcome is. e.g. Imagine two schedulers: one schedules all threads in global FIFO order, running them in very short bursts; the other classifies tasks into interactive and batch and then applies different time slices and execution cadence. The former will result in better runqlat, but you probably don't want to use it as the scheduler on your machine.

I get what you're saying: what we should really focus on is the metric which affects the performance of our workloads directly (e.g. FPS for gaming), and that's why I should start from a realistic workload. Intermediate metrics like runqlat may vary across scenarios, so I should check what kind of realistic workloads will trigger behavior similar to the program I mentioned above. I guess that's what you're talking about?

vax-r commented 1 month ago

@htejun - What if I generate the workloads through containers? For example, 5 containers running I/O-intensive work such as a database, and 5 containers running CPU-intensive work such as training an ML model or doing a large amount of calculation, while using scx_central, which claims to be friendlier to VM servers and which I assume should work well with containers too. Do you think this is an appropriate scenario to run as a test?

htejun commented 1 month ago

I don't think I can guide you through specific scenarios, but I personally find things a lot easier to understand and build upon if it's something I'm intuitively familiar with. There are of course multiple ways to go about this, but it might be useful to start with something you understand intuitively and keep expanding your intuition about how scheduling interacts with the workload and what aspects of scheduling behavior different measurements and metrics capture.

vax-r commented 1 month ago

@htejun - I think I understand what you mean now; thanks so much for your valuable opinions and for the discussion. I have a better sense now that the top priority is the metrics which affect the workloads directly, rather than overemphasizing others. I'll rethink everything, and I'll close the issue for now. Thanks again for your detailed insight, patience and kindness.