google / marl

A hybrid thread / fiber task scheduler written in C++ 11
Apache License 2.0

Fixed a race condition causing some workers to perform multiple tasks serially while other workers sleep. #261

Closed: natepaynefb closed this 11 months ago

natepaynefb commented 11 months ago

This pull request significantly improves performance in some cases, and it addresses the issue discussed here: https://github.com/google/marl/issues/232

The Problem:

A single worker will often execute multiple tasks serially while other workers are idle. The tasks need not be short: a single worker can run two tasks that each take multiple seconds while other workers continue to sleep. This problem is especially common with the fork-and-join pattern. Suppose you wait for N workers to finish their last task, so that you now have N workers spinning for work, and then you enqueue N new tasks. Some of the workers are likely to go to sleep prematurely, leaving the N tasks to be performed by fewer than N worker threads.

The Results:

This problem can be demonstrated with the following example code, which I instrumented using the Tracy profiler (https://github.com/wolfpld/tracy):

#include <marl/scheduler.h>
#include <marl/task.h>
#include <marl/waitgroup.h>

#include <chrono>
#include <thread>

#include <tracy/Tracy.hpp>  // Tracy profiler macros; the header path may vary by Tracy version

int main() {
  constexpr int threadCount = 8;
  constexpr int repeatCount = 8;
  marl::Scheduler::Config cfg;
  cfg.workerThread.count = threadCount;
  marl::Scheduler scheduler(cfg);
  for (int repeat = 0; repeat < repeatCount; ++repeat) {
    marl::WaitGroup wg;
    // Fork: enqueue one 10ms task per worker thread.
    for (int i = 0; i < threadCount; ++i) {
      wg.add(1);
      scheduler.enqueue(marl::Task{[wg]() {
        ZoneScopedN("DoTask");
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        wg.done();
      }});
    }
    wg.wait();  // Join: block until every task in this batch has finished.
  }
}

A typical run of this code on my 32-core AMD Threadripper looks like this (Tracy capture: marl_trace_before):

In this example, we had 8 worker threads and we enqueued 8 tasks at a time. Ideally each task would run on a different worker thread so that they all execute concurrently. However, as you can see from the above image, a single worker often ends up doing multiple tasks while other worker(s) are idle. Here is what it looks like after applying the changes in this PR (Tracy capture: marl_trace_after):

The difference in performance is also seen in the new benchmark, which is included in this PR. Here are the results for my 32-core AMD Threadripper:

BEFORE:
--------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations
--------------------------------------------------------------------------------------------
Schedule/MultipleForkAndJoin/tasks:512/threads:0  14460951100 ns   14453125000 ns            1
Schedule/MultipleForkAndJoin/tasks:512/threads:1  14369679600 ns        0.000 ns            1
Schedule/MultipleForkAndJoin/tasks:512/threads:2  7483740600 ns        0.000 ns            1
Schedule/MultipleForkAndJoin/tasks:512/threads:4  4574613900 ns        0.000 ns            1
Schedule/MultipleForkAndJoin/tasks:512/threads:8  2448965490 ns        0.000 ns           10
Schedule/MultipleForkAndJoin/tasks:512/threads:16 1835931720 ns      1562500 ns           10
Schedule/MultipleForkAndJoin/tasks:512/threads:32  947968840 ns      3125000 ns           10
Schedule/MultipleForkAndJoin/tasks:512/threads:64  511230270 ns        0.000 ns           10

AFTER:
--------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations
--------------------------------------------------------------------------------------------
Schedule/MultipleForkAndJoin/tasks:512/threads:0  14724095300 ns   14656250000 ns            1
Schedule/MultipleForkAndJoin/tasks:512/threads:1  14939546400 ns     15625000 ns            1
Schedule/MultipleForkAndJoin/tasks:512/threads:2  7732024900 ns     15625000 ns            1
Schedule/MultipleForkAndJoin/tasks:512/threads:4  3668345300 ns        0.000 ns            1
Schedule/MultipleForkAndJoin/tasks:512/threads:8  1936010700 ns      3125000 ns           10
Schedule/MultipleForkAndJoin/tasks:512/threads:16  981641520 ns        0.000 ns           10
Schedule/MultipleForkAndJoin/tasks:512/threads:32  778781590 ns      4687500 ns           10
Schedule/MultipleForkAndJoin/tasks:512/threads:64  489064570 ns        0.000 ns           10

And here are the benchmark results measured on my MacBook Pro with a 10-core M1 Max:

BEFORE:
--------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations
--------------------------------------------------------------------------------------------
Schedule/MultipleForkAndJoin/tasks:512/threads:0  2.0409e+10 ns   2.0406e+10 ns            1
Schedule/MultipleForkAndJoin/tasks:512/threads:1  2.0370e+10 ns      2677000 ns            1
Schedule/MultipleForkAndJoin/tasks:512/threads:2  1.0619e+10 ns      1660000 ns            1
Schedule/MultipleForkAndJoin/tasks:512/threads:4  6006134250 ns      1007000 ns            1
Schedule/MultipleForkAndJoin/tasks:512/threads:8  5090412292 ns      1211000 ns            1
Schedule/MultipleForkAndJoin/tasks:512/threads:10 4149048084 ns      1398000 ns            1

AFTER:
--------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations
--------------------------------------------------------------------------------------------
Schedule/MultipleForkAndJoin/tasks:512/threads:0  2.0393e+10 ns   2.0392e+10 ns            1
Schedule/MultipleForkAndJoin/tasks:512/threads:1  2.0420e+10 ns      2813000 ns            1
Schedule/MultipleForkAndJoin/tasks:512/threads:2  1.0520e+10 ns      1683000 ns            1
Schedule/MultipleForkAndJoin/tasks:512/threads:4  5403471458 ns       963000 ns            1
Schedule/MultipleForkAndJoin/tasks:512/threads:8  4604350166 ns      1697000 ns            1
Schedule/MultipleForkAndJoin/tasks:512/threads:10 3125431250 ns      1866000 ns            1

Explanation of the Problem:

1) There were two race conditions in the spinForWork function. First: while spinning, a worker might discover that (work.num > 0), indicating that new work had been assigned to it. It would then stop spinning, try to lock the mutex, and discover that (work.num == 0) because another spinning worker had stolen the task from it. With no tasks to perform, the thread would then go to sleep. It should have continued spinning until it had a task (one that could no longer be stolen from it), or until the timeout. Meanwhile, the enqueue function may assign a new task to the worker that just stole one, so that worker ends up with two tasks while the other worker sleeps.

2) The second race condition was similar. While spinning for work, one thread might successfully steal work from another. It would then stop spinning, lock the mutex, and discover that (work.num == 0) because the work had, in turn, been stolen from it. With no work, the thread would then go to sleep. Again, it should have continued spinning until it had work or until the timeout. I have observed that it is actually quite common for a task to be stolen multiple times in a row when there are several spinning workers.

3) It is also possible for the enqueue function to assign a new task to a thread that already has work to do. This can happen while other worker threads are asleep. If there are no spinning workers, then nobody can steal the extra task(s). This is another situation in which one worker may spend a long time executing tasks serially while other workers sleep.
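To make the two spinForWork races easier to see, here is a small, self-contained model of the racy spin-then-sleep pattern described in points 1) and 2). This is only an illustrative sketch, not marl's actual code; the Worker struct, stealFromOther(), and spinThenSleep() are simplified stand-ins for the real scheduler internals.

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

struct Worker {
  std::mutex mutex;
  std::condition_variable added;
  std::deque<std::function<void()>> tasks;  // guarded by mutex
  std::atomic<int> num{0};                  // mirrors tasks.size(); readable without the lock
};

// Stand-in for work stealing: try to take a task from another worker.
bool stealFromOther(Worker&) { return false; }  // stub; always fails in this sketch

void spinThenSleep(Worker& w) {
  auto deadline = std::chrono::steady_clock::now() + std::chrono::milliseconds(1);
  while (std::chrono::steady_clock::now() < deadline) {
    if (w.num.load() > 0) {
      // Race 1: between this unlocked check and taking the lock below, another
      // spinning worker can steal the task that was just assigned to this one.
      break;
    }
    if (stealFromOther(w)) {
      // Race 2: the task this worker just stole can itself be stolen again
      // before the lock below is taken.
      break;
    }
  }
  std::unique_lock<std::mutex> lock(w.mutex);
  if (w.num.load() == 0) {
    // Pre-fix behavior: go to sleep even though the worker believed it had
    // work a moment ago. The fix described above keeps spinning until the
    // worker holds a task that can no longer be stolen, or until the timeout.
    w.added.wait_for(lock, std::chrono::milliseconds(10));
  }
}

int main() {
  Worker w;
  spinThenSleep(w);  // exercises the (empty) racy path once
}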

Solutions:

ben-clayton commented 11 months ago

Firstly, my apologies for the delay in responding to this - it's been a very busy few weeks.

The extremely thorough write-up is very much appreciated - excellent work! The code changes also look good. I have no requests for changes.

I did try benchmarking before and after this change, and I see mixed results:

go run ./tools/cmd/benchdiff/main.go old.txt new.txt

Delta               | Test name                                       | (A) old.txt  | (B) new.txt
+1.22x +26.989027ms | Schedule/Empty/tasks:262144/threads:8           | 123.310038ms | 150.299065ms
+1.20x +23.860334ms | Schedule/Empty/tasks:262144/threads:16          | 118.323588ms | 142.183922ms
+1.14x +15.550917ms | Schedule/Empty/tasks:262144/threads:128         | 107.884675ms | 123.435592ms
+1.14x +26.547827ms | Schedule/WaitGroup/tasks:262144/threads:8       | 189.861532ms | 216.409359ms
+1.14x +19.110006ms | Schedule/Empty/tasks:262144/threads:4           | 137.213622ms | 156.323628ms
+1.14x +23.777794ms | Schedule/WaitGroup/tasks:262144/threads:32      | 172.705833ms | 196.483627ms
+1.11x +12.103355ms | Schedule/Empty/tasks:262144/threads:32          | 114.35609ms  | 126.459445ms
+1.11x +72.549µs    | Schedule/Ticket/tasks:512/threads:1             | 687.316µs    | 759.865µs
+1.08x +8.820488ms  | Schedule/Empty/tasks:262144/threads:64          | 106.878291ms | 115.698779ms
+1.07x +14.902732ms | Schedule/WaitGroup/tasks:262144/threads:2       | 202.33975ms  | 217.242482ms
+1.07x +94.295µs    | Schedule/Event/tasks:512/threads:4              | 1.443245ms   | 1.53754ms
+1.06x +78.095µs    | Schedule/Event/tasks:512/threads:2              | 1.263009ms   | 1.341104ms
+1.05x +10.325158ms | Schedule/WaitGroup/tasks:262144/threads:4       | 205.738469ms | 216.063627ms
+1.04x +7.813932ms  | Schedule/WaitGroup/tasks:262144/threads:16      | 189.040758ms | 196.85469ms
+1.02x +2.051931ms  | Schedule/WaitGroup/tasks:262144/threads:1       | 133.807981ms | 135.859912ms
+1.02x +1.479942ms  | Schedule/EventBaton/tasks:262144/threads:32     | 96.728094ms  | 98.208036ms
+1.01x +15.027µs    | Schedule/Ticket/tasks:512/threads:128           | 1.06263ms    | 1.077657ms
+1.01x +2.318292ms  | Schedule/WaitGroup/tasks:262144/threads:128     | 171.273585ms | 173.591877ms
+1.01x +18.574µs    | Schedule/Event/tasks:512/threads:64             | 1.513371ms   | 1.531945ms
-1.01x -324ms       | Schedule/SomeWork/tasks:262144/threads:1        | 30.972s      | 30.648s
-1.01x -1.074193ms  | Schedule/EventBaton/tasks:262144/threads:64     | 96.596443ms  | 95.52225ms
-1.01x -4.38938ms   | MultiQueueTaskExecutor/tasks:262144/threads:128 | 387.045115ms | 382.655735ms
-1.01x -199ms       | Schedule/SomeWork/tasks:262144/threads:2        | 15.571s      | 15.372s
-1.01x -15.616µs    | Schedule/Ticket/tasks:512/threads:8             | 1.146066ms   | 1.13045ms
-1.02x -1.90551ms   | Schedule/EventBaton/tasks:262144/threads:4      | 97.675636ms  | 95.770126ms
-1.03x -47.992µs    | Schedule/Event/tasks:512/threads:32             | 1.568828ms   | 1.520836ms
-1.03x -406.474µs   | Schedule/Empty/tasks:262144/threads:0           | 12.695743ms  | 12.289269ms
-1.04x -53.122µs    | Schedule/Event/tasks:512/threads:128            | 1.531908ms   | 1.478786ms
-1.04x -3.788623ms  | Schedule/EventBaton/tasks:262144/threads:128    | 99.03754ms   | 95.248917ms
-1.05x -1.94971ms   | Schedule/Empty/tasks:262144/threads:1           | 39.77893ms   | 37.82922ms
-1.10x -93.617µs    | Schedule/Ticket/tasks:512/threads:64            | 1.05547ms    | 961.853µs
-1.11x -73.189µs    | Schedule/Event/tasks:512/threads:1              | 710.41µs     | 637.221µs
-1.12x -15.692897ms | Schedule/Empty/tasks:262144/threads:2           | 151.811635ms | 136.118738ms
-1.17x -205.383µs   | Schedule/Ticket/tasks:512/threads:2             | 1.43338ms    | 1.227997ms
-1.18x -25.829246ms | Schedule/WaitGroup/tasks:262144/threads:64      | 169.490816ms | 143.66157ms

This was run on Ubuntu 22.04.2, AMD Ryzen Threadripper 3990X.

There's a bit of a concerning performance loss with the Schedule/Empty/tasks:* benchmarks.

Do you see any delta with these benchmarks?

natepaynefb commented 11 months ago

I think you will find that the results of Schedule/Empty/tasks:* are highly variable from one run to the next. I believe the extreme variation comes from the non-deterministic way in which timing dictates how many worker threads actually wake up. It is possible for a single worker thread to process all of the empty tasks, no matter how many threads are available; this happens if it can consume them as fast as the main thread produces them. Similarly, it is possible that only two workers must wake in order to keep up with task production. In the worst case, all of the threads wake up and most of them spin for work most of the time. On a hyperthreaded CPU, you can even end up with one thread spinning for work on the same core as the main (producer) thread. It is also possible for that many spinning workers to cause the CPU to start thermal throttling. Whatever the factors may be, my observation is that performance varies greatly from one run to the next.

I ran the benchmarks on a 32-core (64-thread) AMD Threadripper PRO 5975WX. Here are the results from 3 back-to-back runs with the original code:

---------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations
---------------------------------------------------------------------------------
Schedule/Empty/tasks:262144/threads:0    21343721 ns     21139706 ns           34
Schedule/Empty/tasks:262144/threads:1    41795427 ns     41666667 ns           15
Schedule/Empty/tasks:262144/threads:2    48099814 ns     47991071 ns           14
Schedule/Empty/tasks:262144/threads:4   156185475 ns    156250000 ns            4
Schedule/Empty/tasks:262144/threads:8   139947117 ns    140625000 ns            6
Schedule/Empty/tasks:262144/threads:16  142485000 ns    140625000 ns            4
Schedule/Empty/tasks:262144/threads:32  143859900 ns    144531250 ns            4
Schedule/Empty/tasks:262144/threads:64  143889820 ns    143750000 ns            5

Schedule/Empty/tasks:262144/threads:0    21214815 ns     21139706 ns           34
Schedule/Empty/tasks:262144/threads:1    43976994 ns     44117647 ns           17
Schedule/Empty/tasks:262144/threads:2    67486189 ns     67708333 ns            9
Schedule/Empty/tasks:262144/threads:4   155623150 ns    152343750 ns            4
Schedule/Empty/tasks:262144/threads:8   137090983 ns    138020833 ns            6
Schedule/Empty/tasks:262144/threads:16  148879560 ns    150000000 ns            5
Schedule/Empty/tasks:262144/threads:32  145914740 ns    143750000 ns            5
Schedule/Empty/tasks:262144/threads:64  161941525 ns    160156250 ns            4

Schedule/Empty/tasks:262144/threads:0    19763218 ns     19761029 ns           34
Schedule/Empty/tasks:262144/threads:1    30788527 ns     30539773 ns           22
Schedule/Empty/tasks:262144/threads:2    38317411 ns     38194444 ns           18
Schedule/Empty/tasks:262144/threads:4   159280840 ns    159375000 ns            5
Schedule/Empty/tasks:262144/threads:8   147584650 ns    145833333 ns            6
Schedule/Empty/tasks:262144/threads:16  138205200 ns    137500000 ns            5
Schedule/Empty/tasks:262144/threads:32  135730400 ns    134375000 ns            5
Schedule/Empty/tasks:262144/threads:64  134153200 ns    135416667 ns            6

Now, here are three example runs after my proposed changes:

---------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations
---------------------------------------------------------------------------------
Schedule/Empty/tasks:262144/threads:0    20070584 ns     20270270 ns           37
Schedule/Empty/tasks:262144/threads:1    53956900 ns     53125000 ns           10
Schedule/Empty/tasks:262144/threads:2    42398021 ns     42410714 ns           14
Schedule/Empty/tasks:262144/threads:4    41524213 ns     41015625 ns           16
Schedule/Empty/tasks:262144/threads:8    63105473 ns     63920455 ns           11
Schedule/Empty/tasks:262144/threads:16  109079417 ns    106770833 ns            6
Schedule/Empty/tasks:262144/threads:32  139433360 ns    140625000 ns            5
Schedule/Empty/tasks:262144/threads:64  156029750 ns    156250000 ns            4

Schedule/Empty/tasks:262144/threads:0    20379318 ns     20220588 ns           34
Schedule/Empty/tasks:262144/threads:1    39143360 ns     39062500 ns           20
Schedule/Empty/tasks:262144/threads:2    63923360 ns     64062500 ns           10
Schedule/Empty/tasks:262144/threads:4    44248065 ns     44117647 ns           17
Schedule/Empty/tasks:262144/threads:8    58330253 ns     58333333 ns           15
Schedule/Empty/tasks:262144/threads:16   67999227 ns     68181818 ns           11
Schedule/Empty/tasks:262144/threads:32   76331433 ns     76388889 ns            9
Schedule/Empty/tasks:262144/threads:64   75278278 ns     76388889 ns            9

Schedule/Empty/tasks:262144/threads:0    19727641 ns     19425676 ns           37
Schedule/Empty/tasks:262144/threads:1    34787335 ns     34375000 ns           20
Schedule/Empty/tasks:262144/threads:2    53674570 ns     53125000 ns           10
Schedule/Empty/tasks:262144/threads:4    41981969 ns     41992188 ns           16
Schedule/Empty/tasks:262144/threads:8    61302036 ns     61079545 ns           11
Schedule/Empty/tasks:262144/threads:16  107162883 ns    106770833 ns            6
Schedule/Empty/tasks:262144/threads:32  137468780 ns    137500000 ns            5
Schedule/Empty/tasks:262144/threads:64  158588650 ns    156250000 ns            4

Before my changes, the timing of the 64-thread case varied between 134ms and 162ms over 3 runs. After my changes, the timing of the 64-thread case varied between 75ms and 159ms over 3 runs.

If there is a meaningful difference in performance before/after, then I'm afraid it is lost in the noise.

natepaynefb commented 11 months ago

Even after my proposed changes, we still sometimes have problems with slow tasks being processed on the same thread while other threads sleep. There are multiple ways for this to happen:

1) Suppose two threads are spinning for work. You schedule a job to the first worker. It is stolen by the second worker. Then you schedule a job to the second worker. That 2nd job won't start running until the first worker steals it. Hopefully that will happen quickly, but it depends on the random behavior of the job-stealing code. If there are many threads in the pool, then most attempts to steal a job will fail because most of the threads have none. The effectiveness of job stealing decreases as the number of threads increases. In the worst case the first thread may run out of time and give up, leaving the second thread to perform both tasks, no matter how long they take.

2) The enqueue function may fail to find a spinning worker. This can happen if there are no spinning workers, or if the try_lock happened to fail. In either case, it will fall back on the round-robin approach, which can select a worker that is already awake and doing work. In that case, that single thread will execute multiple tasks serially. In theory the waiting task could be stolen, but only if there are spinning workers. If all other workers are either doing work or asleep then the task will not be stolen.

I wonder what you would suggest regarding the situations above. It seems like #1 might be fixed if there were a way for the enqueue function to assign a task to a spinning worker and guarantee that it would run on that worker (i.e., not be stolen the instant it was assigned).

One possible solution to #2 would be to pre-populate the spinning worker queue with all of the worker indices. When the enqueue function removes a non-negative index from the queue, it could guarantee that it will either use the index or put it back on the queue. In other words, we would not lose track of a spinning worker just because try_lock failed. If these things were true, then the enqueue function would only fail to find a spinning worker if all workers were awake and doing work, right? Note that all sleeping workers would be in the queue as well as spinning workers.
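As a rough illustration of the idea in the previous paragraph, here is a self-contained sketch of an enqueue path that pops a worker index from a queue pre-populated with every index and guarantees to either use the index or put it back. SpinningQueue, Worker, and enqueueTask are hypothetical names invented for this sketch; this is not marl's implementation.

#include <deque>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Queue of worker indices available to receive work. It is pre-populated with
// every index, so it only runs dry when all workers are awake and busy.
class SpinningQueue {
 public:
  explicit SpinningQueue(int workerCount) {
    for (int i = 0; i < workerCount; ++i) indices_.push_back(i);
  }
  // Returns a worker index, or -1 if the queue is empty.
  int take() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (indices_.empty()) return -1;
    int i = indices_.front();
    indices_.pop_front();
    return i;
  }
  void putBack(int i) {
    std::lock_guard<std::mutex> lock(mutex_);
    indices_.push_back(i);
  }

 private:
  std::mutex mutex_;
  std::deque<int> indices_;
};

struct Worker {
  std::mutex mutex;
  std::deque<std::function<void()>> tasks;  // guarded by mutex
};

// Either hand the task to the worker whose index was taken, or return the
// index to the queue before falling back, so an index is never lost just
// because try_lock failed.
void enqueueTask(SpinningQueue& spinning, std::vector<Worker>& workers,
                 std::function<void()> task) {
  int idx = spinning.take();
  if (idx >= 0) {
    Worker& w = workers[idx];
    if (w.mutex.try_lock()) {
      w.tasks.push_back(std::move(task));
      w.mutex.unlock();
      return;
    }
    spinning.putBack(idx);  // try_lock failed: don't lose track of this worker
  }
  // Fallback path (marl falls back to round-robin); a simplistic stand-in here.
  Worker& w = workers[0];
  std::lock_guard<std::mutex> lock(w.mutex);
  w.tasks.push_back(std::move(task));
}

int main() {
  std::vector<Worker> workers(4);
  SpinningQueue spinning(static_cast<int>(workers.size()));
  enqueueTask(spinning, workers, [] { /* task body */ });
}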

Do you understand the problems described above? What do you think of the possible solutions I listed? Do you have other suggestions? I may try some of these things myself if I can find the time. If I find solutions that don't hurt the existing benchmarks, then I may prepare a follow-up PR in the future. I like the features provided by marl, but I am going to need solutions to these problems if I am to continue using marl for my current project.

natepaynefb commented 11 months ago

I should clarify that my AMD benchmark results came from a Windows computer. When you run the tests, are the results consistent from one run to the next?

ben-clayton commented 11 months ago

For Linux, yes, there was a consistent slowdown for the Schedule/Empty/tasks benchmarks. I also see a slowdown on macOS, but less severe. Windows does look evenly noisy.

Let's just go for it. Those profile screenshots are compelling.

ben-clayton commented 11 months ago

Kokoro failures are unrelated. Landing.

Thank you very much for your excellent contribution!