Firstly, my apologies for the delay in responding to this - it's been a very busy few weeks.
The extremely thorough write-up is very much appreciated - excellent work! The code changes also look good. I have no requests for changes.
I did try benchmarking before and after this change, and I see mixed results:
go run ./tools/cmd/benchdiff/main.go old.txt new.txt
Delta | Test name | (A) old.txt | (B) new.txt
+1.22x +26.989027ms | Schedule/Empty/tasks:262144/threads:8 | 123.310038ms | 150.299065ms
+1.20x +23.860334ms | Schedule/Empty/tasks:262144/threads:16 | 118.323588ms | 142.183922ms
+1.14x +15.550917ms | Schedule/Empty/tasks:262144/threads:128 | 107.884675ms | 123.435592ms
+1.14x +26.547827ms | Schedule/WaitGroup/tasks:262144/threads:8 | 189.861532ms | 216.409359ms
+1.14x +19.110006ms | Schedule/Empty/tasks:262144/threads:4 | 137.213622ms | 156.323628ms
+1.14x +23.777794ms | Schedule/WaitGroup/tasks:262144/threads:32 | 172.705833ms | 196.483627ms
+1.11x +12.103355ms | Schedule/Empty/tasks:262144/threads:32 | 114.35609ms | 126.459445ms
+1.11x +72.549µs | Schedule/Ticket/tasks:512/threads:1 | 687.316µs | 759.865µs
+1.08x +8.820488ms | Schedule/Empty/tasks:262144/threads:64 | 106.878291ms | 115.698779ms
+1.07x +14.902732ms | Schedule/WaitGroup/tasks:262144/threads:2 | 202.33975ms | 217.242482ms
+1.07x +94.295µs | Schedule/Event/tasks:512/threads:4 | 1.443245ms | 1.53754ms
+1.06x +78.095µs | Schedule/Event/tasks:512/threads:2 | 1.263009ms | 1.341104ms
+1.05x +10.325158ms | Schedule/WaitGroup/tasks:262144/threads:4 | 205.738469ms | 216.063627ms
+1.04x +7.813932ms | Schedule/WaitGroup/tasks:262144/threads:16 | 189.040758ms | 196.85469ms
+1.02x +2.051931ms | Schedule/WaitGroup/tasks:262144/threads:1 | 133.807981ms | 135.859912ms
+1.02x +1.479942ms | Schedule/EventBaton/tasks:262144/threads:32 | 96.728094ms | 98.208036ms
+1.01x +15.027µs | Schedule/Ticket/tasks:512/threads:128 | 1.06263ms | 1.077657ms
+1.01x +2.318292ms | Schedule/WaitGroup/tasks:262144/threads:128 | 171.273585ms | 173.591877ms
+1.01x +18.574µs | Schedule/Event/tasks:512/threads:64 | 1.513371ms | 1.531945ms
-1.01x -324ms | Schedule/SomeWork/tasks:262144/threads:1 | 30.972s | 30.648s
-1.01x -1.074193ms | Schedule/EventBaton/tasks:262144/threads:64 | 96.596443ms | 95.52225ms
-1.01x -4.38938ms | MultiQueueTaskExecutor/tasks:262144/threads:128 | 387.045115ms | 382.655735ms
-1.01x -199ms | Schedule/SomeWork/tasks:262144/threads:2 | 15.571s | 15.372s
-1.01x -15.616µs | Schedule/Ticket/tasks:512/threads:8 | 1.146066ms | 1.13045ms
-1.02x -1.90551ms | Schedule/EventBaton/tasks:262144/threads:4 | 97.675636ms | 95.770126ms
-1.03x -47.992µs | Schedule/Event/tasks:512/threads:32 | 1.568828ms | 1.520836ms
-1.03x -406.474µs | Schedule/Empty/tasks:262144/threads:0 | 12.695743ms | 12.289269ms
-1.04x -53.122µs | Schedule/Event/tasks:512/threads:128 | 1.531908ms | 1.478786ms
-1.04x -3.788623ms | Schedule/EventBaton/tasks:262144/threads:128 | 99.03754ms | 95.248917ms
-1.05x -1.94971ms | Schedule/Empty/tasks:262144/threads:1 | 39.77893ms | 37.82922ms
-1.10x -93.617µs | Schedule/Ticket/tasks:512/threads:64 | 1.05547ms | 961.853µs
-1.11x -73.189µs | Schedule/Event/tasks:512/threads:1 | 710.41µs | 637.221µs
-1.12x -15.692897ms | Schedule/Empty/tasks:262144/threads:2 | 151.811635ms | 136.118738ms
-1.17x -205.383µs | Schedule/Ticket/tasks:512/threads:2 | 1.43338ms | 1.227997ms
-1.18x -25.829246ms | Schedule/WaitGroup/tasks:262144/threads:64 | 169.490816ms | 143.66157ms
This was run on Ubuntu 22.04.2, AMD Ryzen Threadripper 3990X.
There's a bit of a concerning performance loss with the `Schedule/Empty/tasks:*` benchmarks.
Do you see any delta with these benchmarks?
I think you will find that the results of `Schedule/Empty/tasks:*` are highly variable from one run to the next. I believe the extreme variation comes from the non-deterministic way in which timing dictates the number of worker threads that actually wake up. It is possible for a single worker thread to process all of the empty tasks, no matter how many threads are available; this would happen if it could consume them as fast as the main thread could produce them. Similarly, it is possible that only two workers must wake in order to keep up with task production. In the worst case, all of the threads wake up and most of them spin for work most of the time. On a hyperthreaded CPU, you can even end up with one thread spinning for work on the same core as the main (producer) thread. It is also possible for that many spinning workers to cause the CPU to start thermal throttling. Whatever the factors may be, my observation is that performance varies greatly from one run to the next.
I ran the benchmarks on a 32-core (64-thread) AMD Threadripper PRO 5975WX. Here are the results from 3 back-to-back runs with the original code:
---------------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------
Schedule/Empty/tasks:262144/threads:0 21343721 ns 21139706 ns 34
Schedule/Empty/tasks:262144/threads:1 41795427 ns 41666667 ns 15
Schedule/Empty/tasks:262144/threads:2 48099814 ns 47991071 ns 14
Schedule/Empty/tasks:262144/threads:4 156185475 ns 156250000 ns 4
Schedule/Empty/tasks:262144/threads:8 139947117 ns 140625000 ns 6
Schedule/Empty/tasks:262144/threads:16 142485000 ns 140625000 ns 4
Schedule/Empty/tasks:262144/threads:32 143859900 ns 144531250 ns 4
Schedule/Empty/tasks:262144/threads:64 143889820 ns 143750000 ns 5
Schedule/Empty/tasks:262144/threads:0 21214815 ns 21139706 ns 34
Schedule/Empty/tasks:262144/threads:1 43976994 ns 44117647 ns 17
Schedule/Empty/tasks:262144/threads:2 67486189 ns 67708333 ns 9
Schedule/Empty/tasks:262144/threads:4 155623150 ns 152343750 ns 4
Schedule/Empty/tasks:262144/threads:8 137090983 ns 138020833 ns 6
Schedule/Empty/tasks:262144/threads:16 148879560 ns 150000000 ns 5
Schedule/Empty/tasks:262144/threads:32 145914740 ns 143750000 ns 5
Schedule/Empty/tasks:262144/threads:64 161941525 ns 160156250 ns 4
Schedule/Empty/tasks:262144/threads:0 19763218 ns 19761029 ns 34
Schedule/Empty/tasks:262144/threads:1 30788527 ns 30539773 ns 22
Schedule/Empty/tasks:262144/threads:2 38317411 ns 38194444 ns 18
Schedule/Empty/tasks:262144/threads:4 159280840 ns 159375000 ns 5
Schedule/Empty/tasks:262144/threads:8 147584650 ns 145833333 ns 6
Schedule/Empty/tasks:262144/threads:16 138205200 ns 137500000 ns 5
Schedule/Empty/tasks:262144/threads:32 135730400 ns 134375000 ns 5
Schedule/Empty/tasks:262144/threads:64 134153200 ns 135416667 ns 6
Now, here are three example runs after my proposed changes:
---------------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------
Schedule/Empty/tasks:262144/threads:0 20070584 ns 20270270 ns 37
Schedule/Empty/tasks:262144/threads:1 53956900 ns 53125000 ns 10
Schedule/Empty/tasks:262144/threads:2 42398021 ns 42410714 ns 14
Schedule/Empty/tasks:262144/threads:4 41524213 ns 41015625 ns 16
Schedule/Empty/tasks:262144/threads:8 63105473 ns 63920455 ns 11
Schedule/Empty/tasks:262144/threads:16 109079417 ns 106770833 ns 6
Schedule/Empty/tasks:262144/threads:32 139433360 ns 140625000 ns 5
Schedule/Empty/tasks:262144/threads:64 156029750 ns 156250000 ns 4
Schedule/Empty/tasks:262144/threads:0 20379318 ns 20220588 ns 34
Schedule/Empty/tasks:262144/threads:1 39143360 ns 39062500 ns 20
Schedule/Empty/tasks:262144/threads:2 63923360 ns 64062500 ns 10
Schedule/Empty/tasks:262144/threads:4 44248065 ns 44117647 ns 17
Schedule/Empty/tasks:262144/threads:8 58330253 ns 58333333 ns 15
Schedule/Empty/tasks:262144/threads:16 67999227 ns 68181818 ns 11
Schedule/Empty/tasks:262144/threads:32 76331433 ns 76388889 ns 9
Schedule/Empty/tasks:262144/threads:64 75278278 ns 76388889 ns 9
Schedule/Empty/tasks:262144/threads:0 19727641 ns 19425676 ns 37
Schedule/Empty/tasks:262144/threads:1 34787335 ns 34375000 ns 20
Schedule/Empty/tasks:262144/threads:2 53674570 ns 53125000 ns 10
Schedule/Empty/tasks:262144/threads:4 41981969 ns 41992188 ns 16
Schedule/Empty/tasks:262144/threads:8 61302036 ns 61079545 ns 11
Schedule/Empty/tasks:262144/threads:16 107162883 ns 106770833 ns 6
Schedule/Empty/tasks:262144/threads:32 137468780 ns 137500000 ns 5
Schedule/Empty/tasks:262144/threads:64 158588650 ns 156250000 ns 4
Before my changes, the timing of the 64-thread case varied between 134ms and 162ms over 3 runs. After my changes, it varied between 75ms and 159ms over 3 runs.
If there is a meaningful difference in performance before/after, then I'm afraid it is lost in the noise.
Even after my proposed changes, we still sometimes have problems with slow tasks being processed on the same thread while other threads sleep. There are multiple ways for this to happen:
1) Suppose two threads are spinning for work. You schedule a job to the first worker. It is stolen by the second worker. Then you schedule a job to the second worker. That 2nd job won't start running until the first worker steals it. Hopefully that will happen quickly, but it depends on the random behavior of the job-stealing code. If there are many threads in the pool, then most attempts to steal a job will fail because most of the threads have none. The effectiveness of job stealing decreases as the number of threads increases. In the worst case the first thread may run out of time and give up, leaving the second thread to perform both tasks, no matter how long they take.
2) The `enqueue` function may fail to find a spinning worker. This can happen if there are no spinning workers, or if the `try_lock` happened to fail. In either case, it will fall back on the round-robin approach, which can select a worker that is already awake and doing work. In that case, that single thread will execute multiple tasks serially. In theory the waiting task could be stolen, but only if there are spinning workers. If all other workers are either doing work or asleep, then the task will not be stolen.
I wonder what you would suggest regarding the situations above. It seems like #1 might be fixed if there were a way for the `enqueue` function to assign a task to a spinning worker and guarantee that it would be run on that worker (not be stolen the instant it was assigned).

One possible solution to #2 would be to pre-populate the spinning worker queue with all of the worker indices. When the `enqueue` function removes a non-negative index from the queue, it could guarantee that it will either use the index or put it back on the queue. In other words, we would not lose track of a spinning worker just because `try_lock` failed. If these things were true, then the `enqueue` function would only fail to find a spinning worker if all workers were awake and doing work, right? Note that all sleeping workers would be in the queue as well as spinning workers.
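To make the take-or-put-back idea concrete, here is a rough sketch of what such a pre-populated queue could look like. This is purely illustrative: the class name, method names, and the use of a plain mutex are my own inventions, not marl's actual `spinningWorkers` structure.

```cpp
#include <mutex>
#include <optional>
#include <vector>

// Hypothetical illustration of the proposal above: a queue that starts out
// holding every worker index, and from which the caller either uses the index
// it pops or returns it, so no spinning/sleeping worker is ever "lost".
class SpinningWorkerQueue {
 public:
  explicit SpinningWorkerQueue(int workerCount) {
    for (int i = 0; i < workerCount; i++) {
      indices.push_back(i);
    }
  }

  // Pop a candidate worker index, or nullopt if every worker is currently
  // claimed (i.e. awake and doing work).
  std::optional<int> take() {
    std::lock_guard<std::mutex> lock(mutex);
    if (indices.empty()) {
      return std::nullopt;
    }
    int idx = indices.back();
    indices.pop_back();
    return idx;
  }

  // Called when the caller could not use the index (e.g. the worker's
  // try_lock failed): put it back so the worker is not forgotten.
  void putBack(int idx) {
    std::lock_guard<std::mutex> lock(mutex);
    indices.push_back(idx);
  }

 private:
  std::mutex mutex;
  std::vector<int> indices;
};
```

The enqueue path would then `take()` an index, attempt `try_lock` on that worker, and call `putBack()` if the lock attempt fails, so the worker remains discoverable for the next task.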
Do you understand the problems described above? What do you think of the possible solutions I listed? Do you have other suggestions? I may try some of these things myself if I can find the time. If I find solutions that don't hurt the existing benchmarks, then I may prepare a follow-up PR in the future. I like the features provided by marl, but I am going to need solutions to these problems if I am to continue using marl for my current project.
I should clarify that my AMD benchmark results came from a Windows computer. When you run the tests, are the results consistent from one run to the next?
For Linux, yes, there was a consistent slowdown for the `Schedule/Empty/tasks` benchmarks.
I also see a slowdown on macOS, but it is less severe. The Windows results do look evenly noisy.
Let's just go for it. Those profile screenshots are compelling.
Kokoro failures are unrelated. Landing.
Thank you very much for your excellent contribution!
This pull request significantly improves performance in some cases, and it addresses the issue discussed here: https://github.com/google/marl/issues/232
The Problem:
A single worker will often do multiple tasks serially while other workers are idle. The tasks do not need to be short: a single worker could do two tasks that take multiple seconds each while other workers continue to sleep. This problem is especially common with the fork-and-join pattern. Suppose you wait for N workers to finish their last task, so that you now have N workers that are spinning for work. Then suppose you enqueue N new tasks. The problem is that some of the workers are likely to go to sleep prematurely, while the N tasks are performed by fewer than N worker threads.
The Results:
This problem can be demonstrated with the following example code, which I instrumented using the Tracy profiler (https://github.com/wolfpld/tracy):
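(The exact instrumented snippet isn't reproduced here; the following is a minimal sketch of the same kind of fork-and-join workload, using marl's public `Scheduler`, `WaitGroup`, and `schedule` API, with the Tracy zones omitted.)

```cpp
#include "marl/defer.h"
#include "marl/scheduler.h"
#include "marl/waitgroup.h"

#include <chrono>
#include <thread>

int main() {
  // 8 worker threads, matching the scenario described below.
  marl::Scheduler::Config cfg;
  cfg.setWorkerThreadCount(8);
  marl::Scheduler scheduler(cfg);
  scheduler.bind();
  defer(scheduler.unbind());

  for (int iteration = 0; iteration < 100; iteration++) {
    // Fork: enqueue 8 tasks that each take a noticeable amount of time.
    marl::WaitGroup wg(8);
    for (int i = 0; i < 8; i++) {
      marl::schedule([=] {
        defer(wg.done());
        std::this_thread::sleep_for(std::chrono::milliseconds(10));  // "slow" task
      });
    }
    // Join: wait for all 8 tasks before enqueuing the next batch.
    wg.wait();
  }
  return 0;
}
```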
A typical run of this code on my 32-core AMD Threadripper looks like this:
In this example, we had 8 worker threads and we enqueued 8 tasks at a time. Ideally each task would run on a different worker thread so that they all execute concurrently. However, as you can see from the above image, a single worker often ends up doing multiple tasks while other worker(s) are idle. Here is what it looks like after applying the changes in this PR:
The difference in performance is also seen in the new benchmark, which is included in this PR. Here are the results for my 32-core AMD Threadripper:
And here are the benchmark results measured on my MacBook Pro with a 10-core M1 Max:
Explanation of the Problem:

1) There were two race conditions in the `spinForWork` function. While spinning, it might discover that `(work.num > 0)`, indicating that new work was assigned to it. It would then stop spinning, then try to lock the mutex, then discover that `(work.num == 0)` because another spinning worker stole the task from it. With no tasks to perform, the thread would then go to sleep. It should have continued spinning until it had a task (that could not be stolen from it), or until the timeout. Meanwhile, the `enqueue` function may have assigned a task to the same worker that just stole a task. That worker will end up with two tasks and the other worker will be asleep.

2) The second race condition was similar. While spinning for work, one thread might successfully steal work from another. It would then stop spinning, then lock the mutex, then discover that `(work.num == 0)` because the work was stolen from it. With no work, the thread would then go to sleep. Again, it should have continued spinning until it had work or until the timeout. I have observed that it is actually quite common for a task to be stolen multiple times in a row when there are several spinning workers.

3) It is also possible for the `enqueue` function to assign a new task to a thread that already has work to do. This can happen while other worker threads are asleep. If there are no spinning workers, then nobody can steal the extra task(s). This is another situation where one worker may spend a long time working on serial tasks while other workers sleep.
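For illustration, here is a rough sketch of the spin-then-lock pattern described in problems 1 and 2. This is not marl's actual Worker code; the member names and synchronization details are approximations of the behavior described above.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

// Illustrative only: approximates the old behavior described above.
struct Work {
  std::mutex mutex;
  std::condition_variable added;
  std::atomic<int> num{0};  // number of tasks queued for this worker
};

void waitForWork(Work& work) {
  // Spin until work appears to be available (timeout handling omitted).
  while (work.num == 0) {
    // ...may also attempt to steal tasks from other workers here...
  }

  // The spin loop has exited, but the lock is not yet held. In this gap,
  // another spinning worker can steal the task that was just observed.
  std::unique_lock<std::mutex> lock(work.mutex);

  if (work.num == 0) {
    // Race lost: the task vanished while the lock was being acquired.
    // The old code went to sleep here instead of resuming the spin, even
    // though enqueue() may be about to assign more work to the thief.
    work.added.wait(lock, [&] { return work.num > 0; });
  }
}
```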
Solutions:

1) I replaced `spinForWork` with `spinForWorkAndLock`. As the name suggests, it now acquires the mutex lock before returning. If it detects that `(work.num > 0)`, or if it successfully steals a task, then it will return while holding the lock. This guarantees that the thread will not stop spinning until it has a task that cannot be stolen, or until the timeout is reached.

2) To fully address problem 3, the `enqueue` function would need to do more work to ensure that it has found the best thread for the new task. That would add cost, so I didn't do it. Instead, I increased the size of the `spinningWorkers` queue. Previously, it would only remember the last 8 workers that started spinning. Now, the `spinningWorkers` queue can be as large as the number of worker threads. Thus, if N threads start spinning, then the `spinningWorkers` queue will have N unique thread indices. The next N tasks that are enqueued will attempt to use those spinning workers. I believe this results in a better distribution of work than using `nextEnqueueIndex`. Unfortunately, this solution is not bulletproof. For example, the `spinningWorkers` queue might point to a worker that is not actually spinning, because that worker might have successfully stolen a task. Or, the `try_lock` might fail, causing the `enqueue` function to fall back on `nextEnqueueIndex`.