Task library slowdown if the number of domains is greater than 8

shubhamkumar13 commented 4 years ago

Below are 2 examples, both run Channels' and Tasks' respectively in order

The first one uses 8 domains.
- Channels
```
OCAMLRUNPARAM="S=0x400" taskset --cpu-list 2-13 chrt -r 1 _build/default/game_of_life_multicore.exe 8 256
```
GC stats Minor words: 93193 Promoted words: 462 Major words: 2101822 Minor collections: 24u Major collections: 6
```
![Screenshot from 2020-04-28 22-18-36](https://user-images.githubusercontent.com/13629677/80514639-67f26b80-899e-11ea-9bbb-2be271209682.png)

+ Tasks
```
OCAMLRUNPARAM="S=0x400" taskset --cpu-list 2-13 chrt -r 1 _build/default /game_of_life_multicore_task.exe 8 256

GC stats Minor words: 95355 Promoted words: 302 Major words: 2101662 Minor collections: 23u Major collections: 6
```
![Screenshot from 2020-04-28 22-18-03](https://user-images.githubusercontent.com/13629677/80514612-5e690380-899e-11ea-91c5-a8910552d4c5.png)
```
The second one uses 12 domains
- Channels
```
OCAMLRUNPARAM="S=0x400" taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/game_of_life_multicore.exe 12 256 
```
GC stats Minor words: 144641 Promoted words: 614 Major words: 2101990 Minor collections: 23u Major collections: 6
```
![Screenshot from 2020-04-28 22-10-21](https://user-images.githubusercontent.com/13629677/80513978-7db36100-899d-11ea-9c50-fbf668cc0375.png)

+ Tasks
```
OCAMLRUNPARAM="S=0x400" taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/game_of_life_multicore_task.exe 12 256

GC stats Minor words: 7690302 Promoted words: 4392 Major words: 2105768 Minor collections: 57u Major collections: 6
```
![Screenshot from 2020-04-28 22-08-58](https://user-images.githubusercontent.com/13629677/80513626-f36afd00-899c-11ea-84e5-99af2831020d.png)
```

The numbers and the trace file have a different distribution of minor gc calls when one is using the task library and the number of domains is something greater than 8, like 12. (The picture doesn't reveal all the processes, but the behaviour looks similar in all the processes)

kayceesrk commented 4 years ago

Can you also link to where the game_of_life sources are available? This would be useful for reproduction.

I've fixed the issue with more minor GCs in a recent commit. https://github.com/ocaml-multicore/domainslib/commit/aca9ea9d13539e41c4d05a144c4ac4141bce221d. However, I still noticed similar performance difference as earlier. Can you confirm this?

shubhamkumar13 commented 4 years ago

game_of_life (with task api) game_of_life (with channels)

The above code contains print statements. When comparing the time taken by each code I've omitted the printing part.

I haven't reproduced it with the updated version of recv_poll, but I'll do it asap.

shubhamkumar13 commented 4 years ago

Time Pre `aca9ea9`	Domains	Channels
1	50.486	50.612
2	25.355	25.359
4	12.763	12.742
8	6.472	6.441
12	4.416	4.502
16	3.376	3.322
20	2.794	2.842
24	2.357	3.02

Time Post `aca9ea9`	Domains	Channels
1	45.187	44.37
2	23.105	22.286
4	11.663	11.201
8	5.952	5.668
12	4.078	3.976
16	3.102	2.927
20	2.534	2.513
24	2.165	2.67

The aca9ea9 commit makes channels' and tasks' running time at par. The case when the number of domains = 24 is the only exception as far as I can see.

Channels :

Screenshot from 2020-05-01 17-03-51

Task :

Screenshot from 2020-05-01 17-04-13

The only difference I see is that the channels' Process 0 handles interrupt while the task's implementation doesn't

kayceesrk commented 4 years ago

Ok. This is quite useful. Is there any difference in minor/major allocation / collections? It is not ideal that domain 0 is up to something in the task library at 24 domains. Any insights on how to improve this?

kayceesrk commented 4 years ago

I took a brief look at what's going on using perf

Task based one:

$ sudo perf stat -d -d -d taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/life.exe 8 32 > /dev/null

 Performance counter stats for 'taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/life.exe 8 32':

       6451.373141      task-clock (msec)         #    7.024 CPUs utilized          
               278      context-switches          #    0.043 K/sec                  
                 7      cpu-migrations            #    0.001 K/sec                  
             4,762      page-faults               #    0.738 K/sec                  
   14,10,37,73,143      cycles                    #    2.186 GHz                      (61.81%)
   10,92,49,34,212      instructions              #    0.77  insn per cycle           (69.69%)
    1,95,94,05,759      branches                  #  303.719 M/sec                    (69.70%)
      34,82,03,068      branch-misses             #   17.77% of all branches          (69.67%)
    4,20,22,61,296      L1-dcache-loads           #  651.375 M/sec                    (69.69%)
       1,69,14,320      L1-dcache-load-misses     #    0.40% of all L1-dcache hits    (69.78%)
          2,28,921      LLC-loads                 #    0.035 M/sec                    (69.49%)
            14,994      LLC-load-misses           #    6.55% of all LL-cache hits     (69.18%)
   <not supported>      L1-icache-loads                                             
          8,87,531      L1-icache-load-misses                                         (68.80%)
    4,21,38,41,980      dTLB-loads                #  653.170 M/sec                    (68.40%)
            60,833      dTLB-load-misses          #    0.00% of all dTLB cache hits   (60.81%)
          3,85,245      iTLB-loads                #    0.060 M/sec                    (61.15%)
             8,306      iTLB-load-misses          #    2.16% of all iTLB cache hits   (61.51%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

       0.918466448 seconds time elapsed

Channel based one:

$ sudo perf stat -d -d -d taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/life_chan.exe 8 32 > /dev/null

 Performance counter stats for 'taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/life_chan.exe 8 32':

       5913.120151      task-clock (msec)         #    6.956 CPUs utilized          
               263      context-switches          #    0.044 K/sec                  
                 7      cpu-migrations            #    0.001 K/sec                  
             4,769      page-faults               #    0.807 K/sec                  
   12,92,58,46,442      cycles                    #    2.186 GHz                      (61.07%)
   10,84,83,09,319      instructions              #    0.84  insn per cycle           (68.80%)
    1,96,65,42,992      branches                  #  332.573 M/sec                    (68.88%)
      35,05,69,980      branch-misses             #   17.83% of all branches          (68.91%)
    4,16,47,99,597      L1-dcache-loads           #  704.332 M/sec                    (68.91%)
       1,70,57,139      L1-dcache-load-misses     #    0.41% of all L1-dcache hits    (69.36%)
          2,25,617      LLC-loads                 #    0.038 M/sec                    (69.59%)
            22,118      LLC-load-misses           #    9.80% of all LL-cache hits     (69.58%)
   <not supported>      L1-icache-loads                                             
          6,55,354      L1-icache-load-misses                                         (69.65%)
    4,17,28,41,970      dTLB-loads                #  705.692 M/sec                    (69.57%)
            17,932      dTLB-load-misses          #    0.00% of all dTLB cache hits   (61.53%)
          3,76,332      iTLB-loads                #    0.064 M/sec                    (61.52%)
            29,448      iTLB-load-misses          #    7.83% of all iTLB cache hits   (61.44%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

       0.850127551 seconds time elapsed

The task-based one executes fewer instructions per cycle. Wonder where the stalls might be? @sadiqj

shubhamkumar13 commented 4 years ago

Ok. This is quite useful. Is there any difference in minor/major allocation / collections? It is not ideal that domain 0 is up to something in the task library at 24 domains. Any insights on how to improve this?

Sorry 😅️, I had no insights for improving this.

sadiqj commented 4 years ago

Here's an oddity, I'm running domainslib 0.2 which doesn't have https://github.com/ocaml-multicore/domainslib/commit/aca9ea9d13539e41c4d05a144c4ac4141bce221d and yet I see very little difference in performance between task and chan on a dual 36 core machine:

time ./game_of_life_multicore_task.exe 4 256 > /dev/null

real    0m7.614s
user    0m29.458s
sys 0m0.036s
--
time ./game_of_life_multicore_chan.exe 4 256 > /dev/null

real    0m7.347s
user    0m28.455s
sys 0m0.029s
--
time ./game_of_life_multicore_task.exe 8 256 > /dev/null

real    0m4.167s
user    0m31.413s
sys 0m0.052s
--
time ./game_of_life_multicore_chan.exe 8 256 > /dev/null

real    0m4.103s
user    0m31.130s
sys 0m0.052s
--
time ./game_of_life_multicore_task.exe 16 256 > /dev/null

real    0m2.595s
user    0m36.173s
sys 0m0.064s
--
time ./game_of_life_multicore_chan.exe 16 256 > /dev/null

real    0m2.656s
user    0m36.330s
sys 0m0.049s
--
time ./game_of_life_multicore_task.exe 32 256 > /dev/null

real    0m2.084s
user    0m39.834s
sys 0m0.133s
--
time ./game_of_life_multicore_chan.exe 32 256 > /dev/null

real    0m2.063s
user    0m39.497s
sys 0m0.104s
--
time ./game_of_life_multicore_task.exe 48 256 > /dev/null

real    0m1.726s
user    0m48.704s
sys 0m0.146s
--
time ./game_of_life_multicore_chan.exe 48 256 > /dev/null

real    0m1.761s
user    0m47.047s
sys 0m0.152s
--
time ./game_of_life_multicore_task.exe 64 256 > /dev/null

real    0m1.593s
user    0m56.907s
sys 0m0.146s
--
time ./game_of_life_multicore_chan.exe 64 256 > /dev/null

real    0m1.545s
user    0m52.133s
sys 0m0.145s

kayceesrk commented 4 years ago

@sadiqj that is encouraging but odd. I could very well believe that the slowdown that we witness only occurs on our machines. We may dig in a little to remove any potential overheads.

dalev commented 3 years ago

Hi, I noticed this issue as well. Just for fun, I built a simple ray tracer and parallelized it using domainslib. As long as I use no more than six domains for rendering (plus one domain to tick the console progress bar), things work just fine. As soon as I ask for seven rendering domains, the program slows down significantly.

Concretely, I can generate my reference image in ~3s using a purely sequential run. If I use 2 domains, I obtain results in just under ~2s. With six domains, I get down to ~1.7s.

But with seven rendering domains, the time jumps an order of magnitude: 18s. If I ask for eight rendering domains, the time jumps to 35s. But then, interestingly, it seems to flatten out. I've tried asking for as many as 20 domains, and the running time hovers around 35s.

Fwiw, I am running this on a Windows laptop (4 physical cores / 8 logical) using Debian inside of WSL2. As a result, my ability to see what exactly is going on is limited (e.g., I don't think I can use perf inside WSL2, but not totally confident).

Anyways, I just thought I'd share on the off chance that this turns out to be helpful. Happy to try to provide more info if needed!

dalev commented 3 years ago

Actually, I was able to get perf going in WSL2. Using time perf record -g -- _build/default/bin/main.exe -max-threads 7, the elapsed time is about 18 seconds, and perf report shows this at the top:

34.78% 34.78% main.exe main.exe [.] caml_try_run_on_all_domains_with_spin_work

21.66% 21.66% main.exe main.exe [.] caml_gc_log

18.31% 0.00% main.exe [unknown] [.] 0x00000000000004fd

16.94% 0.00% main.exe [unknown] [.] 0x1074c08548fffff9

15.27% 0.00% main.exe [unknown] [.] 0x00000000000007fd

15.27% 0.00% main.exe [unknown] [.] 0x000055844f643238

12.13% 12.13% main.exe main.exe [.] major_collection_slice

5.29% 5.29% main.exe main.exe [.] caml_try_stw_empty_minor_heap_on_all_domains

4.18% 4.18% main.exe main.exe [.] caml_empty_minor_heaps_once

3.21% 3.21% main.exe main.exe [.] caml_try_run_on_all_domains

3.16% 3.16% main.exe main.exe [.] caml_adopt_orphaned_work

2.12% 2.12% main.exe main.exe [.] caml_final_update_first

Whereas, if I run time perf record -g -- _build/default/bin/main.exe -max-threads 6, the elapsed time is only 1.8 seconds, and perf report shows:

38.66% 0.00% main.exe [unknown] [.] 0x00000000000004fd

33.62% 0.00% main.exe [unknown] [.] 0x00000000000007fd

33.62% 0.00% main.exe [unknown] [.] 0x00005566d1ee6238

13.98% 13.98% main.exe main.exe [.] caml_curry2

11.77% 11.77% main.exe main.exe [.] camlPath_tracerAffinefun_3663 ... elided ...

2.24% 2.23% main.exe main.exe [.] caml_try_run_on_all_domains_with_spin_work

I'm still not sure what the underlying story is, but hopefully this points in the right direction?

Sudha247 commented 2 years ago

Domainslib has evolved quite a bit since this issue was reported for game of life benchmark. Notably, the task library has a Chase-Lev work-stealing queue underneath for scheduling tasks (#29) and uses effect handlers to manage task creation (#51).

I ran the same benchmarks again on domainslib.0.4.0 and here are the numbers for `256 2048`:	Cores	Task
1	49.19	60.06
2	25.94	36.19
4	12.86	23.38
8	6.72	17.32
12	4.68	15.42
16	3.65	14.44
20	3	14.05
24	2.7	13.42

I think it's safe to say we no longer witness slowdown on higher number of cores, hence closing this issue.

@dalev thanks for the inputs. Since your machine contains 4 cores / 8 threads, having more than 8 domains is sure to slow down the program. Ideally, the number of domains = number of physical cores for best results. You might be interested in https://github.com/ocaml-multicore/ocaml-multicore/wiki/Concurrency-and-parallelism-design-notes. I'd suggest trying your benchmark again on the latest version of domainslib. Feel free to make a new issue if needed.

ocaml-multicore / domainslib

Task library slowdown if the number of domains is greater than 8 #8