Can you also link to where the game_of_life sources are available? This would be useful for reproduction.
I've fixed the issue causing the extra minor GCs in a recent commit: https://github.com/ocaml-multicore/domainslib/commit/aca9ea9d13539e41c4d05a144c4ac4141bce221d. However, I still notice a performance difference similar to the earlier one. Can you confirm this?
game_of_life (with task api)
game_of_life (with channels)
The above code contains print statements; when comparing the time taken by each version, I've omitted the printing.
I haven't reproduced it with the updated version of `recv_poll`, but I'll do it asap.
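For readers who don't have the linked sources handy, here is a rough sketch of how a channel-based version of the row update can be structured, to contrast with the Task-based `parallel_for` style. This is illustrative only, not the benchmark's actual code: `update_row`, the message type, and the band decomposition are made up for the example.

```ocaml
(* Master/worker over Domainslib.Chan: the main domain hands out bands of
   rows, workers update them and acknowledge on a response channel. *)
module C = Domainslib.Chan

type message =
  | Rows of int * int    (* inclusive range of rows to update *)
  | Quit

let num_workers = 8
let board_size = 256

let update_row _row = ()  (* placeholder for the real next-generation kernel *)

let request : message C.t = C.make_bounded num_workers
let response : unit C.t = C.make_bounded num_workers

let rec worker () =
  match C.recv request with
  | Rows (lo, hi) ->
      for r = lo to hi do update_row r done;
      C.send response ();
      worker ()
  | Quit -> ()

let one_generation () =
  let band = board_size / num_workers in
  for w = 0 to num_workers - 1 do
    C.send request (Rows (w * band, (w + 1) * band - 1))
  done;
  (* wait until every worker has finished its band *)
  for _ = 1 to num_workers do C.recv response done

let () =
  let domains = Array.init num_workers (fun _ -> Domain.spawn worker) in
  one_generation ();
  for _ = 1 to num_workers do C.send request Quit done;
  Array.iter Domain.join domains
```

The Task-based variant replaces the explicit channels and per-generation synchronisation with a single parallel loop over the rows (see the sketch of the newer API further down).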
Time pre aca9ea9:

| Domains | Channels | Tasks |
|---|---|---|
| 1 | 50.486 | 50.612 |
| 2 | 25.355 | 25.359 |
| 4 | 12.763 | 12.742 |
| 8 | 6.472 | 6.441 |
| 12 | 4.416 | 4.502 |
| 16 | 3.376 | 3.322 |
| 20 | 2.794 | 2.842 |
| 24 | 2.357 | 3.02 |
Time post aca9ea9:

| Domains | Channels | Tasks |
|---|---|---|
| 1 | 45.187 | 44.37 |
| 2 | 23.105 | 22.286 |
| 4 | 11.663 | 11.201 |
| 8 | 5.952 | 5.668 |
| 12 | 4.078 | 3.976 |
| 16 | 3.102 | 2.927 |
| 20 | 2.534 | 2.513 |
| 24 | 2.165 | 2.67 |
The aca9ea9 commit brings the channels' and tasks' running times roughly to par. The 24-domain case is the only exception as far as I can see.
Channels:
Task:
The only difference I see is that the channels version's Process 0 handles interrupts, while the task implementation's doesn't.
Ok. This is quite useful. Is there any difference in minor/major allocations/collections? It is not ideal that domain 0 is up to something in the task library at 24 domains. Any insights on how to improve this?
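One straightforward way to check the allocation question (a sketch only; this is not what produced the GC numbers further down, and under multicore the counters may be per-domain or partially aggregated depending on the runtime version) is to print `Gc.quick_stat` at the end of each run:

```ocaml
(* Print allocation/collection counters so the two versions can be compared.
   Treat the numbers as indicative; multicore runtimes report them slightly
   differently across versions. *)
let report_gc () =
  let s = Gc.quick_stat () in
  Printf.printf
    "minor_words: %.0f promoted_words: %.0f major_words: %.0f \
     minor_collections: %d major_collections: %d\n"
    s.Gc.minor_words s.Gc.promoted_words s.Gc.major_words
    s.Gc.minor_collections s.Gc.major_collections

let () =
  (* run_the_benchmark () would go here *)
  report_gc ()
```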
I took a brief look at what's going on using perf
Task-based one:
$ sudo perf stat -d -d -d taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/life.exe 8 32 > /dev/null
Performance counter stats for 'taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/life.exe 8 32':
6451.373141 task-clock (msec) # 7.024 CPUs utilized
278 context-switches # 0.043 K/sec
7 cpu-migrations # 0.001 K/sec
4,762 page-faults # 0.738 K/sec
14,10,37,73,143 cycles # 2.186 GHz (61.81%)
10,92,49,34,212 instructions # 0.77 insn per cycle (69.69%)
1,95,94,05,759 branches # 303.719 M/sec (69.70%)
34,82,03,068 branch-misses # 17.77% of all branches (69.67%)
4,20,22,61,296 L1-dcache-loads # 651.375 M/sec (69.69%)
1,69,14,320 L1-dcache-load-misses # 0.40% of all L1-dcache hits (69.78%)
2,28,921 LLC-loads # 0.035 M/sec (69.49%)
14,994 LLC-load-misses # 6.55% of all LL-cache hits (69.18%)
<not supported> L1-icache-loads
8,87,531 L1-icache-load-misses (68.80%)
4,21,38,41,980 dTLB-loads # 653.170 M/sec (68.40%)
60,833 dTLB-load-misses # 0.00% of all dTLB cache hits (60.81%)
3,85,245 iTLB-loads # 0.060 M/sec (61.15%)
8,306 iTLB-load-misses # 2.16% of all iTLB cache hits (61.51%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
0.918466448 seconds time elapsed
Channel-based one:
$ sudo perf stat -d -d -d taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/life_chan.exe 8 32 > /dev/null
Performance counter stats for 'taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/life_chan.exe 8 32':
5913.120151 task-clock (msec) # 6.956 CPUs utilized
263 context-switches # 0.044 K/sec
7 cpu-migrations # 0.001 K/sec
4,769 page-faults # 0.807 K/sec
12,92,58,46,442 cycles # 2.186 GHz (61.07%)
10,84,83,09,319 instructions # 0.84 insn per cycle (68.80%)
1,96,65,42,992 branches # 332.573 M/sec (68.88%)
35,05,69,980 branch-misses # 17.83% of all branches (68.91%)
4,16,47,99,597 L1-dcache-loads # 704.332 M/sec (68.91%)
1,70,57,139 L1-dcache-load-misses # 0.41% of all L1-dcache hits (69.36%)
2,25,617 LLC-loads # 0.038 M/sec (69.59%)
22,118 LLC-load-misses # 9.80% of all LL-cache hits (69.58%)
<not supported> L1-icache-loads
6,55,354 L1-icache-load-misses (69.65%)
4,17,28,41,970 dTLB-loads # 705.692 M/sec (69.57%)
17,932 dTLB-load-misses # 0.00% of all dTLB cache hits (61.53%)
3,76,332 iTLB-loads # 0.064 M/sec (61.52%)
29,448 iTLB-load-misses # 7.83% of all iTLB cache hits (61.44%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
0.850127551 seconds time elapsed
The task-based one executes fewer instructions per cycle. Wonder where the stalls might be? @sadiqj
> Ok. This is quite useful. Is there any difference in minor/major allocations/collections? It is not ideal that domain 0 is up to something in the task library at 24 domains. Any insights on how to improve this?

Sorry 😅️, I have no insights on improving this.
Here's an oddity: I'm running domainslib 0.2, which doesn't have https://github.com/ocaml-multicore/domainslib/commit/aca9ea9d13539e41c4d05a144c4ac4141bce221d, and yet I see very little difference in performance between task and chan on a dual 36-core machine:
time ./game_of_life_multicore_task.exe 4 256 > /dev/null
real 0m7.614s
user 0m29.458s
sys 0m0.036s
--
time ./game_of_life_multicore_chan.exe 4 256 > /dev/null
real 0m7.347s
user 0m28.455s
sys 0m0.029s
--
time ./game_of_life_multicore_task.exe 8 256 > /dev/null
real 0m4.167s
user 0m31.413s
sys 0m0.052s
--
time ./game_of_life_multicore_chan.exe 8 256 > /dev/null
real 0m4.103s
user 0m31.130s
sys 0m0.052s
--
time ./game_of_life_multicore_task.exe 16 256 > /dev/null
real 0m2.595s
user 0m36.173s
sys 0m0.064s
--
time ./game_of_life_multicore_chan.exe 16 256 > /dev/null
real 0m2.656s
user 0m36.330s
sys 0m0.049s
--
time ./game_of_life_multicore_task.exe 32 256 > /dev/null
real 0m2.084s
user 0m39.834s
sys 0m0.133s
--
time ./game_of_life_multicore_chan.exe 32 256 > /dev/null
real 0m2.063s
user 0m39.497s
sys 0m0.104s
--
time ./game_of_life_multicore_task.exe 48 256 > /dev/null
real 0m1.726s
user 0m48.704s
sys 0m0.146s
--
time ./game_of_life_multicore_chan.exe 48 256 > /dev/null
real 0m1.761s
user 0m47.047s
sys 0m0.152s
--
time ./game_of_life_multicore_task.exe 64 256 > /dev/null
real 0m1.593s
user 0m56.907s
sys 0m0.146s
--
time ./game_of_life_multicore_chan.exe 64 256 > /dev/null
real 0m1.545s
user 0m52.133s
sys 0m0.145s
@sadiqj that is encouraging but odd. I could very well believe that the slowdown that we witness only occurs on our machines. We may dig in a little to remove any potential overheads.
Hi, I noticed this issue as well. Just for fun, I built a simple ray tracer and parallelized it using domainslib. As long as I use no more than six domains for rendering (plus one domain to tick the console progress bar), things work just fine. As soon as I ask for seven rendering domains, the program slows down significantly.
Concretely, I can generate my reference image in ~3s using a purely sequential run. If I use 2 domains, I obtain results in just under ~2s. With six domains, I get down to ~1.7s.
But with seven rendering domains, the time jumps an order of magnitude: 18s. If I ask for eight rendering domains, the time jumps to 35s. But then, interestingly, it seems to flatten out. I've tried asking for as many as 20 domains, and the running time hovers around 35s.
Fwiw, I am running this on a Windows laptop (4 physical cores / 8 logical) using Debian inside of WSL2. As a result, my ability to see exactly what is going on is limited (e.g., I don't think I can use perf inside WSL2, but I'm not totally confident).
Anyways, I just thought I'd share on the off chance that this turns out to be helpful. Happy to try to provide more info if needed!
Actually, I was able to get perf going in WSL2. Using `time perf record -g -- _build/default/bin/main.exe -max-threads 7`, the elapsed time is about 18 seconds, and `perf report` shows this at the top:
- 34.78% 34.78% main.exe main.exe [.] caml_try_run_on_all_domains_with_spin_work
- 21.66% 21.66% main.exe main.exe [.] caml_gc_log
- 18.31% 0.00% main.exe [unknown] [.] 0x00000000000004fd
- 16.94% 0.00% main.exe [unknown] [.] 0x1074c08548fffff9
- 15.27% 0.00% main.exe [unknown] [.] 0x00000000000007fd
- 15.27% 0.00% main.exe [unknown] [.] 0x000055844f643238
- 12.13% 12.13% main.exe main.exe [.] major_collection_slice
- 5.29% 5.29% main.exe main.exe [.] caml_try_stw_empty_minor_heap_on_all_domains
- 4.18% 4.18% main.exe main.exe [.] caml_empty_minor_heaps_once
- 3.21% 3.21% main.exe main.exe [.] caml_try_run_on_all_domains
- 3.16% 3.16% main.exe main.exe [.] caml_adopt_orphaned_work
- 2.12% 2.12% main.exe main.exe [.] caml_final_update_first
Whereas, if I run `time perf record -g -- _build/default/bin/main.exe -max-threads 6`, the elapsed time is only 1.8 seconds, and `perf report` shows:
- 38.66% 0.00% main.exe [unknown] [.] 0x00000000000004fd
- 33.62% 0.00% main.exe [unknown] [.] 0x00000000000007fd
- 33.62% 0.00% main.exe [unknown] [.] 0x00005566d1ee6238
- 13.98% 13.98% main.exe main.exe [.] caml_curry2
- 11.77% 11.77% main.exe main.exe [.] camlPath_tracerAffinefun_3663 ... elided ...
- 2.24% 2.23% main.exe main.exe [.] caml_try_run_on_all_domains_with_spin_work
I'm still not sure what the underlying story is, but hopefully this points in the right direction?
Domainslib has evolved quite a bit since this issue was reported for the Game of Life benchmark. Notably, the task library now has a Chase-Lev work-stealing deque underneath for scheduling tasks (#29) and uses effect handlers to manage task creation (#51).
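For illustration, here is roughly what the Task-based structure looks like against the newer, effects-based API (a sketch, not the benchmark itself; note that the label passed to `setup_pool` has changed name across releases, `~num_additional_domains` in 0.4.x and `~num_domains` in later versions):

```ocaml
(* Iterations of parallel_for are scheduled on the pool's work-stealing
   deques; with the effects-based API the loop must run inside Task.run. *)
module T = Domainslib.Task

let board_size = 256
let update_row _row = ()  (* placeholder for the real kernel *)

let () =
  let pool = T.setup_pool ~num_additional_domains:7 () in
  T.run pool (fun () ->
      (* one generation *)
      T.parallel_for pool ~start:0 ~finish:(board_size - 1)
        ~body:(fun r -> update_row r));
  T.teardown_pool pool
```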
I ran the same benchmarks again on domainslib 0.4.0; here are the numbers for 256 2048:

| Cores | Task | Chan |
|---|---|---|
| 1 | 49.19 | 60.06 |
| 2 | 25.94 | 36.19 |
| 4 | 12.86 | 23.38 |
| 8 | 6.72 | 17.32 |
| 12 | 4.68 | 15.42 |
| 16 | 3.65 | 14.44 |
| 20 | 3 | 14.05 |
| 24 | 2.7 | 13.42 |
I think it's safe to say we no longer witness a slowdown at higher core counts, hence closing this issue.
@dalev thanks for the inputs. Since your machine has 4 physical cores / 8 logical threads, running more than 8 domains is bound to slow down the program. For best results, the number of domains should ideally equal the number of physical cores. You might be interested in https://github.com/ocaml-multicore/ocaml-multicore/wiki/Concurrency-and-parallelism-design-notes. I'd suggest trying your benchmark again on the latest version of domainslib. Feel free to open a new issue if needed.
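As a rough sketch of that advice (assuming OCaml 5, where `Domain.recommended_domain_count` is available; the halving heuristic for SMT machines is mine, not something domainslib requires):

```ocaml
(* Cap the requested pool size by what the machine can usefully run in
   parallel. recommended_domain_count usually reports hardware threads;
   dividing by two approximates physical cores on a 2-way SMT machine. *)
let pool_size ~requested ~smt =
  let hw = Domain.recommended_domain_count () in
  let cores = if smt then max 1 (hw / 2) else hw in
  min requested cores

let () =
  Printf.printf "using %d domains\n" (pool_size ~requested:20 ~smt:true)
```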
Below are two examples; each shows the Channels run followed by the Tasks run.
The first one uses 8 domains.
GC stats Minor words: 93193 Promoted words: 462 Major words: 2101822 Minor collections: 24u Major collections: 6
OCAMLRUNPARAM="S=0x400" taskset --cpu-list 2-13 chrt -r 1 _build/default /game_of_life_multicore_task.exe 8 256
GC stats Minor words: 95355 Promoted words: 302 Major words: 2101662 Minor collections: 23u Major collections: 6
The second one uses 12 domains.
GC stats Minor words: 144641 Promoted words: 614 Major words: 2101990 Minor collections: 23u Major collections: 6
OCAMLRUNPARAM="S=0x400" taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/game_of_life_multicore_task.exe 12 256
GC stats Minor words: 7690302 Promoted words: 4392 Major words: 2105768 Minor collections: 57u Major collections: 6
The numbers and the trace file show a different distribution of minor GC calls when the task library is used with more than 8 domains (e.g., 12). (The picture doesn't show all the processes, but the behaviour looks similar across all of them.)