Closed moazzammoriani closed 2 years ago
Another thing worth documenting here is the following. I encountered a strange bug involving parallel_for while working on the Domainslib adaptation of fannkuchredux_multicore. While working on the benchmark, I put a printf statement inside the parallel_for loop. When I tried to remove it during a refactoring, the program still compiled successfully but produced a segmentation fault at runtime.
I have no idea what is causing this. For anyone wishing to investigate, it is important to keep in mind that the current benchmark is compiled with the -noassert and -unsafe flags (in the dune file).
For now the printf statement is kept in so that the benchmark works.
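One possible starting point for that investigation, offered only as a sketch: -unsafe turns off bounds checking on array and string accesses, and -noassert drops assert checks, so an out-of-range index would go unreported and could corrupt memory instead of raising an exception. Nothing in this report confirms that this is the cause; it is an assumption. The helper below, with the hypothetical name checked_get, performs its own bounds check, so it keeps working even when the build uses both flags, and could be swapped in temporarily for suspicious array reads.

```ocaml
(* Hypothetical debugging helper, not part of the benchmark.
   It checks the index explicitly, so the check survives -unsafe,
   and it uses invalid_arg rather than assert, so it survives -noassert. *)
let checked_get (a : 'a array) (i : int) : 'a =
  if i < 0 || i >= Array.length a then
    invalid_arg
      (Printf.sprintf "checked_get: index %d out of bounds (length %d)"
         i (Array.length a))
  else
    (* The index has been validated above, so the unchecked read is safe. *)
    Array.unsafe_get a i

let () =
  let perm = [| 3; 1; 2 |] in
  (* In-range read. *)
  Printf.printf "perm.(1) = %d\n" (checked_get perm 1);
  (* Out-of-range read: reported as an exception instead of undefined behaviour. *)
  match checked_get perm 3 with
  | _ -> print_endline "unexpected: no exception"
  | exception Invalid_argument msg -> print_endline msg
```

Building once without -unsafe is another way to get the same information, since ordinary a.(i) accesses then raise Invalid_argument on a bad index.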
Full scaling of Domainslib from one core to twenty-four when n = 12:
moazzam@godel:~/benchmarks/fankuchredux$ ./run_all.sh _build/default/fr_domainslib.exe 12
++ exe=_build/default/fr_domainslib.exe
++ arg=12
++ for i in 1 2 4 8 12
++ taskset --cpu-list 2-13 chrt -r 1 _build/default/fr_domainslib.exe 1 12
3968050
Pfannkuchen(12) = 65
real 0m48.678s
user 0m48.673s
sys 0m0.004s
++ for i in 1 2 4 8 12
++ taskset --cpu-list 2-13 chrt -r 1 _build/default/fr_domainslib.exe 2 12
3968050
Pfannkuchen(12) = 65
real 0m24.574s
user 0m49.140s
sys 0m0.004s
++ for i in 1 2 4 8 12
++ taskset --cpu-list 2-13 chrt -r 1 _build/default/fr_domainslib.exe 4 12
3968050
Pfannkuchen(12) = 65
real 0m12.530s
user 0m49.350s
sys 0m0.008s
++ for i in 1 2 4 8 12
++ taskset --cpu-list 2-13 chrt -r 1 _build/default/fr_domainslib.exe 8 12
3968050
Pfannkuchen(12) = 65
real 0m6.211s
user 0m48.840s
sys 0m0.008s
++ for i in 1 2 4 8 12
++ taskset --cpu-list 2-13 chrt -r 1 _build/default/fr_domainslib.exe 12 12
3968050
Pfannkuchen(12) = 65
real 0m4.144s
user 0m48.900s
sys 0m0.004s
++ for i in 16 20 24
++ taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/fr_domainslib.exe 16 12
3968050
Pfannkuchen(12) = 65
real 0m3.187s
user 0m49.003s
sys 0m0.048s
++ for i in 16 20 24
++ taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/fr_domainslib.exe 20 12
3968050
Pfannkuchen(12) = 65
real 0m2.553s
user 0m49.141s
sys 0m0.008s
++ for i in 16 20 24
++ taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/fr_domainslib.exe 24 12
3968050
Pfannkuchen(12) = 65
real 0m2.105s
user 0m49.101s
sys 0m0.016s
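To make the scaling in the n = 12 runs easier to read, here is a small sketch that turns the "real" times above into speedup and parallel efficiency relative to the single-domain run. The times are copied from the log; the function names are made up for illustration. By this measure the 24-domain run is roughly 23 times faster than the 1-domain run.

```ocaml
(* Wall-clock ("real") times in seconds for n = 12, copied from the log above.
   The first component is the domain count passed to the executable. *)
let runs_n12 =
  [ (1, 48.678); (2, 24.574); (4, 12.530); (8, 6.211);
    (12, 4.144); (16, 3.187); (20, 2.553); (24, 2.105) ]

(* Speedup: how many times faster than the single-domain baseline. *)
let speedup ~baseline t = baseline /. t

(* Efficiency: speedup divided by the number of domains (1.0 is ideal). *)
let efficiency ~baseline ~domains t =
  speedup ~baseline t /. float_of_int domains

let () =
  let baseline = List.assoc 1 runs_n12 in
  List.iter
    (fun (domains, t) ->
      Printf.printf "%2d domains: speedup %5.2fx, efficiency %3.0f%%\n"
        domains
        (speedup ~baseline t)
        (100. *. efficiency ~baseline ~domains t))
    runs_n12
```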
Full scaling of Domainslib version for n = 13
moazzam@godel:~/benchmarks/fannkuchredux$ ./run_all.sh _build/default/fannkuchredux.exe 13
++ exe=_build/default/fannkuchredux.exe
++ arg=13
++ for i in 1 2 4 8 12
++ taskset --cpu-list 2-13 chrt -r 1 _build/default/fannkuchredux.exe 1 13
par_iter: 0
37172332
Pfannkuchen(13) = 80
real 11m44.528s
user 11m44.516s
sys 0m0.004s
++ for i in 1 2 4 8 12
++ taskset --cpu-list 2-13 chrt -r 1 _build/default/fannkuchredux.exe 2 13
par_iter: 1
par_iter: 0
37172332
Pfannkuchen(13) = 80
real 5m55.817s
user 11m51.619s
sys 0m0.004s
++ for i in 1 2 4 8 12
++ taskset --cpu-list 2-13 chrt -r 1 _build/default/fannkuchredux.exe 4 13
37172332
Pfannkuchen(13) = 80
real 2m59.397s
user 11m49.580s
sys 0m0.004s
++ for i in 1 2 4 8 12
++ taskset --cpu-list 2-13 chrt -r 1 _build/default/fannkuchredux.exe 8 13
37172332
Pfannkuchen(13) = 80
real 1m29.326s
user 11m46.051s
sys 0m0.008s
++ for i in 1 2 4 8 12
++ taskset --cpu-list 2-13 chrt -r 1 _build/default/fannkuchredux.exe 12 13
37172332
Pfannkuchen(13) = 80
real 1m1.000s
user 11m47.881s
sys 0m0.012s
++ for i in 16 20 24
++ taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/fannkuchredux.exe 16 13
37172332
Pfannkuchen(13) = 80
real 0m45.460s
user 11m46.408s
sys 0m0.020s
++ for i in 16 20 24
++ taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/fannkuchredux.exe 20 13
37172332
Pfannkuchen(13) = 80
real 0m36.996s
user 11m48.839s
sys 0m0.004s
++ for i in 16 20 24
++ taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/fannkuchredux.exe 24 13
37172332
Pfannkuchen(13) = 80
real 0m31.306s
user 11m48.656s
sys 0m0.008s
LGTM!
This is in response to #370. The benchmark linked in the aforementioned issue has been adapted to use Domainslib instead of the Unix module to achieve parallelism. There is little difference between the sequential and the parallel version, except that the parallel version uses parallel_for in one place instead of a regular for loop. When run on my own four-core machine with all four cores in use, the Unix version seems to perform better.
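For readers unfamiliar with Domainslib, the change from a regular for loop to parallel_for might look roughly like the sketch below. It is not the benchmark code: work is a hypothetical stand-in for the real loop body, and the setup assumes a recent Domainslib release where setup_pool takes ~num_domains and Task.run is available (older releases name the argument ~num_additional_domains).

```ocaml
(* Assumes OCaml 5 with the domainslib library installed. *)
module T = Domainslib.Task

(* Hypothetical stand-in for the real per-iteration work. *)
let work i = i * i

(* Sequential version: a regular for loop filling one slot per iteration. *)
let fill_sequential n =
  let out = Array.make n 0 in
  for i = 0 to n - 1 do
    out.(i) <- work i
  done;
  out

(* Parallel version: the same loop body handed to parallel_for.
   Each iteration writes a distinct slot, so iterations do not race. *)
let fill_parallel ~num_domains n =
  let out = Array.make n 0 in
  (* ~num_domains counts the domains spawned in addition to the caller,
     hence the "- 1" to get num_domains workers in total. *)
  let pool = T.setup_pool ~num_domains:(num_domains - 1) () in
  T.run pool (fun () ->
      (* Like "for i = start to finish", both bounds inclusive. *)
      T.parallel_for pool ~start:0 ~finish:(n - 1) ~body:(fun i ->
          out.(i) <- work i));
  T.teardown_pool pool;
  out

let () =
  let n = 1_000 in
  if fill_sequential n = fill_parallel ~num_domains:4 n then
    print_endline "sequential and parallel results agree"
```

The loop body stays the same in both versions; only the loop construct and the pool setup around it differ.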
(Note: the names of the executables that result from this PR have been modified in this GH discussion for the sake of readability. The output of the programs has also been cleaned up for readability.)
However, the version that uses Domainslib performs slightly better when it is given more cores to work with. Note that in the Domainslib version there is a one-to-one correspondence between workers and domains, whereas in the Unix version the number of workers is fixed at 24.
Unix version:
Domainslib version: