On summit with argobots-1.1, running the test with jsrun -r1 -n1 -a1
argobots@main:
Hunh. So why the high CPU time, I wonder? I'm not sure if I structured something wrong in the benchmark or if something is spinning that isn't supposed to.
Glad to see that performance is plateauing rather than dropping after hitting the peak at least.
I wondered if the CPU usage was coming from the thread joins in the main routine (since we don't normally use that particular primitive in a Mochi server; the threads are detached). I tested that by adding a margo_thread_sleep() in the main() routine before entering the join loop, though, and it didn't make a difference in CPU consumption.
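(For context, the tweak looked roughly like this; mid, num_ults, and the ults[] array are illustrative names, not the exact benchmark code:)

/* pause before joining the worker ULTs; if the join primitive were the source
 * of the spinning, CPU consumption should change with this in place (it did not) */
margo_thread_sleep(mid, 1000.0);          /* 1000 ms pause before the join loop */
for (int i = 0; i < num_ults; i++) {
    ABT_thread_join(ults[i]);
    ABT_thread_free(&ults[i]);
}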
crusher is of course bizarre
The crusher graph (the rate, not the CPU usage) is what we would hope for, I think. It should scale linearly in a perfect world.
Linearly... to a point. I cranked up the number of ULTs to see what happened.
Maybe the busy spinning is just an artifact of the Mercury progress loop in Margo being cautious about the possibility of other ULTs needing to execute. Can test that by initializing margo with a progress thread (in that case the loop should be allowed to sleep).
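(For reference, the two initialization modes being compared look roughly like this; the margo_init() arguments are address, mode, use_progress_thread, and rpc_thread_count, and the address string and thread counts here are only illustrative:)

/* no dedicated progress thread: the Mercury progress loop runs in the caller's
 * ES and may be reluctant to sleep in case other ULTs need to execute */
mid = margo_init("na+sm://", MARGO_CLIENT_MODE, 0, 0);

/* dedicated progress thread: the progress loop is free to block while waiting */
mid = margo_init("na+sm://", MARGO_CLIENT_MODE, 1, 0);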
Activating a progress thread does fix the CPU issue, but it also seems to break the timers; it looks like there is another (different) problem to sort out in that mode.
I'm going to take a crack at a futures version.
Where are we with this branch? Can we merge it? Is there more work to do?
Ok with me. I think we stalled because some of the results were confusing, but I think we should just go ahead and merge it. Some of the summit issues are likely fixed by changes that have happened in argobots@main since we were running these tests.
I'll let you do the merge @roblatham00 if you are comfortable with how the benchmarks look; I haven't tried running this in a while.
some of it could be squashed, not a huge deal though
Re-running the experiment with the latest code on summit
$ spack find --loaded | less
-- linux-rhel8-power9le / gcc@11.1.0 ----------------------------
argobots@main
autoconf@2.69
cereal@1.3.2
cmake@3.20.2
fmt@8.1.1
gnuconfig@2021-08-14
json-c@0.15
libfabric@1.15.1
m4@1.4.18
mercury@master
mochi-bedrock@develop
mochi-margo@develop
mochi-ssg@develop
mochi-thallium@develop
nlohmann-json@3.10.4
pkg-config@0.29.2
spdlog@1.9.2
tclap@1.2.2
but Argobots' last commit was back in February, so I don't expect to see any change. Time for an Argobots issue, or are they aware already?
from a summit interactive node:
$ ../tests/sweep-abt-eventual-bench.sh jsrun -r1 -n1 -a1 perf-regression/abt-eventual-bench
Looking back at February's discussion:
Before merging this we need to make sure the benchmark is working correctly, clean up the code a little, and capture instructions in a README.
Oh right, Ok. Sorry, I got issues mixed up on that.
I don't think there is any Argobots follow up. The CPU usage is high, but after thinking on it some more I believe this is a margo behavior. If all of the threads are running in the same pool, Margo may not be allowing Mercury to sleep during the progress calls. The CPU utilization is likely lower if we init with a dedicated progress thread.
Thanks for getting this wrapped up @roblatham00 !
Kind of a mess... does this "google perf tool" output confirm your hypothesis? Lots of time spent in __margo_hg_progress_fn and its children, particularly NA_Progress.
Yeah, it does. That explains the CPU time, I think.
One last kick on this dead horse: modified benchmark to init margo with one dedicated progress thread:
margo_init("na+sm://", MARGO_CLIENT_MODE, 1, 1);
and the end result is what you'd hope to see: just about all samples land on futex_wait:
Perfect!
benchmark reports low cpu usage too (meant to include that)
#<num_ults> <test_duration_sec> <interval_msec> <ops/s> <cpu_time_sec>
100 5 1 1081.000000 0.120891
This adds a benchmark called abt-eventual-benchmark that measures how well we can create/wait/set/destroy concurrent eventuals. It approximates a concurrent Mochi provider workload by creating N concurrent threads, each of which (in a loop) creates detached sub-threads that call margo_thread_sleep() for a specified amount of time before setting an eventual and exiting. The sleep is a stand-in for waiting on network activity.

Example, with a duration of 5 seconds, a sleep interval of 1 ms, and 16 ULTs:
The above example shows 16 ULTs getting nearly 16,000 operations per second, which is pretty much ideal (each ULT can do at most 1,000 per second if everything is working right, because there is a 1 ms delay in each operation). The last column shows CPU time: in this case a 5 second run consumed 4.2 seconds of user-space CPU time, which is unexpectedly high.
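To make the pattern above concrete, here is a minimal sketch of one iteration of a worker ULT's loop (the struct, function names, and argument handling are illustrative, not the exact benchmark code):

#include <abt.h>
#include <margo.h>

/* hypothetical argument bundle handed to the detached sub-ULT */
struct op_args {
    margo_instance_id mid;
    double            sleep_ms;
    ABT_eventual      ev;
};

/* detached sub-ULT: simulated network wait, then signal completion */
static void sub_ult(void *arg)
{
    struct op_args *a = (struct op_args *)arg;
    margo_thread_sleep(a->mid, a->sleep_ms);   /* stand-in for network activity */
    ABT_eventual_set(a->ev, NULL, 0);
}

/* one iteration of a worker ULT's create/wait/free loop */
static void one_op(margo_instance_id mid, ABT_pool pool, double sleep_ms)
{
    ABT_eventual ev;
    ABT_eventual_create(0, &ev);               /* no payload, just a signal */
    struct op_args args = {mid, sleep_ms, ev};
    /* NULL for the thread handle makes the sub-ULT unnamed, i.e. detached */
    ABT_thread_create(pool, sub_ult, &args, ABT_THREAD_ATTR_NULL, NULL);
    ABT_eventual_wait(ev, NULL);               /* block until the sub-ULT signals */
    ABT_eventual_free(&ev);
}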
These results are most interesting if you sweep across a range of ULT counts to see how behavior changes with scale. There are scripts included in this PR to do this and plot the results for an example scenario. The scripts are in the tests/ subdirectory. The first takes the eventual executable as its argument (so that it can account for different build directories). The second takes the .dat file produced by the first script as its argument and produces two plots.
Example plots, showing high CPU usage in each configuration. The eventual rate climbs linearly with the number of ULTs (as expected) until 500 concurrent ULTs, at which point throughput starts dropping rather than climbing or plateauing.
The above example is from a laptop with a debugging build; we need to repeat this on real systems.
This benchmark only uses one execution stream (the primary one) so throughput will be limited eventually by how many concurrent ULTs it can keep busy.
Before merging this we need to make sure the benchmark is working correctly, clean up the code a little, and capture instructions in a README.