On summit with argobots-1.1, running the test with jsrun -r1 -n1 -a1
argobots@main:
Hunh. So why the high CPU time, I wonder? I'm not sure if I structured something wrong in the benchmark or if something is spinning that isn't supposed to.
Glad to see that performance is plateauing rather than dropping after hitting the peak at least.
I wondered if the CPU usage was coming from the thread joins in the main routine (since we don't normally use that particular primitive in a Mochi server; the threads are detached). I tested that by adding a margo_thread_sleep() in the main() routine before entering the join loop, though, and it didn't make a difference in CPU consumption.
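(For context, the tweak looked roughly like this; mid, num_ults, and the ults[] array are illustrative names, not the exact benchmark code:)

/* pause before joining the worker ULTs; if the join primitive were the source
 * of the spinning, CPU consumption should change with this in place (it did not) */
margo_thread_sleep(mid, 1000.0);          /* 1000 ms pause before the join loop */
for (int i = 0; i < num_ults; i++) {
    ABT_thread_join(ults[i]);
    ABT_thread_free(&ults[i]);
}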
crusher is of course bizarre
The crusher graph (the rate, not the CPU usage) is what we would hope for, I think. It should scale linearly in a perfect world.
Linearly... to a point. I cranked up the number of ULTs to see what happened.
Maybe the busy spinning is just an artifact of the Mercury progress loop in Margo being cautious about the possibility of other ULTs needing to execute. Can test that by initializing margo with a progress thread (in that case the loop should be allowed to sleep).
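(For reference, the two initialization modes being compared look roughly like this; the margo_init() arguments are address, mode, use_progress_thread, and rpc_thread_count, and the address string and thread counts here are only illustrative:)

/* no dedicated progress thread: the Mercury progress loop runs in the caller's
 * ES and may be reluctant to sleep in case other ULTs need to execute */
mid = margo_init("na+sm://", MARGO_CLIENT_MODE, 0, 0);

/* dedicated progress thread: the progress loop is free to block while waiting */
mid = margo_init("na+sm://", MARGO_CLIENT_MODE, 1, 0);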
Activating a progress thread does fix the CPU issue, but it also seems to break the timers; it looks like there is another (different) problem to sort out in that mode.
I'm going to take a crack at a futures version.
Where are we with this branch? Can we merge it? Is there more work to do?
Ok with me. I think we stalled because some of the results were confusing, but I think we should just go ahead and merge it. Some of the summit issues are likely fixed by changes that have happened in argobots@main since we were running these tests.
I'll let you do the merge @roblatham00 if you are comfortable with how the benchmarks look; I haven't tried running this in a while.
some of it could be squashed, not a huge deal though
Re-running the experiment with the latest code on summit
$ spack find --loaded | less
-- linux-rhel8-power9le / gcc@11.1.0 ----------------------------
argobots@main
autoconf@2.69
cereal@1.3.2
cmake@3.20.2
fmt@8.1.1
gnuconfig@2021-08-14
json-c@0.15
libfabric@1.15.1
m4@1.4.18
mercury@master
mochi-bedrock@develop
mochi-margo@develop
mochi-ssg@develop
mochi-thallium@develop
nlohmann-json@3.10.4
pkg-config@0.29.2
spdlog@1.9.2
tclap@1.2.2
but Argobots' last commit was back in February, so I don't expect to see any change. Time for an Argobots issue, or are they aware already?
from a summit interactive node:
$ ../tests/sweep-abt-eventual-bench.sh jsrun -r1 -n1 -a1 perf-regression/abt-eventual-bench
Looking back at February's discussion:
Before merging this we need to make sure the benchmark is working correctly, clean up the code a little, and capture instructions in a README.
Oh right, Ok. Sorry, I got issues mixed up on that.
I don't think there is any Argobots follow up. The CPU usage is high, but after thinking on it some more I believe this is a margo behavior. If all of the threads are running in the same pool, Margo may not be allowing Mercury to sleep during the progress calls. The CPU utilization is likely lower if we init with a dedicated progress thread.
Thanks for getting this wrapped up @roblatham00 !
Kind of a mess... does this "google perf tool" output confirm your hypothesis? Lots of time spent in __margo_hg_progress_fn and its children, particularly NA_Progress.
Yeah, it does. That explains the CPU time, I think.
One last kick on this dead horse: modified benchmark to init margo with one dedicated progress thread:
margo_init("na+sm://", MARGO_CLIENT_MODE, 1, 1);
and the end result is what you'd hope to see: just about all samples land on futex_wait:
Perfect!
benchmark reports low cpu usage too (meant to include that)
#<num_ults> <test_duration_sec> <interval_msec> <ops/s> <cpu_time_sec>
100 5 1 1081.000000 0.120891
This adds a benchmark called abt-eventual-benchmark that measures how well we can create/wait/set/destroy concurrent eventuals. It approximates a concurrent Mochi provider workload by creating N concurrent threads, each of which (in a loop) creates detached sub-threads that call margo_thread_sleep() for a specified amount of time before setting an eventual and exiting. The sleep is a stand-in for waiting on network activity.

Example, with a duration of 5 seconds, a sleep interval of 1 ms, and 16 ULTs:
The above example shows 16 ULTs getting nearly 16,000 operations per second, which is pretty much ideal (each ULT can do at most 1,000 per second if everything is working right, because there is a 1 ms delay in each operation). The last column shows CPU time: in this case a 5 second run consumed 4.2 seconds of user-space CPU time, which is unexpectedly high.
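To make the pattern above concrete, here is a minimal sketch of one iteration of a worker ULT's loop (the struct, function names, and argument handling are illustrative, not the exact benchmark code):

#include <abt.h>
#include <margo.h>

/* hypothetical argument bundle handed to the detached sub-ULT */
struct op_args {
    margo_instance_id mid;
    double            sleep_ms;
    ABT_eventual      ev;
};

/* detached sub-ULT: simulated network wait, then signal completion */
static void sub_ult(void *arg)
{
    struct op_args *a = (struct op_args *)arg;
    margo_thread_sleep(a->mid, a->sleep_ms);   /* stand-in for network activity */
    ABT_eventual_set(a->ev, NULL, 0);
}

/* one iteration of a worker ULT's create/wait/free loop */
static void one_op(margo_instance_id mid, ABT_pool pool, double sleep_ms)
{
    ABT_eventual ev;
    ABT_eventual_create(0, &ev);               /* no payload, just a signal */
    struct op_args args = {mid, sleep_ms, ev};
    /* NULL for the thread handle makes the sub-ULT unnamed, i.e. detached */
    ABT_thread_create(pool, sub_ult, &args, ABT_THREAD_ATTR_NULL, NULL);
    ABT_eventual_wait(ev, NULL);               /* block until the sub-ULT signals */
    ABT_eventual_free(&ev);
}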
These results are most interesting if you sweep across a range of ULT counts to see how behavior changes with scale. There are scripts included in this PR to do this and plot the results for an example scenario. The scripts are in the tests/ subdirectory. The first takes the eventual executable as its argument (so that it can account for different build directories). The second takes the .dat file produced by the first script as its argument and produces two plots.
Example plots, showing high CPU usage in each configuration. The eventual rate climbs linearly with the number of ULTs (as expected) until 500 concurrent ULTs, at which point throughput starts dropping rather than climbing or plateauing.
The above example is from a laptop with a debugging build; we need to repeat this on real systems.
This benchmark only uses one execution stream (the primary one) so throughput will be limited eventually by how many concurrent ULTs it can keep busy.
Before merging this we need to make sure the benchmark is working correctly, clean up the code a little, and capture instructions in a README.