mochi-hpc / mochi-ssg

Scalable Service Groups (SSG), a group membership service for Mochi

scalability issue in `ssg_group_observe` #23

Open shanedsnyder opened 3 years ago

shanedsnyder commented 3 years ago

In GitLab by @roblatham00 on Aug 12, 2020, 17:04

[plot: ssg-observe]

I have added timing to the ssg-observe-group.c test to measure how long it takes to observe an ssg group for varying numbers of processes. I am collecting timings on all MPI processes and then computing the min, max, and average times to observe the group.

I am concerned that as the number of processes increases, the distribution of response times gets larger and larger. Notice that the min time is fairly consistent -- some process is indeed observing the group nice and speedy-like. The max time grows, as does the average time.

This distribution of timing suggests to me that something is serializing the observe request.
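
For reference, a minimal sketch of the measurement described above (not the actual ssg-observe-group.c code; the ssg_group_observe() signature shown is an assumption and may differ from the SSG version used in these runs):

```c
/* Minimal sketch: each MPI rank times its own observe call, then the
 * root reduces min/max/average across ranks. The ssg_group_observe()
 * signature is assumed and may not match the SSG version in use. */
#include <stdio.h>
#include <mpi.h>
#include <margo.h>
#include <ssg.h>

static void time_observe(margo_instance_id mid, ssg_group_id_t gid)
{
    double t0 = MPI_Wtime();
    int ret = ssg_group_observe(mid, gid);   /* assumed signature */
    double elapsed = MPI_Wtime() - t0;
    if (ret != SSG_SUCCESS)
        MPI_Abort(MPI_COMM_WORLD, 1);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* reduce per-rank elapsed times to min, max, and average */
    double tmin, tmax, tsum;
    MPI_Reduce(&elapsed, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&elapsed, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&elapsed, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("observe time: min %f max %f avg %f\n",
               tmin, tmax, tsum / nprocs);
}
```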

shanedsnyder commented 3 years ago

In GitLab by @roblatham00 on Aug 21, 2020, 10:35

Here's the same experiment on Summit with the verbs transport. The 128 providers are running on two nodes in this case, and the observers also run across two nodes.

[plot: ssg-observe-summit]

Again, the best case is pretty steady, while the worst case keeps getting worse. We are still several orders of magnitude from timing out... and yet I cannot turn the crank one more iteration: trying to observe with 512 observers results in a 15-minute job getting the ol' terminate signal. It is unclear whether I am seeing a hang or just a very slow response.

for the 256 client case, here is the histogram of client observe times:

0.011148-0.043261 : 144
0.043261-0.075374 : 58
0.075374-0.107487 : 21
0.107487-0.139600 : 30
0.139600-0.171713 : 3

Little bit of a fat tail in the 128 client case, too:

0.012870-0.033966 : 99
0.033966-0.055062 : 28 
0.055062-0.076158 : 0 
0.076158-0.097254 : 0 
0.097254-0.118350 : 1 
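
For context, the histograms in this thread are equal-width bins over the per-rank observe times; a hypothetical sketch of that bucketing (not the actual test code):

```c
/* Hypothetical sketch of the equal-width binning behind the
 * histograms above: nbins buckets spanning [min, max] of the
 * gathered per-rank observe times. Not the actual test code. */
#include <stdio.h>

static void print_histogram(const double *times, int n, int nbins)
{
    double lo = times[0], hi = times[0];
    for (int i = 1; i < n; i++) {
        if (times[i] < lo) lo = times[i];
        if (times[i] > hi) hi = times[i];
    }
    double width = (hi - lo) / nbins;
    for (int b = 0; b < nbins; b++) {
        double lower = lo + b * width;
        double upper = lower + width;
        int count = 0;
        for (int i = 0; i < n; i++)
            /* last bin is closed on the right so the max lands somewhere */
            if (times[i] >= lower && (times[i] < upper || b == nbins - 1))
                count++;
        printf("%f-%f : %d\n", lower, upper, count);
    }
}
```
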
shanedsnyder commented 3 years ago

In GitLab by @roblatham00 on Aug 21, 2020, 16:29

tcmalloc from google perf tools shifts the histogram a little bit, but not dramatically. I'd like to run a few dozen trials of each configuration before making any definitive judgements:

0.012774-0.047701 : 170
0.047701-0.082628 : 44
0.082628-0.117555 : 34
0.117555-0.152482 : 7 
0.152482-0.187409 : 1 
shanedsnyder commented 3 years ago

In GitLab by @roblatham00 on Aug 24, 2020, 09:33

mentioned in merge request !11

shanedsnyder commented 3 years ago

In GitLab by @roblatham00 on Sep 1, 2020, 20:14

I almost think this should be part of #22 ...

On Theta, with 64 providers per node (that's one provider per CPU), runs with 1 and 2 observer processes complete. Runs with 4 or more observer processes, however, result in a timeout:

margo_forward_timed(handle, &observe_req, SSG_DEFAULT_OP_TIMEOUT): HG_TIMEOUT
unable to send observe request (ret: -1; 2182678278126458487 (nil) (nil))
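
Not a fix, but one way to tell a hang from a very slow response would be to retry with a growing timeout instead of giving up after the first HG_TIMEOUT; a hypothetical sketch (the real forward happens inside ssg_group_observe, so the handle and request here are placeholders):

```c
/* Hypothetical diagnostic: retry the forward with a doubling timeout
 * rather than failing on the first HG_TIMEOUT. The handle and request
 * are placeholders for what ssg builds internally. */
#include <margo.h>

static hg_return_t forward_with_retries(hg_handle_t handle, void *in_struct,
                                        double timeout_ms, int max_tries)
{
    hg_return_t hret = HG_TIMEOUT;
    for (int attempt = 0; attempt < max_tries; attempt++) {
        hret = margo_forward_timed(handle, in_struct, timeout_ms);
        if (hret != HG_TIMEOUT)
            break;              /* success or a non-timeout error */
        timeout_ms *= 2.0;      /* back off and try again */
    }
    return hret;
}
```
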
shanedsnyder commented 3 years ago

In GitLab by @roblatham00 on Sep 1, 2020, 20:23

Working on the hypothesis that one or more ssg providers are getting swamped by observe requests, I repeated this shared-memory experiment, using ssg_group_observe_target in one case and ssg_group_observe in the other, with 16 providers and an ever larger number of observers (a rough sketch of the target idea is at the end of this comment):

[plot: ssg-observe-target-vs-random]

I will try again with a larger number of providers. I must have had more providers in the previous example.

Maybe if one squints one sees the "target" approach with a lower worst case? Here are the distributions of client times:

no target:

0.001948-0.031377 : 19
0.031377-0.060806 : 22
0.060806-0.090234 : 11
0.090234-0.119663 : 5
0.119663-0.149092 : 7

explicit target:

0.005078-0.023565 : 17
0.023565-0.042052 : 12
0.042052-0.060538 : 11
0.060538-0.079025 : 17
0.079025-0.097512 : 7
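
For reference, here is the rough shape of the "explicit target" idea compared above: each observer picks a different group member so the observe RPCs are spread across providers rather than all landing on the same one. The ssg_group_observe_target() signature, the lookup_member_addr_str() helper, and the rank-modulo-group-size policy are all assumptions for illustration.

```c
/* Sketch only: spread observe requests across group members instead
 * of letting every observer contact the same provider. The target
 * API signature and the address-lookup helper are assumed. */
#include <mpi.h>
#include <margo.h>
#include <ssg.h>

/* hypothetical helper: address string of group member i */
extern char *lookup_member_addr_str(ssg_group_id_t gid, int i);

static int observe_spread(margo_instance_id mid, ssg_group_id_t gid,
                          int group_size)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int target = rank % group_size;               /* one possible policy */
    char *addr = lookup_member_addr_str(gid, target);

    return ssg_group_observe_target(mid, gid, addr); /* assumed signature */
}
```
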
shanedsnyder commented 3 years ago

In GitLab by @roblatham00 on Sep 2, 2020, 10:00

Yeah, I guess we can put this "observers slammed one provider" theory to rest: sockets, 32 providers, up to 64 observers:

[plot: ssg-observe-sockets-target-vs-random]

Maybe a little benefit? Not dramatic.

no target:

0.313952-0.542286 : 9
0.542286-0.770620 : 22
0.770620-0.998955 : 23
0.998955-1.227289 : 6
1.227289-1.455623 : 4

explicit target:

0.251246-0.433424 : 13
0.433424-0.615601 : 26
0.615601-0.797778 : 20
0.797778-0.979956 : 3
0.979956-1.162133 : 2

Maybe a little improvement in tail latency?

shanedsnyder commented 3 years ago

In GitLab by @roblatham00 on Jan 7, 2021, 13:27

We identified an issue with margo_forward_timed on powerpc (https://xgitlab.cels.anl.gov/sds/margo/-/issues/68), so I should re-run this experiment with either plain margo_forward (as in 0.4.3.1) or with argobots@main. However, we would only expect the Summit data to change; Theta and my laptop should show no difference.
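
For clarity, the two re-run variants mentioned above differ only in which forward call is used; a placeholder sketch (the handle and request struct stand in for what ssg builds internally):

```c
/* Placeholder sketch of the two variants: the timed forward
 * (suspected buggy on powerpc) vs. plain margo_forward(), which
 * blocks until the RPC completes. */
#include <margo.h>

static hg_return_t forward_observe(hg_handle_t handle, void *observe_req,
                                   double timeout_ms, int use_timed)
{
    if (use_timed)
        return margo_forward_timed(handle, observe_req, timeout_ms);
    return margo_forward(handle, observe_req);   /* plain, untimed forward */
}
```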