Open shanedsnyder opened 3 years ago
In GitLab by @roblatham00 on Aug 21, 2020, 10:35
Here's the same experiment on Summit with the verbs transport. The 128 providers are running on two nodes in this case, and the observers also run across two nodes.
Again, the best case is pretty steady, and the worst case keeps getting worse. Still several orders of magnitude from timing out... and yet I cannot turn the crank one more iteration. Trying to observe with 512 observers results in a 15-minute job getting the ol' terminate signal. Unclear if I am seeing a hang or a very slow response.
For the 256-client case, here is the histogram of client observe times:
0.011148-0.043261 : 144
0.043261-0.075374 : 58
0.075374-0.107487 : 21
0.107487-0.139600 : 30
0.139600-0.171713 : 3
Little bit of a fat tail in the 128-client case, too:
0.012870-0.033966 : 99
0.033966-0.055062 : 28
0.055062-0.076158 : 0
0.076158-0.097254 : 0
0.097254-0.118350 : 1
In GitLab by @roblatham00 on Aug 21, 2020, 16:29
tcmalloc from Google perftools shifts the histogram a little, but not too dramatically. I'd like to run a few dozen trials of each before making any definitive judgements:
0.012774-0.047701 : 170
0.047701-0.082628 : 44
0.082628-0.117555 : 34
0.117555-0.152482 : 7
0.152482-0.187409 : 1
In GitLab by @roblatham00 on Aug 24, 2020, 09:33
mentioned in merge request !11
In GitLab by @roblatham00 on Sep 1, 2020, 20:14
I almost think this should be part of #22 ...
On Theta, with 64 providers per node (that's one provider per CPU), runs with 1 and 2 observer processes complete. 4 or more observer processes, however, result in a timeout:
margo_forward_timed(handle, &observe_req, SSG_DEFAULT_OP_TIMEOUT): HG_TIMEOUT
unable to send observe request (ret: -1; 2182678278126458487 (nil) (nil))
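For reference, a minimal sketch of roughly where that HG_TIMEOUT surfaces. The RPC id, input struct, and wrapper function below are placeholders inferred from the error text, not SSG's actual internals; only the margo calls themselves are real API.

```c
#include <stdio.h>
#include <margo.h>

/* Sketch only: observe_rpc_id and observe_req stand in for SSG's real
 * observe RPC and input struct. */
static int send_observe_request(margo_instance_id mid, hg_addr_t provider_addr,
                                hg_id_t observe_rpc_id, void *observe_req,
                                double timeout_ms)
{
    hg_handle_t handle;
    hg_return_t hret;

    hret = margo_create(mid, provider_addr, observe_rpc_id, &handle);
    if (hret != HG_SUCCESS)
        return -1;

    /* give up if the provider does not respond within timeout_ms */
    hret = margo_forward_timed(handle, observe_req, timeout_ms);
    margo_destroy(handle);

    if (hret == HG_TIMEOUT) {
        /* the branch reported above once 4 or more observer processes run */
        fprintf(stderr, "unable to send observe request: HG_TIMEOUT\n");
        return -1;
    }
    return (hret == HG_SUCCESS) ? 0 : -1;
}
```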
In GitLab by @roblatham00 on Sep 1, 2020, 20:23
Working on the hypothesis that one or more ssg providers are getting swamped by observe requests, I repeated this shared-memory experiment, but in one case we used ssg_group_observe_target and in the other we used ssg_group_observe: 16 providers and an ever larger number of observers.
I will try again with a larger number of providers; I must have had more providers in the previous example.
Maybe if one squints, one sees the "target" approach with a lower worst case? Here are the distributions of client times:
no target:
0.001948-0.031377 : 19
0.031377-0.060806 : 22
0.060806-0.090234 : 11
0.090234-0.119663 : 5
0.119663-0.149092 : 7
explicit target:
0.005078-0.023565 : 17
0.023565-0.042052 : 12
0.042052-0.060538 : 11
0.060538-0.079025 : 17
0.079025-0.097512 : 7
In GitLab by @roblatham00 on Sep 2, 2020, 10:00
Yeah, I guess we can put the "observers slammed one provider" theory to rest: sockets, 32 providers, up to 64 observers. Maybe a little benefit? Not dramatic.
no target:
0.313952-0.542286 : 9
0.542286-0.770620 : 22
0.770620-0.998955 : 23
0.998955-1.227289 : 6
1.227289-1.455623 : 4
explicit target:
0.251246-0.433424 : 13
0.433424-0.615601 : 26
0.615601-0.797778 : 20
0.797778-0.979956 : 3
0.979956-1.162133 : 2
Maybe a little improvement in tail latency?
In GitLab by @roblatham00 on Jan 7, 2021, 13:27
We identified an issue with margo_forward_timed
on powerpc ( https://xgitlab.cels.anl.gov/sds/margo/-/issues/68 ) so I should re-run this experiment with either plain margo_forward (as in 0.4.3.1) or with argobots@main
. However, we would only expect the summit data to change. Theta and my laptop should show no difference.
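If it helps anyone reproducing this, the proposed re-run amounts to swapping the timed forward for a plain blocking one; a sketch, assuming observe_req stands in for SSG's observe input struct:

```c
#include <margo.h>

/* Workaround sketch: forward the observe request with no timeout, as SSG
 * 0.4.3.1 effectively did, so the powerpc margo_forward_timed bug cannot
 * be a factor. Blocks until the provider responds. */
static hg_return_t forward_observe_untimed(hg_handle_t handle, void *observe_req)
{
    return margo_forward(handle, observe_req);
}
```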
In GitLab by @roblatham00 on Aug 12, 2020, 17:04
I have added timing to the ssg-observe-group.c test to measure how long it takes to observe an ssg group for varying numbers of processes. I am collecting timings on all MPI processes and then computing the min, max, and average time to observe the group. I am concerned that as the number of processes increases, the distribution of response times gets larger and larger. Notice that the min time is fairly consistent -- some process is indeed observing the group nice and speedy-like. The max time grows, as does the average time.
This distribution of timing suggests to me that something is serializing the observe request.
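For anyone reading along, the timing collection described above boils down to something like the following sketch; observe_the_group() is a placeholder for whatever ssg-observe-group.c actually calls, and the MPI reduction pattern is the point:

```c
#include <stdio.h>
#include <mpi.h>

/* Placeholder for the actual observe call in ssg-observe-group.c. */
extern void observe_the_group(void);

/* Each process times its own observe, then rank 0 reports min/max/avg. */
void report_observe_times(MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    double start = MPI_Wtime();
    observe_the_group();
    double elapsed = MPI_Wtime() - start;

    double tmin, tmax, tsum;
    MPI_Reduce(&elapsed, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, comm);
    MPI_Reduce(&elapsed, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
    MPI_Reduce(&elapsed, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, comm);

    if (rank == 0)
        printf("observe time: min %.6f  max %.6f  avg %.6f\n",
               tmin, tmax, tsum / nprocs);
}
```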