In GitLab by @shanedsnyder on Aug 13, 2020, 18:56
For na+sm:
I can reproduce this one on my laptop. It takes a similar number of processes to trigger reliably. Basically, some processes are able to create the group fine, but others repeatedly hit lookup errors (group members must successfully look up all group member addresses for group create to succeed).
As you can see, lookups ultimately fail due to a sendmsg() error, specifically ETOOMANYREFS. That one was new to me, but I'm not a well-versed sockets programmer. Digging more:
ETOOMANYREFS
This error can occur for sendmsg(2) when sending a file
descriptor as ancillary data over a UNIX domain socket (see
the description of SCM_RIGHTS, above). It occurs if the
number of "in-flight" file descriptors exceeds the
RLIMIT_NOFILE resource limit and the caller does not have the
CAP_SYS_RESOURCE capability. An in-flight file descriptor is
one that has been sent using sendmsg(2) but has not yet been
accepted in the recipient process using recvmsg(2).
This error is diagnosed since mainline Linux 4.5 (and in some
earlier kernel versions where the fix has been backported).
In earlier kernel versions, it was possible to place an
unlimited number of file descriptors in flight, by sending
each file descriptor with sendmsg(2) and then closing the file
descriptor so that it was not accounted against the
RLIMIT_NOFILE resource limit.
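To make that concrete, here is a minimal sketch (plain POSIX, not Mercury code) of how a descriptor gets passed as SCM_RIGHTS ancillary data over a UNIX domain socket; presumably this is the same sendmsg() path that na+sm address lookup exercises. Every message like this counts as an "in-flight" descriptor until the receiver picks it up with recvmsg():

```c
/* Minimal sketch: pass one file descriptor over a connected AF_UNIX
 * socket as SCM_RIGHTS ancillary data.  The descriptor stays "in flight"
 * (and counts toward the RLIMIT_NOFILE-derived limit on Linux >= 4.5)
 * until the peer receives it with recvmsg(). */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_fd(int sock, int fd_to_send)
{
    char dummy = 'x';                      /* must send at least one byte */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };

    union {                                /* aligned control buffer */
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctrl;
    memset(&ctrl, 0, sizeof(ctrl));

    struct msghdr msg = {
        .msg_iov        = &iov,
        .msg_iovlen     = 1,
        .msg_control    = ctrl.buf,
        .msg_controllen = sizeof(ctrl.buf),
    };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    /* Fails with errno == ETOOMANYREFS once too many descriptors are
     * already in flight. */
    return (sendmsg(sock, &msg, 0) < 0) ? -1 : 0;
}
```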
So this is specifically related to too many file descriptors being sent with sendmsg(). On my system the max limit for RLIMIT_NOFILE is 1024. It looks like we're just creating too many 'in-flight' file descriptors for the system to handle. I guess it makes sense we'd be more likely to hit that error at job sizes of ~32, since 32 processes each looking up 32 member addresses gives 32*32 = 1024, right at the limit.
I'm not sure how to handle this one. We just get a HG_PROTOCOL_ERROR at the SSG level but I suppose we could consider some sort of retry loop.
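If we went the retry route, I'm imagining something roughly like the sketch below; lookup_with_retry() and lookup_member() are hypothetical placeholders standing in for whatever call actually fails, not an existing Mercury/Margo/SSG API:

```c
/* Hypothetical sketch only: retry a member address lookup with a short
 * exponential backoff so the peer has a chance to recvmsg() some of its
 * in-flight descriptors.  lookup_member() is a made-up placeholder for
 * whatever lookup call actually returns the protocol error. */
#include <unistd.h>

#define LOOKUP_MAX_RETRIES 8

extern int lookup_member(void *ctx, const char *addr_str); /* placeholder */

static int lookup_with_retry(void *ctx, const char *addr_str)
{
    int rc = -1;
    for (int attempt = 0; attempt < LOOKUP_MAX_RETRIES; attempt++) {
        rc = lookup_member(ctx, addr_str);
        if (rc == 0)
            return 0;                      /* lookup succeeded */
        usleep(1000u << attempt);          /* back off: 1 ms, 2 ms, ... */
    }
    return rc;                             /* still failing after retries */
}
```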
In GitLab by @shanedsnyder on Aug 13, 2020, 21:20
I don't understand the shared memory address lookup code path well enough to know whether this is something that you can work around with retries or something else. It could also be that the Mercury philosophy is to rely on users to work around these errors, in which case we'd have to figure out if there's anything we could do within Margo or SSG.
@soumagne does anything seem obvious to you? I'm sure there are higher Mercury priorities, but I could see users hitting this from time to time, just trying to run larger SSG examples on a single node. Although, I guess any Mercury use case that concurrently looks up many na+sm addresses could trigger this -- that could be an issue in multi-node tests, potentially, since Mercury can still fall back to shared memory for node-local communication.
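One related knob, assuming the Mercury 2.x hg_init_info API (the auto_sm field; exact fields may differ between versions): a caller that wants to keep node-local traffic off na+sm entirely could leave the shared-memory fallback disabled, along these lines:

```c
/* Sketch, assuming the Mercury 2.x hg_init_info API: initialize an HG
 * class with the shared-memory (na+sm) fallback left disabled so all
 * node-local traffic stays on the primary transport. */
#include <string.h>
#include <mercury.h>

hg_class_t *init_without_sm(const char *info_string, hg_bool_t listen)
{
    struct hg_init_info init_info;
    memset(&init_info, 0, sizeof(init_info));   /* zero-init all fields */
    init_info.auto_sm = HG_FALSE;               /* no na+sm fallback */

    return HG_Init_opt(info_string, listen, &init_info);
}
```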
In GitLab by @shanedsnyder on Aug 13, 2020, 21:30
Also, I tried to reproduce the errors with ofi+sockets and 48 processes but am not having any luck. Even bumping up to 64 works reliably for me.
Are you just using the latest tagged versions in Spack for everything (mercury, argobots, margo, ofi, ssg)? I was testing using manual installs of master branches of everything, but can try having Spack build as well.
In GitLab by @soumagne on Aug 13, 2020, 22:26
Ah ok, that definitely looks like a na+sm issue, I had not seen that one yet. No worries, I can take a look :) Do you mind filing a bug report on GitHub? How do you look up addresses? I imagine you're doing a lot of lookups at the same time here for it to trigger that issue; could you please give me more details about it? If you have a chance, it would be interesting to know if you see the same issue with the previous mercury 2.0a1.
In GitLab by @shanedsnyder on Aug 14, 2020, 08:50
Sure thing, I'll get you more details on a Mercury GitHub issue shortly. Thanks @soumagne!
In GitLab by @shanedsnyder on Aug 14, 2020, 13:46
FWIW, I re-ran tests after adding some code to modify the system limit for RLIMIT_NOFILE, and things do run more reliably. I don't think that's the long-term solution here, just wanted to see if we hit any further problems at this scale, assuming we can get past this lookup issue.
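The change just bumps the soft RLIMIT_NOFILE value toward the hard limit at startup; roughly something like this sketch (standard getrlimit()/setrlimit(), not necessarily the exact code I ran):

```c
/* Sketch: raise the soft RLIMIT_NOFILE limit to the hard limit so more
 * file descriptors can be in flight at once. */
#include <stdio.h>
#include <sys/resource.h>

static int raise_nofile_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return -1;
    }
    rl.rlim_cur = rl.rlim_max;      /* soft limit up to the hard limit */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}
```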
In GitLab by @roblatham00 on Aug 14, 2020, 15:45
GitHub issue: https://github.com/mercury-hpc/mercury/issues/385
In GitLab by @roblatham00 on Sep 16, 2020, 09:13
A lot of the attention on this issue has been on the file descriptors used in the SM module. The general theme of "what happens if we launch a ton of SSG providers", though, still poses some challenges. Here's a recent post to mochi-devel:
| Header   | Value |
|----------|-------|
| From     | "Sim, Hyogi" <simh@ornl.gov> |
| To       | mochi-devel@lists.mcs.anl.gov |
| Subject  | [Mochi-devel] [SSG] pmix initialization failure |
| Date     | Tue, 15 Sep 2020 15:59:23 +0000 |
I am initializing an SSG group using pmix on Summit. The initialization works as expected, but only up to a certain number of compute nodes (~256 nodes). The group initialization always seems to fail with 512+ nodes. Assuming that ssg itself has been tested at a larger scale, I am wondering if you see any obvious problems in my code below.
In GitLab by @roblatham00 on Aug 12, 2020, 16:01
This limitation does not seem to be in SSG itself, but I am beating up on SSG, so that's where I noticed it. If I try to start up lots of SSG providers on one node (my laptop), I get errors:
result:
'sockets' gets a little bit further: it at least dumps the group membership information for all 48 members before reporting