Open shanedsnyder opened 3 years ago
In GitLab by @shanedsnyder on Jan 6, 2021, 09:37
This is really good feedback. I think you're suggestions make a lot of sense -- we definitely shouldn't be destroying the group as part of the leave/unobserve process, and so that should be an additional step. I'm also fine deprecating the observe functionality in favor of refresh. My only comment there is that in the current code, a lot of the group API calls only work if the caller is a member or an observer, otherwise the caller doesn't have the group view locally. So what happens if a caller calls one of those calls and they haven't refreshed the group at some point? It just fails (e.g., ENOVIEW)? Or the refresh happens implicitly? I think that would work, I'm not sure there's a lot of value in having a dedicated observe state for SSG processes in the current code, so it's not a big deal to remove.
I'm less sure about the last member ID suggestion, but am open to persuasion. My feeling is that it could complicate things in mutli-group situations if there are different SSG IDs for the same Mercury endpoint, but maybe that's unfounded. In the use case you bring up (a process re-joining a group), I feel like we could probably rely on SWIM incarnation numbers to help sort out that a previously failed process is rejoining if that's important?
In GitLab by @mdorier on Jan 6, 2021, 09:47
Regarding the functions that need a view, I'm in favor of either making them fail if the view is unavailable, or making them act as if the group was empty. We definitely don't want the first call to functions like ssg_get_group_size
to randomly send an RPC if the view isn't present, but telling users "you should call ssg_group_refresh
right after loading the group if you want to start using it" is easy enough.
Regarding the suggestion about member ids, I don't know how SWIM works internally (i.e. if SWIM runs on a per-group basis or across groups, in which case it would indeed need the same member id in all the groups). We can leave that for later.
In GitLab by @mdorier on Jan 6, 2021, 06:47
I have a program that uses
ssg_group_id_load
to load a group from a file, then doesssg_group_observe
to get the most recent view of the group. If later I try to dossg_group_destroy
, the program blocks indefinitely in this function. If I dossg_group_unobserve
thenssg_group_destroy
, it doesn't block anymore, butssg_group_destroy
fails with the following error:Error: SSG unable to find expected group ID
. I'm not sure whetherssg_group_unobserve
also destroys the group. I suppose it does, but its name doesn't sound like it should.I remember a similar problem with
ssg_group_leave
andssg_group_destroy
, wheressg_group_leave
would also destroy the group (even though leaving the group doesn't necessarily mean that we want to destroy it). This calls for more clarity and probably a change of API.I would suggest the following changes:
ssg_group_observe
andssg_group_unobserve
;ssg_group_refresh
for a non-member to get the most recent view of the group from one of the group members;ssg_group_refresh
should not fail if the caller is a member, it should simply be a no-op;ssg_group_leave
should not destroy the group id. It should simply send a notifications to other processes indicating that the caller is leaving the group. The group id can still be used to get information about the group (e.g. callingssg_group_refresh
);ssg_group_unobserve
would no longer exist andssg_group_leave
would not destroy the group,ssg_group_destroy
would be required no matter how the group id was created (loaded from a file, deserialized, created by MPI or PMIx...), and would be the only way to destroy a group.ssg_group_destroy
while being a member and without having calledssg_group_leave
, the group should be destroyed without notifying other processes (I think this is what happens now). Other processes will act as if the member had failed.int ssg_group_self_is_member(ssg_group_id_t gid)
function would be useful to ask whether the calling process is a member of the group (as opposed to an observer).Some other modifications I would suggest:
ssg_get_self_id
. I would suggest deprecating this function and add anssg_group_get_self_id
, which would also take a group id as argument (and returnSSG_MEMBER_ID_INVALID
if the caller is not a member). By default the member id would still be the hash of the address, but we could eventually add support for re-joining a group simply by changing the process's member id (taking the next one, for instance), and keeping track of the member id on a per-group basis.