mochi-hpc / mochi-ssg

Scalable Service Groups (SSG), a group membership service for Mochi
Other
1 stars 1 forks source link

Problem with ssg_group_destroy and ssg_group_unobserve #28

Open shanedsnyder opened 3 years ago

shanedsnyder commented 3 years ago

In GitLab by @mdorier on Jan 6, 2021, 06:47

I have a program that uses ssg_group_id_load to load a group from a file, then does ssg_group_observe to get the most recent view of the group. If later I try to do ssg_group_destroy, the program blocks indefinitely in this function. If I do ssg_group_unobserve then ssg_group_destroy, it doesn't block anymore, but ssg_group_destroy fails with the following error: Error: SSG unable to find expected group ID. I'm not sure whether ssg_group_unobserve also destroys the group. I suppose it does, but its name doesn't sound like it should.

I remember a similar problem with ssg_group_leave and ssg_group_destroy, where ssg_group_leave would also destroy the group (even though leaving the group doesn't necessarily mean that we want to destroy it). This calls for more clarity and probably a change of API.

I would suggest the following changes:

Some other modifications I would suggest:

shanedsnyder commented 3 years ago

In GitLab by @shanedsnyder on Jan 6, 2021, 09:37

This is really good feedback. I think you're suggestions make a lot of sense -- we definitely shouldn't be destroying the group as part of the leave/unobserve process, and so that should be an additional step. I'm also fine deprecating the observe functionality in favor of refresh. My only comment there is that in the current code, a lot of the group API calls only work if the caller is a member or an observer, otherwise the caller doesn't have the group view locally. So what happens if a caller calls one of those calls and they haven't refreshed the group at some point? It just fails (e.g., ENOVIEW)? Or the refresh happens implicitly? I think that would work, I'm not sure there's a lot of value in having a dedicated observe state for SSG processes in the current code, so it's not a big deal to remove.

I'm less sure about the last member ID suggestion, but am open to persuasion. My feeling is that it could complicate things in mutli-group situations if there are different SSG IDs for the same Mercury endpoint, but maybe that's unfounded. In the use case you bring up (a process re-joining a group), I feel like we could probably rely on SWIM incarnation numbers to help sort out that a previously failed process is rejoining if that's important?

shanedsnyder commented 3 years ago

In GitLab by @mdorier on Jan 6, 2021, 09:47

Regarding the functions that need a view, I'm in favor of either making them fail if the view is unavailable, or making them act as if the group was empty. We definitely don't want the first call to functions like ssg_get_group_size to randomly send an RPC if the view isn't present, but telling users "you should call ssg_group_refresh right after loading the group if you want to start using it" is easy enough.

Regarding the suggestion about member ids, I don't know how SWIM works internally (i.e. if SWIM runs on a per-group basis or across groups, in which case it would indeed need the same member id in all the groups). We can leave that for later.