ssbc / ssb-tribes2

SSB private groups with ssb-db2

Crash resistance for excludeMembers #123

Closed Powersource closed 12 months ago

Powersource commented 1 year ago

For exclusion we post 3 different messages:

  1. First the group/exclude-member message. We don't need to recover from this: if we crash on this step, the user can tell and they just have to try again.
  2. Then a group/init to init the new epoch. To recover from a crash here, hopefully it's enough to look for a message from step 1 (exclude-member). But hmm, when should we search for that? When excludeMembers is called again with the exact same args? Should excludeMembers maybe just post exclude-member, with msgs 2. and 3. left to listeners?
  3. Lastly all remaining members are re-added with group/add-member messages. The lib/epoch function getMissingMembers is probably very helpful here.
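
The member bookkeeping in step 3 can be sketched as a set difference. This is only an illustration: `missingMembers` and its arguments are hypothetical names, and it is not the actual signature or behaviour of getMissingMembers in lib/epoch.

```javascript
// Hypothetical helper, NOT the real lib/epoch API.
// Returns the members of the old epoch who should be in the new epoch
// (i.e. were not excluded) but have not been added to it yet.
function missingMembers(oldMembers, excluded, newMembers) {
  const excludedSet = new Set(excluded)
  const alreadyAdded = new Set(newMembers)
  return oldMembers.filter(
    (id) => !excludedSet.has(id) && !alreadyAdded.has(id)
  )
}

// Example: we crashed after re-adding only alice to the new epoch.
const missing = missingMembers(
  ['@alice', '@bob', '@carol', '@oscar'], // members of the old epoch
  ['@oscar'], // excluded peers
  ['@alice'] // members already added to the new epoch
)
console.log(missing) // [ '@bob', '@carol' ]
```

If the result is empty, step 3 has already completed and there is nothing left to re-add.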

Todos:

Powersource commented 1 year ago

Gonna write some more notes to try to figure out exactly how I should go about this.

Goals:

  1. We don't want a dangling exclude-member message in an epoch that didn't end up having any effect.
  2. We don't want a dangling epoch init, in an epoch that didn't end up with any members.
  3. We want everyone to get added to the new epoch except for the excluded peers.
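
One way to read these goals is as a small decision function that says which step is still outstanding for a given exclusion. A minimal sketch, assuming a made-up state shape (`hasExcludeMsg`, `hasNewEpoch`, `missing` are hypothetical names, not fields from ssb-tribes2):

```javascript
// Hypothetical state shape; none of these field names come from ssb-tribes2.
//   hasExcludeMsg: an exclude-member message was posted in the old epoch
//   hasNewEpoch:   a new epoch was initialised after that message
//   missing:       remaining members not yet added to the new epoch
function nextRecoveryStep({ hasExcludeMsg, hasNewEpoch, missing }) {
  // Crash during step 1: nothing was posted, the user just retries.
  if (!hasExcludeMsg) return 'nothing-to-recover'
  // Goal 1: a dangling exclude-member message needs its new epoch.
  if (!hasNewEpoch) return 'init-new-epoch'
  // Goals 2 and 3: a new epoch without all its members needs re-adds.
  if (missing.length > 0) return 'add-missing-members'
  return 'done'
}

console.log(
  nextRecoveryStep({ hasExcludeMsg: true, hasNewEpoch: false, missing: [] })
) // init-new-epoch
```

Because the function only looks at the observed state, it does not care who posted the original exclude-member message.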

Questions:

  1. Do we want to help others with their failed exclusions? Maybe yeah? Since the group/epoch is a common resource and one person crashing might break the group for all of us.
  2. When do we fix a broken state? When calling the function again? If other people should be able to fix it too, then they'll want a listener. Do we want to use that listener for ourselves as well? A listener would only check again on restart, is that fine?

staltz commented 1 year ago

> Do we want to help others with their failed exclusions? Maybe yeah? Since the group/epoch is a common resource and one person crashing might break the group for all of us.

Tough question, but I'm also leaning towards anyone helping proceed with the exclusion, simply because that dangling exclude-member msg may be confusing.

> 1. When do we fix a broken state? When calling the function again? If other people should be able to fix it too, then they'll want a listener. Do we want to use that listener for ourselves as well? A listener would only check again on restart, is that fine?

Being eager about it shouldn't be a problem, because of the "same membership" forked epoch resolution. So if admin A tried to exclude Oscar but stopped in between, then admins B and C can both proceed to do it. They will create two forked epochs, but those will have the same membership set, and then the tie-breaking rule applies.
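
To illustrate why two helpers finishing the same exclusion is harmless, here is a sketch of a deterministic tie-break between forked epochs with identical membership. The actual tie-breaking rule is not stated in this thread, so the sketch assumes one: the epoch with the lexicographically smaller `id` wins. The `id` and `members` fields are hypothetical too.

```javascript
// True when both epochs contain exactly the same set of members.
function sameMembers(a, b) {
  const setA = new Set(a.members)
  const setB = new Set(b.members)
  return setA.size === setB.size && [...setA].every((id) => setB.has(id))
}

// ASSUMED rule for illustration only: the smaller id wins.
// Returns null when memberships differ, since that case needs another rule.
function pickWinner(a, b) {
  if (!sameMembers(a, b)) return null
  return a.id < b.id ? a : b
}

// Admins B and C both finish A's exclusion of Oscar.
const epochB = { id: '%epochB', members: ['@a', '@b', '@c'] }
const epochC = { id: '%epochC', members: ['@c', '@a', '@b'] }

console.log(pickWinner(epochB, epochC).id) // %epochB
console.log(pickWinner(epochC, epochB).id) // %epochB
```

What matters is that every peer picks the same winner regardless of the order in which it sees the two forks.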

In terms of code, I don't know how to organize it.

Powersource commented 1 year ago

> Tough question, but I'm also leaning towards anyone helping proceed with the exclusion, simply because that dangling exclude-member msg may be confusing.

Yeah, I think I basically ended up going with being agnostic about who caused the broken state.

> Being eager about it shouldn't be a problem, because of the "same membership" forked epoch resolution.

I think I was about to try the eager solution as well but ended up deciding against it, since most/all of the recovery logic uses long-ish timeouts, which would make regular function usage way too slow.