kquick / Thespian

Python Actor concurrency library
MIT License

HA Thespian #62

Open pjz opened 4 years ago

pjz commented 4 years ago

What features are needed to make Thespian more resilient to individual node loss? I see that https://github.com/thespianpy/Thespian/issues/21 is still open, 3.5 years later - is PySyncObj, as suggested in https://github.com/thespianpy/Thespian/issues/20 to (re)determine the Convention Leader, still the solution? And a global Registrar to coordinate Actors cross-node? Is that really all that's missing?

kquick commented 4 years ago

Thanks for the nudge, @pjz.

Thespian is already moderately resilient to failures in member nodes of the Convention; it's just the Convention Leader that is a singular component that must be running for the rest of the Convention to operate correctly.

The techniques provided by PySyncObj could be used to propagate convention-related information amongst potential convention leaders, but there are still higher level behaviors and operational concerns to work out. I'm going to start documenting those here as a way to gather and identify the information and decisions that must be made before the point where this work is undertaken:

Please feel free to update this issue with comments on either the above or with additional considerations not yet captured above. I'm particularly interested in use-cases where the ability to migrate Convention leadership would be useful and effective for people: there are a lot of potential aspects to this, so being able to focus the efforts on the most applicable use-cases will help drive some of the design decisions for this.

pjz commented 4 years ago

There are definitely a lot of potential edge cases, some of them easier to address than others. Maybe start with the easy solutions and work up from there?

FWIW, my use case is fairly simple: essentially a streaming web service with some, but not a lot, of synchronization necessary on the back end, with a very flat topology node-wise, and relatively few (under a dozen) nodes at that. Since it's so flat, any of them is capable of being CL, so I'd like to make sure that failover works.

IMO, once joined to the Convention, it should be possible to seamlessly swap ConventionLeaders without interruption at any time. I think depending on certain actors running on the ConventionLeader should be discouraged - instead, actors can advertise their utility via capabilities and globalNames, or perhaps on a ConventionRegistrar, and then be delegated to. This lessens the reliance on a single central system and also improves reliability, not only by spreading the service(s) across multiple nodes but also by making it easy to run multiple instances of any vital service.
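As a rough illustration of the capability-based delegation described above, the sketch below selects nodes by what they advertise rather than by whether they are the Convention Leader. `actorSystemCapabilityCheck` is the static hook Thespian consults when placing an Actor; everything else here (the `Registrar` class, the `'registrar'` capability name, the `eligible_nodes` helper and the node table) is hypothetical and only simulates the selection, without importing Thespian.

```python
class Registrar:
    """Hypothetical service actor; in real use it would subclass
    thespian.actors.Actor."""

    @staticmethod
    def actorSystemCapabilityCheck(capabilities, requirements):
        # Accept any node that advertises the (hypothetical) 'registrar'
        # capability, regardless of which node leads the Convention.
        return bool(capabilities.get('registrar', False))


def eligible_nodes(actor_class, nodes, requirements=None):
    """Hypothetical helper: names of nodes whose capabilities satisfy
    actor_class. Not part of Thespian; it only mimics the check."""
    return [name for name, caps in nodes.items()
            if actor_class.actorSystemCapabilityCheck(caps, requirements or {})]


# Illustrative capability tables for three nodes in a flat topology.
nodes = {
    'node-a': {'registrar': True},
    'node-b': {'registrar': True},
    'node-c': {},
}

candidates = eligible_nodes(Registrar, nodes)
```

With two nodes advertising the capability, losing either one still leaves a place for the service to run, which is the point of not tying it to the leader.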

I look forward to seeing how Thespian evolves - and helping with that evolution, if possible.

kquick commented 4 years ago

Good info, thanks @pjz. I'll go through your detailed responses in more depth soon for any followups.

I want to be careful not to overpromise and to be clear on the supportable scope of HA-related work but your insight is a good one: a relatively simple solution may solve 80% of the need and go a long way to helping people in this area. To that end, I appreciate your description of the use case you are solving, and look forward to hearing about any other use cases people might provide to help scope this work.

Thanks for your continued support and involvement!

pjz commented 4 years ago

I'm trying to put something into pre-production; to that end, I'd like to know what happens if the 'Convention Address.IPv4' capability is updated. Does it correctly point at a new Convention Leader? Could this be used to point existing ActorSystems at a new Convention Leader if the original goes down?

kquick commented 4 years ago

I haven't had time to try it for details, but in thinking about this I believe that the local Admin would need to have the updated convention leader address and there's no way to change this at the moment without restarting the local Admin.

pjz commented 4 years ago

Being able to change the Convention Address of a running AS seems like a good, fundamental first step, so I went and read a bunch of code, and it looks like the LocalConventionState already handles some amount of re-registration, but the logger isn't nearly so dynamic. My experiments ended up with some kind of infinite loop of log messages, but I'm still trying to figure out if the loop is due to the reported messages looping or the logger looping :)

ghost commented 3 years ago

@kquick -

"Currently all members identify the Convention Leader via a single IP address. Should this be expanded to a list of IP addresses? Should multicast addressing be used for attempting to identify a leader? Should broadcast conventions (e.g. port 5670) be used for leader identification?"

This is a relevant use case for something I am also POC'ing, and I happened to raise a bug ( #74 ) without seeing this first. IMHO, a similar feature is already present in actor system implementations in Scala, Java, etc. Will this be incorporated anytime soon?

kquick commented 3 years ago

I haven't had any time to work on HA/failover capabilities, so nothing is imminent. I think this is a fairly interesting idea to pursue, although largely academic since there haven't been a large number of people expressing interest in this or upvoting this particular issue. I'm happy to support anyone wanting to work on extending Thespian in this area, and I may do it someday myself, but I don't envision being able to work on this for the foreseeable future (i.e. at least the next couple of months).

ghost commented 3 years ago

Kevin,

I'll be happy to take a stab at this, and possibly create a PR for you to review. But I could possibly use some pointers to get off the ground.

Best. Arnab

kquick commented 3 years ago

That would be great, Arnab. There's a lot of scoping and discussion above in this issue, and quite a bit of that is intended to scope the overall issue but should not be a concern for the initial POC work. If I were to sketch out an initial POC, I would suggest:

  1. Changing the Convention Address.IPv4 attribute to take a list.
  2. On startup, the default convention leader is the first address in that list (let's call the others "Convention Supporters").
  3. When other Convention Supporters check in with the Convention Leader, the latter should recognize them as a Convention Supporter.
  4. Synchronize updates from the Convention Leader to the Convention Supporters.
  5. Convention Supporters should periodically check the health of the Convention Supporter/Leader preceding them in the list.
  6. Differentiate between active and offline Convention Leaders and Supporters in the list.
  7. Update step 5 to check both the preceding entry in the list and the active entry in the list (skipping inactive members, and the active member is the same as the preceding member if the preceding member is up).
  8. When the Convention Supporter whose preceding active member is the Convention Leader detects that the Leader is offline, it should change mode to become the Convention Leader, and send an internal message to all other Actors and Admins to notify them it is the new Convention Leader.
  9. Test all of the above to make sure the Convention can continue to run if the current Convention Leader goes offline.
  10. If a Convention Supporter comes online, it should try to register with the current Convention Leader by sending an internal Thespian message to all other entries in the list; the current Convention Leader should be the only one to respond.
      a. If the Convention Leader precedes the current Convention Supporter in the list, the Convention Supporter simply registers in the convention and begins to get updates from the Leader.
      b. If the Convention Leader follows the current Convention Supporter in the list, the Convention Supporter registers as in step 10a, and after it gets a convention update, it sends out a message to all other Convention members that it is the new Convention Leader. It also sends out a message to all Actors (duplicating step 8 above).
      c. If there is no Convention Leader and this is not the first entry in the list, log a failure message and exit.

The above is using a very simplistic election mechanism for convention leader; in the longer term, a more sophisticated election process should be used (e.g. RAFT) along with synchronization methods to ensure that actions occurring during the leadership changes aren't lost, and possibly even supporting multiple leaders. I think it will be helpful to defer these more sophisticated processes though until some of the basic functionality can be explored using the above. There are a lot of corner cases and timing issues not handled by the above as well (and it should not attempt to support AdminRouting or AdminRoutingTXOnly modes), so it won't be very robust, but it should help to ensure we have a good identification of what information should be shared/synchronized between leaders and reveal where there are unforeseen issues.
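The list-order election in steps 2 and 5 through 8 can be sketched as a small pure-Python simulation. This is only an illustration under simplifying assumptions, not the code in Thespian: the function names, the `'host:port'` address strings, and the `online` set (standing in for the results of the periodic health checks) are all hypothetical, and none of the messaging, synchronization, or timing concerns are modelled.

```python
def active_leader(addresses, online):
    """First online address in the ordered list acts as leader
    (steps 2 and 6). Returns None if nothing is online."""
    for addr in addresses:
        if addr in online:
            return addr
    return None


def preceding_active(addresses, me, online):
    """Nearest online entry ahead of `me` in the list (steps 5 and 7),
    or None when every earlier entry is offline."""
    idx = addresses.index(me)
    for addr in reversed(addresses[:idx]):
        if addr in online:
            return addr
    return None


def should_take_over(addresses, me, online):
    """Step 8, simplified: `me` becomes leader when it is online and
    no entry ahead of it in the list is online."""
    return me in online and preceding_active(addresses, me, online) is None


# Hypothetical three-member list; the first entry is the default leader.
addresses = ['10.0.0.1:1900', '10.0.0.2:1900', '10.0.0.3:1900']
online = set(addresses)
leader_before = active_leader(addresses, online)

# Simulate the leader going offline; only the next entry should take over.
online.discard('10.0.0.1:1900')
second_takes_over = should_take_over(addresses, '10.0.0.2:1900', online)
third_takes_over = should_take_over(addresses, '10.0.0.3:1900', online)
leader_after = active_leader(addresses, online)
```

Because every member applies the same rule to the same ordered list, exactly one Supporter concludes it should take over, provided they all agree on which entries are offline. That agreement is the part a real election protocol would have to guarantee.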

Naturally, you are free to chart a different path if you are working on a POC and I'm happy to support you either way. For ongoing discussion, please feel free to use the thespian mailing list, or if you wish to discuss lower-level issues that may not be of interest to the mailing list, you can reach me directly via gmail at s/kq/kq1q/ of my github username.

ghost commented 3 years ago

@kquick thanks for the details. I'll start going through the code to see exactly how the life cycle of the convention IP is handled.

I guess I'll start off by forking this repo so that I don't mess up something here. Once I have made progress, I'll share it either on the forum or offline, with a limited audience.

Best.

kquick commented 3 years ago

Update: @arnab-chanda has done a great job and as of https://github.com/kquick/Thespian/pull/79 we now have initial support for HA capabilities. The current leader selection methodology is fairly simple and we will probably look at using a more sophisticated algorithm in the future, but this implementation should be enough to let people start trying out this functionality and providing feedback on where there are things still needing attention.

This implementation should be backward compatible with existing Thespian; enabling HA is done simply by specifying the Convention Address.IPv4 capability as a list instead of a single string. Further information will be available as the documentation is updated (coming soon).
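A minimal sketch of the two configuration shapes, assuming the capability is named 'Convention Address.IPv4' as earlier in this thread and that each address is written as a `'host:port'` string. The addresses are made up, and `convention_addresses` is a hypothetical helper for illustration, not how Thespian handles the value internally.

```python
def convention_addresses(capabilities):
    """Hypothetical helper: return the configured convention addresses
    as a list, treating a single string as a one-element (non-HA) list."""
    value = capabilities.get('Convention Address.IPv4')
    if value is None:
        return []
    return list(value) if isinstance(value, list) else [value]


# Existing, non-HA form: a single convention leader address.
single = {'Convention Address.IPv4': '10.0.0.1:1900'}

# HA form: an ordered list; the first entry is the default leader.
ha = {'Convention Address.IPv4': ['10.0.0.1:1900', '10.0.0.2:1900']}

# Handing the capabilities to Thespian would look roughly like this
# (needs thespian installed; not executed in this sketch):
#   from thespian.actors import ActorSystem
#   asys = ActorSystem('multiprocTCPBase', capabilities=ha)

single_list = convention_addresses(single)
ha_list = convention_addresses(ha)
```

Since the single-string form is still accepted, existing configurations should not need to change unless HA is wanted.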

kquick commented 2 years ago

The initial support for this is available now in 3.10.6. It should be considered Beta: there is as yet no attempt to synchronize information between different leaders, only the ability to allow a leader to take over if the previously active leader exits.