litaio / lita

ChatOps for Ruby.
https://www.lita.io
MIT License

Document high availability for Lita #159

Open · t33chong opened this issue 8 years ago

t33chong commented 8 years ago

@esigler @ranjib and I have been talking about how we can ensure that our Lita instance keeps running in the event of a host failure. It would be very helpful to have some information regarding a recommended deployment scenario for a high availability setup. I'll let them chime in with specific issues so that we can keep track of this discussion publicly.

brodock commented 8 years ago

Redis HA can be handled by using Sentinel. For Lita itself, I have no idea :/
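
On the Redis side, pointing Lita at a Sentinel-managed master should mostly be a matter of passing redis-rb's Sentinel options through `config.redis`. A minimal sketch, assuming a master group named `mymaster` and illustrative Sentinel hostnames:

```ruby
# lita_config.rb (sketch): config.redis is handed to Redis.new, so
# redis-rb's Sentinel options apply. "mymaster" and the hostnames
# below are placeholders for your own Sentinel setup.
Lita.configure do |config|
  # The URL host names the Sentinel master group, not a real server.
  config.redis[:url] = "redis://mymaster"
  config.redis[:sentinels] = [
    { host: "sentinel-1.example.com", port: 26379 },
    { host: "sentinel-2.example.com", port: 26379 },
    { host: "sentinel-3.example.com", port: 26379 }
  ]
  config.redis[:role] = :master
end
```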

ranjib commented 8 years ago

@brodock we have the same understanding as well. In the short term, Redis HA can be done using Sentinel. For the Lita server itself, I was thinking it would be nice to add consul/etcd-based leader election support. That would let us run multiple Lita servers in different DCs with only one active at any given time, and the others could take over if the leader dies. In the long term it would be nice to have some other kind of key-value store (CockroachDB, Cassandra, and Vitess all look good), because Redis clustering is limited for HA purposes. That might be overkill, though, as most of the Lita handlers we currently use store little state (and most of it can be recalculated).
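
For a sense of what the etcd side could look like, here is a rough sketch using etcd's v2 HTTP API (the key path, TTL, and endpoint are illustrative, not anything Lita provides):

```ruby
# Leader election primitive: create the key only if it doesn't already
# exist. etcd answers 201 Created to the winner of the race and
# 412 Precondition Failed to everyone else.
require "net/http"
require "socket"

uri = URI("http://127.0.0.1:2379/v2/keys/lita/leader")

request = Net::HTTP::Put.new(uri)
request.set_form_data(
  "value"     => Socket.gethostname, # record who the leader is
  "ttl"       => "15",               # lock expires if not refreshed
  "prevExist" => "false"             # atomic create-if-absent
)

response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }

if response.is_a?(Net::HTTPCreated)
  puts "won the election; this host should run lita"
else
  puts "another host is the leader; standing by"
end
```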

jimmycuadra commented 8 years ago

I mentioned this briefly to Tristan the other day, but I would approach this the same way the podmaster program works for the Kubernetes scheduler and controller manager. It uses a distributed lock via etcd to ensure that only one instance of the application is running at a time and that another one starts up if the one that was running stops. In short:

Each host periodically attempts to set the value of some key K to its own hostname with a TTL of T. It does the set via an atomic compare-and-swap to avoid race conditions. For each host H, there are three possible results:

  1. H is not running Lita. The key is successfully set because it did not exist. That means that H is the master and should start Lita.
  2. H is currently running Lita and the key already exists with the correct value. The TTL (T) is simply refreshed. Lita continues running.
  3. H is not running Lita and the key already exists with another host as the value. H does nothing.

This could probably be implemented with Redis instead of etcd; Redis's SET command can atomically set a key only if it doesn't already exist (NX) and give it an expiry (EX) in the same operation.
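
A minimal sketch of that loop with Redis (it assumes the redis gem; the key name, TTL, and the start/stop helpers are illustrative, not part of Lita's API):

```ruby
require "redis"
require "socket"

KEY  = "lita:leader" # illustrative lock key
TTL  = 15            # seconds before an un-refreshed lock expires
HOST = Socket.gethostname

# Refreshing the TTL has to be atomic with the ownership check, so it's
# done in a small Lua script (a separate GET then EXPIRE would race).
REFRESH = <<~LUA
  if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("expire", KEYS[1], ARGV[2])
  end
  return 0
LUA

redis  = Redis.new
leader = false

loop do
  if redis.set(KEY, HOST, nx: true, ex: TTL)
    # Case 1: the key didn't exist, so this host just became the master.
    leader = true
    start_lita unless lita_running? # hypothetical process-management hooks
  elsif redis.eval(REFRESH, keys: [KEY], argv: [HOST, TTL]) == 1
    # Case 2: we already hold the lock; the TTL was refreshed.
    leader = true
  else
    # Case 3: another host holds the lock; make sure Lita isn't running here.
    stop_lita if leader # hypothetical
    leader = false
  end
  sleep TTL / 3 # refresh well within the TTL
end
```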

As far as it being part of Lita itself, it would be possible, but I'd rather see it prototyped as either a separate program or a Lita plugin first.

For reference, documentation on the Kubernetes podmaster, which does this, can be found here: https://github.com/kubernetes/kubernetes/blob/release-1.1/docs/admin/high-availability.md#master-elected-components. The code for the podmaster program itself is here: https://github.com/kubernetes/contrib/tree/be436560df6fa839fb92a2f88ae4c4b7da4e58e4/pod-master

sciurus commented 8 years ago

If you want to do this with Consul, you can use the `consul lock` command to wrap starting Lita.

https://www.consul.io/docs/commands/lock.html
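
A minimal usage sketch (the KV prefix `lita/leader` is just an illustrative name): a supervisor on each host runs something like `consul lock lita/leader lita`. `consul lock` blocks until it acquires the lock on that prefix, runs the child command while holding it, and terminates the child if the lock is lost, so only one host runs Lita at a time.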

indirect commented 8 years ago

Probably relevant to this discussion: after solving the problem of making Lita itself HA, you'll need a data store that offers replication and failover. Sentinel is probably okay as long as you don't care about losing data as part of the failover process, and possibly ending up with multiple Redis masters: https://aphyr.com/posts/287-asynchronous-replication-with-failover. You'll need a different datastore if you need the data to stay consistent while stores fail over.