Closed shlomi-noach closed 6 years ago
Consensus implementation discussion: there are multiple ways to implement this, all of which have pros and cons. I'll list only the few that I deem likely:
- a `golang` library for leader election & gossip
- a `golang` raft library

A discussion of the merits of various approaches follows.
The `orchestrator` nodes would need to be able to elect a leader. The `orchestrator` leader will need to advertise changes to its followers. It is advisable that those advertisements enjoy quorum consensus, or else they may get lost.
Examples of changes that need to be advertised: `begin-downtime`, `begin-recovery`, `register-candidate` etc. (see initial comment).
I have successfully worked with the `hashicorp/raft` library while authoring `freno`. Change advertising can be a bit of a hassle, but otherwise this library supports the functionality we need.
Notable that while Consul uses the `hashicorp/raft` library, it does not expose the same functionality as a service.
Assuming stable leadership, we may change the way we advertise changes as follows: advertise a minimal notification such as "`begin-downtime`", and have followers call the `orchestrator` API to get more information.
K/V-wise, `etcd` and `consul` are comparable. Choosing a leader via `raft` as an external service is not as trivial (as far as I can see).
The `etcd/raft` golang library seems to me to be more complicated than the `hashicorp/raft` one; and I already have experience with `hashicorp/raft`.
Embedded `etcd` removes the dependency on a 3rd party tool; this is nice and appealing.
The `hashicorp/raft` golang library also requires setting up a store. If we wish to stick to pure `go`, this is `raft-boltdb`. However, with `sqlite` we already give up pure-`go` and use `cgo`, so we may end up using the default `raft-mdb`. Notable that there will be a store file on disk, in addition to our already existing backend DB.
Further idle thought: it should be easy enough to implement `hashicorp/raft`'s `LogStore` and `StableStore` via a relational backend. If I'm not mistaken it is almost trivial.
> it should be easy enough to implement hashicorp/raft's LogStore and StableStore via relational backend.
Since all `orchestrator` nodes run detection: should this mean all `orchestrator` nodes run detection hooks?
Tracking ops applied via raft:

- `begin-downtime` (but can be discarded if end of downtime is already in the past)
- `end-downtime`
- `begin-maintenance` (but can be discarded if end of maintenance is already in the past)
- `end-maintenance`
- `forget`
- `discover`, so that completely new instances can be shared with all `orchestrator` nodes
- `submit-pool-instances`, user-generated info mapping instances to a pool
- `register-candidate`, a user-based instruction to flag instances with promotion rules
- `ack-cluster-recoveries`
- `register-hostname-unresolve`
- `deregister-hostname-unresolve`
- Failure detection (so that we can get, if we choose to, a quorum opinion on the state of the failure)
@shlomi-noach I think this could be one of the bigger wins in moving to raft consensus. It might extend the time to make a decision on failover, but could possibly also reduce issues with identifying split-brain scenarios due to network partitions.
> I think this could be one of the bigger wins in moving to raft consensus. It might extend the time to make a decision on failover, but could possibly also reduce issues with identifying split-brain scenarios due to network partitions.
@leeparayno it is one of the major catalysts for this development; credit @grypyrg for first bringing this to my attention over a year ago.
> The etcd/raft golang library seems to me to be more complicated
etcd raft is not really complicated if you want to look into it. It is designed to be flexible and portable. It powers quite a few active and notable distributed systems: https://github.com/coreos/etcd/tree/master/raft#notable-users.
@xiang90 thank you :) it did seem to me to be more complicated to set up; right now I'm working with the `hashicorp/raft` library, which I'm already familiar with. There are some limitations to the hashicorp implementation that I see `etcd/raft` doesn't share. For example, it seems like I can transfer leadership at will with `etcd/raft`, something that I'm unable to do with `hashicorp/raft`.
From the not-so-deep look I took into the `etcd/raft` code implementation, it seemed unclear to me what I need to implement; the sample projects are large scale and I confess I did not invest the time in understanding how each of them embeds `etcd/raft`.
However, I don't want to make myself appear that lazy. It was easier to pick up the `hashicorp/raft` library in the first place because there are some small and clear sample usage repos around, where I could not find the same for `etcd/raft`; by now I have a good understanding of its use and limitations.
> the sample projects are large scale

> where I could not find the same for etcd/raft.
Check out: https://github.com/coreos/etcd/tree/master/contrib/raftexample. Most etcd/raft users started with it.
But, yeah, etcd/raft is a lower-level thing than most other raft implementations, for the reason I mentioned above. There is some effort to make it look like other raft libraries (https://github.com/compose/canoe) without losing its flexibility.
Would anyone care to review these documents?
🙇
`orchestrator/raft` has been in production for a few months now and we are happy. https://speakerdeck.com/shlominoach/orchestrator-on-raft-internals-benefits-and-considerations presents `orchestrator/raft`.
### Objective

- Cross-DC `orchestrator` deployment with consensus for failovers, mitigating fencing scenarios.
- Secondary (optional) objective: remove the MySQL backend dependency.
### Current status

At this time (release ) `orchestrator` nodes use a shared backend MySQL server for communication & leader election. The high availability of the `orchestrator` setup is composed of:

- high availability of the `orchestrator` service nodes
- high availability of the backend database

The former is easily achieved by running more nodes, all of which connect to the backend database. The latter is achieved via:
**1. Circular Master-Master replication**

- One MySQL master and one `orchestrator` service in `dc1`; one MySQL master and one `orchestrator` service in `dc2`.
- A network partition between `dc1` and `dc2` may cause both `orchestrator` services to consider themselves as leaders. Such a network partition is likely to also break the cross-DC replication that `orchestrator` is meant to tackle in the first place, leading to independent failovers by both `orchestrator` services, leading to potential chaos.

**2. Galera/XtraDB/InnoDB Cluster setup**
- `3` Galera nodes, in multi-writer mode (each node is writable)
- `3` `orchestrator` nodes; each `orchestrator` node connects to a specific Galera node
- A network partition (e.g. one isolating `dc1`) still leaves `2` connected Galera nodes, making a quorum. By nature of `orchestrator` leader election, the `orchestrator` service in `dc1`
cannot become a leader. One of the other two will.

### A different use case; issues with current design
With the existing design, one `orchestrator` node is the leader, and only the leader discovers and probes MySQL servers. There is no sense in having multiple probing `orchestrator` nodes because they all use and write to the same backend DB.

By virtue of this design, only one `orchestrator` is running failure detection. There is no sense in having multiple `orchestrator` nodes run failure detection because they all rely on the exact same dataset.

`orchestrator` uses a holistic approach to detecting failure (e.g. in order to declare master failure it consults replicas to confirm they think their master is broken, too). However, this detection only runs from a single node, and is hence susceptible to network partitioning / fencing.

If the `orchestrator` leader runs in `dc1`, and `dc1` happens to be partitioned away, the leader cannot handle failover to servers in `dc2`.

The cross-DC Galera layout, suggested above, can solve this case, since the isolated `orchestrator` node will never be the active node.

We have a use case where we not only don't want to rely on Galera, we also don't even want to rely on MySQL. We want to have a more lightweight, simpler deployment without the hassle of extra databases.

Our specific use case lights up the new design offered in this issue, bottom-to-top; but let's now observe the offered design top-to-bottom.
### `orchestrator/raft` design

The `orchestrator/raft` design suggests:

- `orchestrator` nodes communicate via `raft`.
- Each `orchestrator` node has its own private database backend server.
- Each `orchestrator` node is to handle its own private DB. There is no replication. There is no DB high availability setup.
- All `orchestrator` nodes run independent detections (each is probing the MySQL servers independently, and typically they should all end up seeing the same picture).
- `orchestrator` nodes communicate between themselves (via `raft`) changes that are not detected by probing the MySQL servers.
- All `orchestrator` nodes run failure detection. They each have their own dataset to analyze.
- A single `orchestrator` is the leader (decided by `raft` consensus).
- The leader may consult the other `orchestrator` nodes, getting a quorum approval for "do you all agree there's a failure case on this server?"

Noteworthy is that cross-`orchestrator` communication is sparse; health messages will run once per second, and other than that the messages will be mostly user-initiated input, such as `begin-downtime` or recovery steps etc. See breakdown further below.

### Implications
Since each `orchestrator` node has its own private backend DB, there's no need to sync the databases. There is no replication. Each `orchestrator` node is responsible for maintaining its own DB.

There is no specific requirement for MySQL. In fact, there's a POC running on SQLite, embedded within the `orchestrator` binary.

We get failure detection quorum. As illustrated above, multiple independent `orchestrator` nodes will each run failure analysis. If we place each `orchestrator` node in its own DC, then we have fought fencing: an `orchestrator` node in a partitioned DC is isolated, hence cannot be the leader; the remaining `orchestrator` nodes will agree on the type of failure; they will make for a quorum. One of them will be the leader, which will kick in the failover.

### Is this a simpler or a more complex setup?
An `orchestrator/raft/sqlite` setup would be a simpler setup, which does not involve provisioning MySQL servers. One would need to configure `orchestrator` with the `raft` nodes' identities, and `orchestrator` will take it from there.

An `orchestrator/raft/mysql` setup is naturally more complex than `orchestrator/raft/sqlite`, however:

- there is no need for a highly available `orchestrator` backend
- there is no need to replicate the `orchestrator` backend

### Implementation notes
All group members will run independent discovery, and so general discovery information doesn't need to be passed between `orchestrator` nodes.

The following information will need to pass between `orchestrator` nodes as group messages:

- `begin-downtime` (but can be discarded if end of downtime is already in the past)
- `end-downtime`
- `begin-maintenance` (but can be discarded if end of maintenance is already in the past)
- `end-maintenance`
- `forget`
- `discover`, so that completely new instances can be shared with all `orchestrator` nodes
- `submit-pool-instances`, user-generated info mapping instances to a pool
- `register-candidate`, a user-based instruction to flag instances with promotion rules
- `ack-cluster-recoveries`
- `register-hostname-unresolve`
- `deregister-hostname-unresolve`
The easiest setup would be to load-balance `orchestrator` nodes behind a proxy (e.g. `haproxy`), such that the proxy would only direct traffic to the active (leader) node. `isAuthorizedForAction` can be extended to reject requests on a follower.

This would not work for the `orchestrator` CLI, because that would directly talk to the DB instead of going through `raft`. While the `orchestrator` service is running, it is OK to open the SQLite DB file from another process (the `orchestrator` CLI) and read from it.

Add `/api/leader-check`, to be used by load-balancers. Leader will return `200 OK` and followers will return `404 Not Found`.
We'd need to be able to bootstrap an empty-DB node joining a quorum. For example, if one node out of `3` completely burns, there's still a quorum, but adding a newly provisioned node requires loading of all the data:

- SQLite: copy+paste the DB file
- MySQL: dump + import the `orchestrator` schema

cc @github/database-infrastructure @dbussink