openark / orchestrator

MySQL replication topology management and HA
Apache License 2.0

Design & RFC: orchestrator on raft #175

Closed shlomi-noach closed 6 years ago

shlomi-noach commented 7 years ago

Objective

Cross-DC orchestrator deployment with consensus for failovers, mitigating fencing scenarios.

Secondary (optional) objective: remove the MySQL backend dependency.

Current status

At this time (release ) orchestrator nodes use a shared backend MySQL server for communication & leader election.

The high availability of the orchestrator setup is composed of:

1. High availability of the orchestrator service itself

2. High availability of the backend database

The former is easily achieved by running more nodes, all of which connect to the backend database.

The latter is achieved via:

1. Circular Master-Master replication

2. Galera/XtraDB/InnoDB Cluster setup

A different use case; issues with current design

With the existing design, one orchestrator node is the leader, and only the leader discovers and probes MySQL servers. There is no sense in having multiple probing orchestrator nodes, because they all use and write to the same backend DB.

By virtue of this design, only one orchestrator node is running failure detection. There is no sense in having multiple orchestrator nodes run failure detection, because they all rely on the exact same dataset.

orchestrator uses a holistic approach to detecting failure (e.g. in order to declare master failure it consults the replicas to confirm they think their master is broken, too). However, this detection only runs from a single node, and is hence susceptible to network partitioning / fencing.
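To illustrate the idea, here is a minimal sketch of the holistic check (type and function names are hypothetical, not orchestrator's actual code):

```go
package detection

// Replica is a hypothetical, simplified view of a replica's replication state.
type Replica struct {
	IOThreadRunning bool // is the replica's IO thread still connected to the master?
}

// masterSeemsDead sketches the holistic check: the probing node cannot reach
// the master, AND every replica agrees its link to the master is broken.
// If any replica can still talk to the master, this is more likely a network
// partition between the probe and the master than a dead master.
func masterSeemsDead(masterReachable bool, replicas []Replica) bool {
	if masterReachable {
		return false
	}
	for _, r := range replicas {
		if r.IOThreadRunning {
			return false
		}
	}
	return len(replicas) > 0
}
```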

If the orchestrator leader runs on dc1, and dc1 happens to be partitioned away, the leader cannot handle failover to servers in dc2.

The cross-DC Galera layout, suggested above, can solve this case, since the isolated orchestrator node will never be the active node.

We have a use case where not only do we not want to rely on Galera, we don't want to rely on MySQL at all. We want a more lightweight, simpler deployment, without the hassle of extra databases.

Our specific use case inspired the design offered in this issue from the bottom up; but let's now observe the offered design from the top down.

orchestrator/raft design

The orchestrator/raft design suggests:

1. orchestrator nodes form a raft cluster and elect a leader among themselves

2. Each orchestrator node uses its own dedicated backend database; there is no shared backend

3. All orchestrator nodes probe the topologies and run failure detection

4. Only the leader runs recoveries

5. The leader advertises changes (downtime, recovery steps, etc.) to its followers

Noteworthy is that cross-orchestrator communication is sparse: health messages will run once per second, and other than that, the messages will be mostly user-initiated input, such as begin-downtime or recovery steps etc. See breakdown further below.

Implications

Is this a simpler or a more complex setup?

An orchestrator/raft/sqlite setup would be a simpler one, which does not involve provisioning MySQL servers. One would need to configure orchestrator with the raft node identities, and orchestrator takes it from there.
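For illustration, such a configuration might look like the following (a hypothetical sketch of an orchestrator/raft/sqlite config; the key names are illustrative rather than a definitive reference):

```json
{
  "RaftEnabled": true,
  "RaftBind": "192.168.0.1",
  "RaftDataDir": "/var/lib/orchestrator",
  "RaftNodes": ["192.168.0.1", "192.168.0.2", "192.168.0.3"],
  "BackendDB": "sqlite",
  "SQLite3DataFile": "/var/lib/orchestrator/orchestrator.db"
}
```

Each node would list the same raft members, and otherwise manage its own private, local SQLite backend.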

An orchestrator/raft/mysql setup is naturally more complex than orchestrator/raft/sqlite; however:

Implementation notes

cc @github/database-infrastructure @dbussink

shlomi-noach commented 7 years ago

Consensus implementation discussion: there are multiple ways to implement this, all of which have pros and cons. I'll list only the few that I deem likely:

1. Using the hashicorp/raft library, embedded in orchestrator

2. Using the etcd/raft library, or embedding etcd itself

3. Relying on an external service, such as etcd or consul, for leader election and K/V storage

A discussion of the merits of various approaches follows.

Some considerations

The orchestrator nodes would need to be able to elect a leader.

The orchestrator leader will need to advertise changes to its followers. It is advisable that those advertisements enjoy a quorum consensus, or else they may get lost.

Examples of changes that need to be advertised: begin-downtime, begin recovery, register-candidate, etc. (see initial comment).
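As a sketch of how quorum-acknowledged advertisement could work with the hashicorp/raft library (the command envelope and FSM below are hypothetical, not orchestrator's actual implementation):

```go
package orcraft

import (
	"encoding/json"
	"io"
	"time"

	"github.com/hashicorp/raft"
)

// command is a hypothetical envelope for ops advertised through the raft log,
// e.g. begin-downtime or register-candidate.
type command struct {
	Op   string          `json:"op"`
	Data json.RawMessage `json:"data"`
}

// fsm applies quorum-committed commands on every node, leader and followers alike.
type fsm struct{}

func (f *fsm) Apply(l *raft.Log) interface{} {
	var c command
	if err := json.Unmarshal(l.Data, &c); err != nil {
		return err
	}
	// switch on c.Op and apply the change to the local backend DB:
	// "begin-downtime", "register-candidate", recovery steps, etc.
	return nil
}

// Snapshot/Restore are required by raft.FSM; omitted in this sketch.
func (f *fsm) Snapshot() (raft.FSMSnapshot, error) { return nil, nil }
func (f *fsm) Restore(rc io.ReadCloser) error      { return rc.Close() }

// advertise runs on the leader; Apply returns successfully only once a quorum
// of nodes has acknowledged the log entry, so advertised changes cannot get lost.
func advertise(r *raft.Raft, c command) error {
	data, err := json.Marshal(c)
	if err != nil {
		return err
	}
	return r.Apply(data, 10*time.Second).Error()
}
```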

Some perceptions and idle thoughts

I have successfully worked with the hashicorp/raft library while authoring freno. Change advertising can be a bit of a hassle, but otherwise this library supports:

1. Leader election

2. Quorum-acknowledged replication of changes to followers (the raft log)

3. Snapshotting

Notable that while Consul uses the hashicorp/raft library internally, it does not expose the same functionality as a service.

Assuming stable leadership, we may change the way we advertise changes as follows:

K/V-wise, etcd and consul are comparable.

Choosing a leader via raft as an external service is not as trivial (as far as I can see).

The etcd/raft golang library seems to me to be more complicated than the hashicorp/raft one; and I already have experience with hashicorp/raft.

Embedded etcd removes the dependency on a 3rd party tool; this is nice and appealing.

The hashicorp/raft golang library also requires setting up a store. If we wish to stick to pure Go, this is raft-boltdb. However, with sqlite we already give up pure Go and use cgo, so we may end up using the default raft-mdb. Notable that there will be a store file on disk, in addition to our already existing backend DB.
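For example, wiring up hashicorp/raft with the raft-boltdb store might look roughly like this (a sketch; the paths and node identity are illustrative, and the API shown is that of recent versions of the library):

```go
package orcraft

import (
	"os"

	"github.com/hashicorp/raft"
	raftboltdb "github.com/hashicorp/raft-boltdb"
)

func setupRaft(fsm raft.FSM, trans raft.Transport) (*raft.Raft, error) {
	conf := raft.DefaultConfig()
	conf.LocalID = raft.ServerID("node1") // illustrative node identity

	// BoltStore implements both LogStore and StableStore; note this is an
	// extra file on disk, beside the already existing backend DB.
	store, err := raftboltdb.NewBoltStore("/var/lib/orchestrator/raft.db")
	if err != nil {
		return nil, err
	}
	snaps, err := raft.NewFileSnapshotStore("/var/lib/orchestrator", 2, os.Stderr)
	if err != nil {
		return nil, err
	}
	return raft.NewRaft(conf, fsm, store, store, snaps, trans)
}
```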

shlomi-noach commented 7 years ago

Further idle thought: it should be easy enough to implement hashicorp/raft's LogStore and StableStore via a relational backend. If I'm not mistaken, it is almost trivial.
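For illustration, a relational StableStore could look roughly as follows (a minimal sketch; the table name and schema are hypothetical, and the actual implementation is linked in the next comment):

```go
package orcraft

import (
	"database/sql"
	"encoding/binary"
	"errors"
)

// relStableStore is a hypothetical hashicorp/raft StableStore backed by a
// relational table:
//   CREATE TABLE raft_store (store_key VARBINARY(512) PRIMARY KEY, store_value BLOB)
type relStableStore struct {
	db *sql.DB
}

func (s *relStableStore) Set(key, val []byte) error {
	// REPLACE INTO is supported by both MySQL and SQLite
	_, err := s.db.Exec(`REPLACE INTO raft_store (store_key, store_value) VALUES (?, ?)`, key, val)
	return err
}

func (s *relStableStore) Get(key []byte) ([]byte, error) {
	var val []byte
	err := s.db.QueryRow(`SELECT store_value FROM raft_store WHERE store_key = ?`, key).Scan(&val)
	if err == sql.ErrNoRows {
		// hashicorp/raft treats a "not found" error as an unset key
		return nil, errors.New("not found")
	}
	return val, err
}

func (s *relStableStore) SetUint64(key []byte, val uint64) error {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, val)
	return s.Set(key, buf)
}

func (s *relStableStore) GetUint64(key []byte) (uint64, error) {
	val, err := s.Get(key)
	if err != nil {
		return 0, err
	}
	return binary.BigEndian.Uint64(val), nil
}
```

The LogStore interface (FirstIndex, LastIndex, GetLog, StoreLog, StoreLogs, DeleteRange) maps onto a second table in much the same way.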

shlomi-noach commented 7 years ago

it should be easy enough to implement hashicorp/raft's LogStore and StableStore via relational backend.

it was: https://github.com/github/orchestrator/blob/972360609961bd66906ed1c604890d60ce827c79/go/raft/rel_store.go

shlomi-noach commented 7 years ago

Since all orchestrator nodes run detection: should this mean all orchestrator nodes run detection hooks?

shlomi-noach commented 7 years ago

Tracking ops applied via raft:

leeparayno commented 7 years ago

Failure detection (so that we can get, if we choose to, a quorum opinion on the state of the failure)

@shlomi-noach I think this could be one of the bigger wins in moving to raft consensus. It might extend the time to make a decision on failover, but could possibly also reduce issues with identifying split-brain scenarios due to network partitions.

shlomi-noach commented 7 years ago

I think this could be one of the bigger wins in moving to raft consensus. It might extend the time to make a decision on failover, but could possibly also reduce issues with identifying split-brain scenarios due to network partitions.

@leeparayno it is one of the major catalysts for this development; credit @grypyrg for first bringing this to my attention over a year ago.

xiang90 commented 7 years ago

The etcd/raft golang library seems to me to be more complicated

etcd raft is not really complicated if you want to look into it. It is designed in a way that is flexible and portable. It powers quite a few active and notable distributed systems: https://github.com/coreos/etcd/tree/master/raft#notable-users.

shlomi-noach commented 7 years ago

@xiang90 thank you :) it did seem to me to be more complicated to set up; right now I'm working with the hashicorp/raft library which I'm already familiar with. There are some limitations to the hashicorp implementation that I see etcd/raft doesn't share. For example, it seems like I can transfer leadership at will with etcd/raft, something that I'm unable to do with hashicorp/raft.
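For reference, etcd/raft exposes this capability directly on its Node interface; a minimal sketch (member IDs are illustrative):

```go
package orcraft

import (
	"context"

	"github.com/coreos/etcd/raft" // etcd/raft's import path at the time of this discussion
)

// transferLeadership asks the cluster to hand leadership from lead over to
// transferee; both are etcd/raft member IDs.
func transferLeadership(node raft.Node, lead, transferee uint64) {
	node.TransferLeadership(context.Background(), lead, transferee)
}
```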

From the not-so-deep look I took into the etcd/raft code, it seemed unclear to me what I would need to implement; the sample projects are large scale, and I confess I did not invest the time in understanding how each of them embeds etcd/raft.

However, I don't want to make myself appear that lazy. It was easier to pick up the hashicorp/raft library in the first place because there are some small and clear sample-usage repos around, whereas I could not find the same for etcd/raft; by now I have a good understanding of its use and limitations.

xiang90 commented 7 years ago

the sample projects are large scale

where I could not find the same for etcd/raft.

Check out https://github.com/coreos/etcd/tree/master/contrib/raftexample. Most etcd/raft users started with it.

But, yea, etcd/raft is a lower-level thing than most other raft implementations, for the reason I mentioned above. There is some effort to make it look like other raft implementations (https://github.com/compose/canoe) without losing its flexibility.

shlomi-noach commented 7 years ago

Would anyone care to review these documents?

🙇

shlomi-noach commented 6 years ago

orchestrator/raft has been in production for a few months now and we are happy. https://speakerdeck.com/shlominoach/orchestrator-on-raft-internals-benefits-and-considerations presents orchestrator/raft.