Closed shlomi-noach closed 6 years ago
Consensus implementation discussion: there are multiple ways to implement this, all of which have pros and cons. I'll list only the few that I deem likely:
- a `golang` library for leader election & gossip
- a `golang` raft library

A discussion of the merits of various approaches follows.
The `orchestrator` nodes would need to be able to elect a leader. The `orchestrator` leader will need to advertise changes to its followers. It is advisable that those advertisements enjoy quorum consensus, or else they may get lost.
Examples of changes that need to be advertised: `begin-downtime`, `begin-recovery`, `register-candidate` etc. (see initial comment).
I have successfully worked with the `hashicorp/raft` library while authoring `freno`. Change advertising can be a bit of a hassle, but otherwise this library supports the functionality we need.
Notable that while Consul uses the `hashicorp/raft` library, it does not expose the same functionality as a service.
Assuming stable leadership, we may change the way we advertise changes as follows: advertise a minimal notification such as "`begin-downtime`", and have followers call the `orchestrator` API to get more information.
K/V-wise, `etcd` and `consul` are comparable. Choosing a leader via `raft` as an external service is not as trivial (as far as I can see).
The `etcd/raft` golang library seems to me to be more complicated than the `hashicorp/raft` one; and I already have experience with `hashicorp/raft`.
Embedded `etcd` removes the dependency on a 3rd party tool; this is nice and appealing.
The `hashicorp/raft` golang library also requires setting up a store. If we wish to stick to pure `go`, this is `raft-boltdb`. However, with `sqlite` we already give up pure-`go` and use `cgo`, so we may end up using the default `raft-mdb`. Notable that there will be a store file on disk, in addition to our already existing backend DB.
Further idle thought: it should be easy enough to implement `hashicorp/raft`'s `LogStore` and `StableStore` via a relational backend. If I'm not mistaken it is almost trivial.
> it should be easy enough to implement hashicorp/raft's LogStore and StableStore via relational backend.
Since all `orchestrator` nodes run detection: should this mean all `orchestrator` nodes run detection hooks?
Tracking ops applied via raft:

- `begin-downtime` (but can be discarded if end of downtime is already in the past)
- `end-downtime`
- `begin-maintenance` (but can be discarded if end of maintenance is already in the past)
- `end-maintenance`
- `forget`
- `discover`, so that completely new instances can be shared with all `orchestrator` nodes
- `submit-pool-instances`, user-generated info mapping instances to a pool
- `register-candidate`, a user-based instruction to flag instances with promotion rules
- `ack-cluster-recoveries`
- `register-hostname-unresolve`
- `deregister-hostname-unresolve`
- Failure detection (so that we can get, if we choose to, a quorum opinion on the state of the failure)
@shlomi-noach I think this could be one of the bigger wins in moving to raft consensus. It might extend the time to make a decision on failover, but could possibly also reduce issues with identifying split-brain scenarios due to network partitions.
> I think this could be one of the bigger wins in moving to raft consensus. It might extend the time to make a decision on failover, but could possibly also reduce issues with identifying split-brain scenarios due to network partitions.
@leeparayno it is one of the major catalysts for this development; credit @grypyrg for first bringing this to my attention over a year ago.
> The etcd/raft golang library seems to me to be more complicated
etcd raft is not really complicated if you want to look into it. It is designed to be flexible and portable. It powers quite a few active and notable distributed systems: https://github.com/coreos/etcd/tree/master/raft#notable-users.
@xiang90 thank you :) it did seem to me to be more complicated to set up; right now I'm working with the `hashicorp/raft` library, which I'm already familiar with. There are some limitations to the hashicorp implementation that I see `etcd/raft` doesn't share. For example, it seems like I can transfer leadership at will with `etcd/raft`, something that I'm unable to do with `hashicorp/raft`.
From the not-so-deep look I took into the `etcd/raft` code implementation, it seemed unclear to me what I need to implement; the sample projects are large scale and I confess I did not invest the time in understanding how each of them embeds `etcd/raft`.
However, I don't want to make myself appear that lazy. It was easier to pick up the `hashicorp/raft` library in the first place because there are some small and clear sample usage repos around, where I could not find the same for `etcd/raft`; by now I have a good understanding of its use and limitations.
> the sample projects are large scale

> where I could not find the same for etcd/raft.
Check out: https://github.com/coreos/etcd/tree/master/contrib/raftexample. Most etcd/raft users started with it.
But, yeah, etcd/raft is a lower-level thing than most other raft implementations, for the reason I mentioned above. There is some effort to make it look like other raft libraries (https://github.com/compose/canoe) without losing its flexibility.
Would anyone care to review these documents?
🙇
`orchestrator/raft` has been in production for a few months now and we are happy. https://speakerdeck.com/shlominoach/orchestrator-on-raft-internals-benefits-and-considerations presents `orchestrator/raft`.
### Objective

- Cross-DC `orchestrator` deployment with consensus for failovers, mitigating fencing scenarios.
- Secondary (optional) objective: remove the MySQL backend dependency.
### Current status

At this time (release ) `orchestrator` nodes use a shared backend MySQL server for communication & leader election. The high availability of the `orchestrator` setup is composed of:

- high availability of the `orchestrator` service nodes
- high availability of the backend database

The former is easily achieved by running more nodes, all of which connect to the backend database. The latter is achieved via:
**1. Circular Master-Master replication**

- One MySQL master and one `orchestrator` service in `dc1`; one MySQL master and one `orchestrator` service in `dc2`.
- A network partition between `dc1` and `dc2` may cause both `orchestrator` services to consider themselves as leaders. Such a network partition is likely to also break the cross-DC replication that `orchestrator` is meant to tackle in the first place, leading to independent failovers by both `orchestrator` services, leading to potential chaos.

**2. Galera/XtraDB/InnoDB Cluster setup**
- `3` Galera nodes, in multi-writer mode (each node is writable)
- `3` `orchestrator` nodes; each `orchestrator` node connects to a specific Galera node
- A network partition (e.g. one isolating `dc1`) still leaves `2` connected Galera nodes, making a quorum. By nature of `orchestrator` leader election, the `orchestrator` service in `dc1`
cannot become a leader. One of the other two will.

### A different use case; issues with current design
With the existing design, one `orchestrator` node is the leader, and only the leader discovers and probes MySQL servers. There is no sense in having multiple probing `orchestrator` nodes because they all use and write to the same backend DB.

By virtue of this design, only one `orchestrator` is running failure detection. There is no sense in having multiple `orchestrator` nodes run failure detection because they all rely on the exact same dataset.

`orchestrator` uses a holistic approach to detecting failure (e.g. in order to declare master failure it consults replicas to confirm they think their master is broken, too). However, this detection only runs from a single node, and is hence susceptible to network partitioning / fencing.

If the `orchestrator` leader runs in `dc1`, and `dc1` happens to be partitioned away, the leader cannot handle failover to servers in `dc2`.

The cross-DC Galera layout, suggested above, can solve this case, since the isolated `orchestrator` node will never be the active node.

We have a use case where we not only don't want to rely on Galera, we also don't even want to rely on MySQL. We want to have a more lightweight, simpler deployment without the hassle of extra databases.

Our specific use case lights up the new design offered in this issue, bottom-to-top; but let's now observe the offered design top-to-bottom.
### `orchestrator/raft` design

The `orchestrator/raft` design suggests:

- `orchestrator` nodes communicate via `raft`.
- Each `orchestrator` node has its own private database backend server.
- Each `orchestrator` node is to handle its own private DB. There is no replication. There is no DB high availability setup.
- All `orchestrator` nodes run independent detections (each is probing the MySQL servers independently, and typically they should all end up seeing the same picture).
- `orchestrator` nodes communicate between themselves (via `raft`) changes that are not detected by probing the MySQL servers.
- All `orchestrator` nodes run failure detection. They each have their own dataset to analyze.
- A single `orchestrator` is the leader (decided by `raft` consensus).
- The leader may consult the other `orchestrator` nodes, getting a quorum approval for "do you all agree there's a failure case on this server?"

Noteworthy is that cross-`orchestrator` communication is sparse; health messages will run once per second, and other than that the messages will be mostly user-initiated input, such as `begin-downtime` or recovery steps etc. See breakdown further below.

### Implications
Since each `orchestrator` node has its own private backend DB, there's no need to sync the databases. There is no replication. Each `orchestrator` node is responsible for maintaining its own DB.

There is no specific requirement for MySQL. In fact, there's a POC running on SQLite, embedded within the `orchestrator` binary.

We get failure detection quorum. As illustrated above, multiple independent `orchestrator` nodes will each run failure analysis. If we place each `orchestrator` node in its own DC, then we have fought fencing: an `orchestrator` node in a partitioned DC is isolated, hence cannot be the leader; the remaining `orchestrator` nodes will agree on the type of failure; they will make for a quorum. One of them will be the leader, which will kick in the failover.

### Is this a simpler or a more complex setup?
An `orchestrator/raft/sqlite` setup would be a simpler setup, which does not involve provisioning MySQL servers. One would need to configure `orchestrator` with the `raft` nodes' identities, and `orchestrator` will take it from there.

An `orchestrator/raft/mysql` setup is naturally more complex than `orchestrator/raft/sqlite`, however:

- there is no need for a highly available `orchestrator` backend
- there is no need to replicate the `orchestrator` backend

### Implementation notes
All group members will run independent discovery, and so general discovery information doesn't need to be passed between `orchestrator` nodes.

The following information will need to pass between `orchestrator` nodes as group messages:

- `begin-downtime` (but can be discarded if end of downtime is already in the past)
- `end-downtime`
- `begin-maintenance` (but can be discarded if end of maintenance is already in the past)
- `end-maintenance`
- `forget`
- `discover`, so that completely new instances can be shared with all `orchestrator` nodes
- `submit-pool-instances`, user-generated info mapping instances to a pool
- `register-candidate`, a user-based instruction to flag instances with promotion rules
- `ack-cluster-recoveries`
- `register-hostname-unresolve`
- `deregister-hostname-unresolve`
The easiest setup would be to load-balance `orchestrator` nodes behind a proxy (e.g. `haproxy`), such that the proxy would only direct traffic to the active (leader) node. `isAuthorizedForAction` can be extended to reject requests on a follower.

This would not work for the `orchestrator` CLI, because that would directly talk to the DB instead of going through `raft`. While the `orchestrator` service is running, it is OK to open the SQLite DB file from another process (the `orchestrator` CLI) and read from it.

Add `/api/leader-check`, to be used by load-balancers. Leader will return `200 OK` and followers will return `404 Not Found`.
We'd need to be able to bootstrap an empty-DB node joining a quorum. For example, if one node out of `3` completely burns, there's still a quorum, but adding a newly provisioned node requires loading of all the data:

- SQLite: copy+paste the DB file
- MySQL: dump + import the `orchestrator` schema

cc @github/database-infrastructure @dbussink