openark / orchestrator

MySQL replication topology management and HA

orchestrator-raft: next steps. #246

Open shlomi-noach opened 7 years ago

shlomi-noach commented 7 years ago

The orchestrator-raft PR will soon be merged. It has been out in the open for a few months now.

Noteworthy docs are:

There's already an interesting list of (mostly operational) issues/enhancements to orchestrator/raft that would not make it into #183. The next comment will list some of those enhancements, and I will add more comments as time goes by. Once I've reviewed all potential enhancements I'll create an Issue per enhancement.

shlomi-noach commented 7 years ago

thanks @renecannao @dbussink @sjmudd

derekperkins commented 6 years ago

@shlomi-noach We chatted about this briefly on Slack, but I wanted to bring the conversation here. I think that https://serf.io would pair well with Raft (both Hashicorp products) to solve the dynamic group problem. It lets you run hooks on membership events, and a node can join the group by simply contacting any existing member; Serf handles all the membership propagation.

This keeps Orchestrator standalone, without needing to add an extra service like Consul for discovery. It would also significantly simplify the Kubernetes installation, since static IPs wouldn't need to be assigned per Orchestrator node at config time.

shlomi-noach commented 6 years ago

@derekperkins Will this pair with Raft or replace Raft? Can you please elaborate?

derekperkins commented 6 years ago

This would pair with raft, not replace it. All that Serf manages is membership, so it would allow you to dynamically populate the IP addresses that are currently specified in RaftNodes in the config. Serf would provide events for nodes entering/leaving, which you would then use to add/remove entries in RaftNodes.
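
For illustration only (this is not existing orchestrator code): a minimal sketch of what such a Serf event loop could look like, assuming a hypothetical `updateRaftNodes` callback that rewrites the RaftNodes list. The Serf calls are from `github.com/hashicorp/serf/serf`.

```go
package main

import (
	"log"

	"github.com/hashicorp/serf/serf"
)

// watchMembership feeds Serf membership changes into a hypothetical
// updateRaftNodes callback that would rewrite the RaftNodes list.
func watchMembership(updateRaftNodes func(addrs []string)) error {
	conf := serf.DefaultConfig()
	events := make(chan serf.Event, 16)
	conf.EventCh = events

	s, err := serf.Create(conf)
	if err != nil {
		return err
	}

	for e := range events {
		switch e.EventType() {
		case serf.EventMemberJoin, serf.EventMemberLeave, serf.EventMemberFailed:
			// Rebuild the node list from Serf's current view of the cluster.
			var addrs []string
			for _, m := range s.Members() {
				if m.Status == serf.StatusAlive {
					addrs = append(addrs, m.Addr.String())
				}
			}
			log.Printf("membership changed: %v", addrs)
			updateRaftNodes(addrs)
		}
	}
	return nil
}
```

A joining node would only need `s.Join([]string{"any-existing-member"}, false)`; Serf gossips the change to everyone else.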

shlomi-noach commented 6 years ago

How will orchestrator differentiate between a node leaving and a node dying? Would a leaving node necessarily send a "good bye" message? And if it fails to send one for some reason, should it be considered dead? Will Kubernetes taking down a Pod give such nodes a chance to say good bye? Can I trust that?

If not, then the dynamic nature of the group is IMHO unsolvable. Consider a group of three. Two nodes joined and two are now gone. Did the two gone nodes die, leaving us at quorum 3/5, such that a single additional node dying would take us down? Or did they leave, putting us at 3/3, such that a single node dying would still leave a happy quorum of 2/3?
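
To make the arithmetic behind that ambiguity explicit (Raft quorum is ⌊n/2⌋+1 for a group of size n), a tiny illustrative snippet:

```go
package main

import "fmt"

// quorum is the minimum number of live members a Raft group of size n
// needs in order to make progress.
func quorum(n int) int { return n/2 + 1 }

func main() {
	live := 3 // three members are currently reachable

	// If the two missing members died, the group size is still 5:
	// 3 live == quorum(5), so one more failure loses quorum.
	fmt.Println("size 5: live", live, "quorum", quorum(5))

	// If the two missing members left, the group size is 3:
	// 3 live > quorum(3), so one more failure still leaves a quorum of 2.
	fmt.Println("size 3: live", live, "quorum", quorum(3))
}
```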

derekperkins commented 6 years ago

cc @enisoc

derekperkins commented 6 years ago

That's a good point. I believe we could hook into the Kubernetes event system to notify Orchestrator on scaling events. Alternatively, it would be simple to poll the StatefulSet / ReplicaSet to see how many replicas are configured. Orchestrator would then just have to expose an endpoint for setting the total node count, which would determine quorum size.
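
As an illustration of the polling idea, a sketch using a recent client-go, assuming orchestrator runs in-cluster and that both the namespace and the StatefulSet are named "orchestrator" (placeholder names):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// configuredReplicas reads the StatefulSet's desired replica count,
// i.e. the expected raft group size.
func configuredReplicas(ctx context.Context) (int32, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return 0, err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return 0, err
	}
	sts, err := client.AppsV1().StatefulSets("orchestrator").Get(ctx, "orchestrator", metav1.GetOptions{})
	if err != nil {
		return 0, err
	}
	if sts.Spec.Replicas == nil {
		return 1, nil // Kubernetes defaults replicas to 1 when unset
	}
	return *sts.Spec.Replicas, nil
}

func main() {
	n, err := configuredReplicas(context.Background())
	if err != nil {
		panic(err)
	}
	fmt.Printf("expected group size: %d\n", n)
}
```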

If that's required, it raises the question of whether Serf is necessary at all. Orchestrator could stay agnostic and just provide add/drop endpoints, or a full RaftNodes replacement endpoint, and any user of Orchestrator would be responsible for maintaining that list. It would be fully backwards compatible, since existing users would simply never update the list, but it would enable dynamic clusters for people willing to set things up.
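
A hedged sketch of what a full RaftNodes replacement endpoint could look like; the `/api/raft-nodes` path and the `applyRaftNodes` callback are hypothetical, not an existing orchestrator API:

```go
package main

import (
	"encoding/json"
	"net/http"
)

// raftNodesHandler accepts a JSON array of node addresses and hands it to a
// hypothetical applyRaftNodes callback that reconciles raft membership.
func raftNodesHandler(applyRaftNodes func([]string) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "POST only", http.StatusMethodNotAllowed)
			return
		}
		var nodes []string
		if err := json.NewDecoder(r.Body).Decode(&nodes); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// The caller (a Kubernetes hook, an external script, a human) owns
		// the list; orchestrator only reconciles membership against it.
		if err := applyRaftNodes(nodes); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusNoContent)
	}
}

// Usage: http.Handle("/api/raft-nodes", raftNodesHandler(apply))
```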

shlomi-noach commented 6 years ago

This introduces further complications. When you add/remove nodes, you may only communicate that to the leader. You need to announce to the leader: "this node has left".

(BTW this will require an upgrade of the raft library, but that's beside the point.)

And will Kubernetes or your external tool orchestrate that well? What if the leader itself is one of the nodes to leave? It will take a few seconds for a new leader to step up. Will you initiate the step-down yourself (not supported by hashicorp/raft; my own addition) first, then wait for a new leader to step up, then communicate to it the removal of the previous leader? Or would you just let the leader leave, wait a few seconds, look for the new leader, and then advertise the fact that others have left?
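
For reference, a sketch of that dance as it might look with a recent hashicorp/raft (the thread notes the library version used at the time lacked step-down, so this is illustrative only):

```go
package main

import (
	"time"

	"github.com/hashicorp/raft"
)

// removeNode applies a membership change, which only the leader may do.
// If the node being removed is the leader itself, leadership is handed
// off first and the new leader is expected to perform the removal.
func removeNode(r *raft.Raft, selfID, removeID raft.ServerID) error {
	if r.State() != raft.Leader {
		return raft.ErrNotLeader
	}
	if removeID == selfID {
		if err := r.LeadershipTransfer().Error(); err != nil {
			return err
		}
		return raft.ErrNotLeader // retry the removal against the new leader
	}
	return r.RemoveServer(removeID, 0, 10*time.Second).Error()
}
```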

Another question which leaves me confused is how you would bootstrap this in the first place. The first node to run -- will it be the leader of a single-node group? (You must enable that.) Which of the nodes will it be, as you bootstrap a new cluster? How do you then disable the option of being the leader of a single-node group? (Because you don't want that, and at this time it seems to me like something that is fixed and cannot be changed dynamically.)
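
For reference on the bootstrap question: a recent hashicorp/raft exposes BootstrapCluster, which is meant to run exactly once, on the very first node of a brand-new group, which is precisely the "which node, and how do you turn it off afterwards" problem. A sketch:

```go
package main

import "github.com/hashicorp/raft"

// bootstrapFirstNode turns a pristine node into a single-member group that
// elects itself leader. BootstrapCluster returns an error if the node
// already has state, so it only applies to the very first node of a
// brand-new cluster.
func bootstrapFirstNode(r *raft.Raft, selfID raft.ServerID, selfAddr raft.ServerAddress) error {
	cfg := raft.Configuration{
		Servers: []raft.Server{
			{ID: selfID, Address: selfAddr},
		},
	}
	return r.BootstrapCluster(cfg).Error()
}
```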

I'm seeing a lot of open questions here, which get moved from one place to another.

I'm wondering, can we look at something that works similarly? Do you have a Consul setup where Kubernetes is free to remove and add nodes? Do you have an external script to run those change updates? Does that work? If we can mimic something that already exists, that would be best.

derekperkins commented 6 years ago

cc @bbeaudreault @acharis

shlomi-noach commented 6 years ago

I'm going to chat later with someone who has more experience with distributed systems, who may be able to clarify some points.

shlomi-noach commented 6 years ago

I had a discussion with @oritwas, who has worked on and still works on distributed systems, and who has implemented Paxos in the past. Disclosure: my partner.

She was able to present further failure and split-brain scenarios with dynamic consensus groups.

In her experience, going static was the correct approach, possibly hiding host changes behind a proxy/DNS (which is what ClusterIP does). Pursuing closing the gaps of dynamic membership complicated the algorithm and code to the extent that it was not worthwhile to write or feasible to maintain, especially at small group sizes, where a rolling restart proves simpler and more reliable.

This would be especially true with orchestrator being a supporting part of the infrastructure: having orchestrator leaderless or completely down for a few seconds only imposes a potential delay in recovering a failure, with no risk of missing anything.

enisoc commented 6 years ago

I'm wondering, can we look at something that works similarly? Do you have a consul setup where kubernetes is free to remove and add nodes?

I don't know how they do it, or what guarantees they make, but etcd-operator is able to dynamically manage members in an etcd cluster:

https://github.com/coreos/etcd-operator#resize-an-etcd-cluster

shlomi-noach commented 6 years ago

Had a discussion with @lefred from Oracle about InnoDB Cluster's group communication. It seems there's a safe way to add/remove nodes:

I need to dig into this.