openark / orchestrator

MySQL replication topology management and HA

orchestrator-raft: next steps. #246

Open shlomi-noach opened 7 years ago

shlomi-noach commented 7 years ago

The orchestrator-raft PR will soon be merged. It has been out in the open for a few months now.

Noteworthy docs are:

There's already an interesting list of (mostly operational) issues/enhancements to orchestrator/raft that would not make it into #183. The next comment will list some of those enhancements, and I will add more comments as time goes by. Once I've reviewed all potential enhancements I'll create an Issue per enhancement.

shlomi-noach commented 7 years ago

thanks @renecannao @dbussink @sjmudd

derekperkins commented 6 years ago

@shlomi-noach We chatted about this briefly on Slack, but I wanted to bring the conversation here. I think that https://serf.io would pair well with Raft (both Hashicorp products) to solve the dynamic group problem. It lets you run hooks on membership events, and a node can join the group by simply contacting any existing member; Serf handles all the membership propagation.

This keeps Orchestrator standalone, without needing to add an extra service like Consul for discovery. It would also significantly simplify the Kubernetes installation, since static IPs wouldn't need to be assigned per Orchestrator node at config time.

shlomi-noach commented 6 years ago

@derekperkins Will this pair with Raft or replace Raft? Can you please elaborate?

derekperkins commented 6 years ago

This would pair with raft, not replace it. All that Serf manages is membership, so it would allow you to dynamically populate the IP addresses that are currently specified in RaftNodes in the config. Serf would provide events for nodes entering/leaving, which you would then use to add/remove entries in RaftNodes.
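
For illustration only (this is not existing orchestrator code): a minimal sketch of what such a Serf event loop could look like, assuming a hypothetical `updateRaftNodes` callback that rewrites the RaftNodes list. The Serf calls are from `github.com/hashicorp/serf/serf`.

```go
package main

import (
	"log"

	"github.com/hashicorp/serf/serf"
)

// watchMembership feeds Serf membership changes into a hypothetical
// updateRaftNodes callback that would rewrite the RaftNodes list.
func watchMembership(updateRaftNodes func(addrs []string)) error {
	conf := serf.DefaultConfig()
	events := make(chan serf.Event, 16)
	conf.EventCh = events

	s, err := serf.Create(conf)
	if err != nil {
		return err
	}

	for e := range events {
		switch e.EventType() {
		case serf.EventMemberJoin, serf.EventMemberLeave, serf.EventMemberFailed:
			// Rebuild the node list from Serf's current view of the cluster.
			var addrs []string
			for _, m := range s.Members() {
				if m.Status == serf.StatusAlive {
					addrs = append(addrs, m.Addr.String())
				}
			}
			log.Printf("membership changed: %v", addrs)
			updateRaftNodes(addrs)
		}
	}
	return nil
}
```

A joining node would only need `s.Join([]string{"any-existing-member"}, false)`; Serf gossips the change to everyone else.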

shlomi-noach commented 6 years ago

How will orchestrator differentiate between a node leaving and a node dying? Would a leaving node necessarily send a "good bye" message? And if it fails to send one for some reason, should it be considered dead? Will Kubernetes taking down a Pod give such nodes a chance to say good bye? Can I trust that?

If not, then the dynamic nature of the group is IMHO unsolvable. Consider a group of three. Two nodes joined and two are now gone. Did the two gone nodes die, leaving us at quorum 3/5, such that a single additional node dying would take us down? Or did they leave, putting us at 3/3, such that a single node dying would still leave a happy quorum of 2/3?
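
To make the arithmetic behind that ambiguity explicit (Raft quorum is ⌊n/2⌋+1 for a group of size n), a tiny illustrative snippet:

```go
package main

import "fmt"

// quorum is the minimum number of live members a Raft group of size n
// needs in order to make progress.
func quorum(n int) int { return n/2 + 1 }

func main() {
	live := 3 // three members are currently reachable

	// If the two missing members died, the group size is still 5:
	// 3 live == quorum(5), so one more failure loses quorum.
	fmt.Println("size 5: live", live, "quorum", quorum(5))

	// If the two missing members left, the group size is 3:
	// 3 live > quorum(3), so one more failure still leaves a quorum of 2.
	fmt.Println("size 3: live", live, "quorum", quorum(3))
}
```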

derekperkins commented 6 years ago

cc @enisoc

derekperkins commented 6 years ago

That's a good point. I believe we could hook into the Kubernetes event system to notify Orchestrator on scaling events. Alternatively, it would be simple to poll the StatefulSet / ReplicaSet to see how many replicas are configured. Orchestrator would then just have to expose an endpoint for setting the total node count, which would determine quorum size.
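
As an illustration of the polling idea, a sketch using a recent client-go, assuming orchestrator runs in-cluster and that both the namespace and the StatefulSet are named "orchestrator" (placeholder names):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// configuredReplicas reads the StatefulSet's desired replica count,
// i.e. the expected raft group size.
func configuredReplicas(ctx context.Context) (int32, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return 0, err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return 0, err
	}
	sts, err := client.AppsV1().StatefulSets("orchestrator").Get(ctx, "orchestrator", metav1.GetOptions{})
	if err != nil {
		return 0, err
	}
	if sts.Spec.Replicas == nil {
		return 1, nil // Kubernetes defaults replicas to 1 when unset
	}
	return *sts.Spec.Replicas, nil
}

func main() {
	n, err := configuredReplicas(context.Background())
	if err != nil {
		panic(err)
	}
	fmt.Printf("expected group size: %d\n", n)
}
```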

If that's required, it raises the question of whether Serf is necessary at all. Orchestrator could stay agnostic and just provide add/drop endpoints, or a full RaftNodes replacement endpoint, and any user of Orchestrator would be responsible for maintaining that list. It would be fully backwards compatible, since existing users would simply never update the list, but it would enable dynamic clusters for people willing to set things up.
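
A hedged sketch of what a full RaftNodes replacement endpoint could look like; the `/api/raft-nodes` path and the `applyRaftNodes` callback are hypothetical, not an existing orchestrator API:

```go
package main

import (
	"encoding/json"
	"net/http"
)

// raftNodesHandler accepts a JSON array of node addresses and hands it to a
// hypothetical applyRaftNodes callback that reconciles raft membership.
func raftNodesHandler(applyRaftNodes func([]string) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "POST only", http.StatusMethodNotAllowed)
			return
		}
		var nodes []string
		if err := json.NewDecoder(r.Body).Decode(&nodes); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// The caller (a Kubernetes hook, an external script, a human) owns
		// the list; orchestrator only reconciles membership against it.
		if err := applyRaftNodes(nodes); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusNoContent)
	}
}

// Usage: http.Handle("/api/raft-nodes", raftNodesHandler(apply))
```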

shlomi-noach commented 6 years ago

This introduces further complications. When you add/remove nodes, you may only communicate that to the leader. You need to announce to the leader: "this node has left".

(BTW this will require an upgrade of the raft library, but that's beside the point.)

And will Kubernetes or your external tool orchestrate that well? What if the leader itself is one of the nodes to leave? It will take a few seconds for a new leader to step up. Will you initiate the step-down yourself (not supported by hashicorp/raft; my own addition) first, then wait for a new leader to step up, then communicate to it the removal of the previous leader? Or would you just let the leader leave, wait a few seconds, look for the new leader, and then advertise the fact that others have left?
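
For reference, a sketch of that dance as it might look with a recent hashicorp/raft (the thread notes the library version used at the time lacked step-down, so this is illustrative only):

```go
package main

import (
	"time"

	"github.com/hashicorp/raft"
)

// removeNode applies a membership change, which only the leader may do.
// If the node being removed is the leader itself, leadership is handed
// off first and the new leader is expected to perform the removal.
func removeNode(r *raft.Raft, selfID, removeID raft.ServerID) error {
	if r.State() != raft.Leader {
		return raft.ErrNotLeader
	}
	if removeID == selfID {
		if err := r.LeadershipTransfer().Error(); err != nil {
			return err
		}
		return raft.ErrNotLeader // retry the removal against the new leader
	}
	return r.RemoveServer(removeID, 0, 10*time.Second).Error()
}
```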

Another question which leaves me confused is how you would bootstrap this in the first place. The first node to run -- will it be the leader of a single-node group? (You must enable that.) Which of the nodes will it be, as you bootstrap a new cluster? How do you then disable the option of being the leader of a single-node group? (Because you don't want that, and at this time it seems to me like something that is fixed and cannot be changed dynamically.)
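
For reference on the bootstrap question: a recent hashicorp/raft exposes BootstrapCluster, which is meant to run exactly once, on the very first node of a brand-new group, which is precisely the "which node, and how do you turn it off afterwards" problem. A sketch:

```go
package main

import "github.com/hashicorp/raft"

// bootstrapFirstNode turns a pristine node into a single-member group that
// elects itself leader. BootstrapCluster returns an error if the node
// already has state, so it only applies to the very first node of a
// brand-new cluster.
func bootstrapFirstNode(r *raft.Raft, selfID raft.ServerID, selfAddr raft.ServerAddress) error {
	cfg := raft.Configuration{
		Servers: []raft.Server{
			{ID: selfID, Address: selfAddr},
		},
	}
	return r.BootstrapCluster(cfg).Error()
}
```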

I'm seeing a lot of open questions here, which get moved from one place to another.

I'm wondering, can we look at something that works similarly? Do you have a Consul setup where Kubernetes is free to remove and add nodes? Do you have an external script to run those change updates? Does that work? If we can mimic something that already exists, that would be best.

derekperkins commented 6 years ago

cc @bbeaudreault @acharis

shlomi-noach commented 6 years ago

I'm going to chat later with someone who has more experience with distributed systems, who may be able to clarify some points.

shlomi-noach commented 6 years ago

I had a discussion with @oritwas, who has worked on and still works on distributed systems, and who has implemented Paxos in the past. Disclosure: my partner.

She was able to present further failure and split-brain scenarios with dynamic consensus groups.

In her experience, going static was the correct approach, possibly hiding host changes behind a proxy/DNS (which is what ClusterIP does). Pursuing closing the gaps of dynamic membership complicated the algorithm and code to the extent that it was not worthwhile to write or feasible to maintain, especially at small group sizes, where a rolling restart proves simpler and more reliable.

This would be especially true with orchestrator being a supporting part of the infrastructure: having orchestrator leaderless or completely down for a few seconds only imposes a potential delay in recovering a failure, with no risk of missing anything.

enisoc commented 6 years ago

I'm wondering, can we look at something that works similarly? Do you have a consul setup where kubernetes is free to remove and add nodes?

I don't know how they do it, or what guarantees they make, but etcd-operator is able to dynamically manage members in an etcd cluster:

https://github.com/coreos/etcd-operator#resize-an-etcd-cluster

shlomi-noach commented 6 years ago

Had a discussion with @lefred from Oracle about InnoDB Cluster's group communication. It seems there's a safe way to add/remove nodes:

I need to dig into this.