travisjeffery / jocko

Kafka implemented in Golang with built-in coordination (No ZK dep, single binary install, Cloud Native)
https://twitter.com/travisjeffery
MIT License

Understanding use of raft+serf in jocko #140

Open candlerb opened 6 years ago

candlerb commented 6 years ago

I'd like to better understand how jocko is intended to use raft and serf, including how the number of broker nodes is managed (i.e. how this interacts with performing membership changes in the consensus algorithm).

I'll start by outlining a couple of other systems to compare.

Kafka

In Kafka, you have a completely separate Zookeeper ensemble for storing state, with its own consensus algorithm. You could have (for example) a 3-node Zookeeper ensemble but a cluster of 10 Kafka brokers.

AFAICS, Zookeeper is manually configured with its set of nodes in zookeeper.properties, and Kafka is configured with the list of Zookeeper nodes to communicate with in zookeeper.connect
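For concreteness, the static configuration looks roughly like this (hostnames and ports here are just illustrative):

```properties
# zookeeper.properties - the ensemble members are listed statically
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888

# server.properties on each Kafka broker - points at the ensemble
zookeeper.connect=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
```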

There is a very specific process for growing the number of nodes in the Zookeeper cluster without breaking quorum.

Consul

Consul's architecture has:

- Each agent's configuration marks whether it is a -server or not.
- It's recommended that no more than 5 agents run in server mode.
- The process of adding and removing server nodes without breaking Raft is documented. In particular, to shrink the Raft cluster you have to issue a "leave" command to the running cluster.

Adding a new server is straightforward. You can issue a join command pointing at any existing node (even a non-server node). It will learn the set of server nodes and configure itself correctly.
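For comparison with the Kafka configuration above, a rough sketch of those operations (the flags and addresses are only illustrative, not a complete setup):

```sh
# run an agent as one of the 3-5 raft servers
consul agent -server -bootstrap-expect=3 -data-dir=/var/lib/consul

# grow the cluster: point any new agent at an existing member
consul join 10.0.0.11

# shrink the raft cluster cleanly: tell a running server to leave
consul leave
```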

Jocko

This is where I'm unclear. It seems to me that the design of Jocko is that every broker is also a Raft node, which has a number of implications.

Questions

It's not clear to me why Jocko chose to integrate Raft and Serf directly, rather than make use of a separate service (e.g. consul, etcd). Certainly it makes firing up a quick one-node cluster easier if there's only one daemon. But longer term, I think that a separate consensus cluster would be better tested, have well-documented bootstrap / scale up / scale down semantics, and allow you to scale the number of broker nodes separately from the number of consensus nodes.

It might be that Jocko eventually plans to allow different types of node (Broker+Raft and Broker Only). If so, I think it will require a bunch more configuration, which will make it more complicated to set up - at which point a separate consensus service would probably be clearer and simpler to operate, especially as you may already have it running for other purposes.

Also, the process for safely removing a Broker+Raft node will have to be carefully designed and tested - whereas with Kafka, the process for removing a broker node (which involves redistributing topics between brokers) is completely independent of removing a zookeeper node (which involves changing the member set of the consensus protocol).

sslavic commented 6 years ago

@candlerb FYI https://thehoard.blog/building-a-kafka-that-doesnt-depend-on-zookeeper-2c4701b6e961

candlerb commented 6 years ago

What it says in that article worries me, because I believe that membership changes of a consensus cluster need to be done in a controlled manner.

In other words:

  1. The only safe way to know how many members there are in the Raft group, and who they are, is to go through the Raft protocol - not gossip. Adding or removing nodes from the Raft group should be done via the Raft membership-change process described in section 6 of the Raft paper (see the sketch after this list).

    Consul uses gossip to advertise the server set to all nodes, but I don't believe it uses it to manage the server set (remember that a consul "server" is one which is a raft node). If you want to add or remove a server, there are explicit "join" and "leave" processes. Without this, you couldn't tell the difference between a server node failing, and the intentional permanent removal of a server node: the former does not change the quorum, the latter does.

  2. All brokers need to agree on the broker topology - how many brokers there are, who they are, how partitions are assigned to leaders and replicas, and so on. So this needs to go via Raft. You cannot simply have brokers appear and disappear, and rely on gossip to know what the confirmed member set is.

    There may be some value in knowing which brokers are down, as there's no point in the leader attempting to replicate a partition to them - but you could just try and fail. You need to be able to handle the failure case anyway, because a node could have failed before the gossip protocol got around to detecting it.

  3. Serf could be used by the brokers to learn each other's IP addresses and ports, but since the cluster configuration needs to go via Raft anyway, I don't see why you'd want a separate discovery service: just store the addresses and ports in Raft.
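To make point 1 concrete, here is a minimal sketch of an explicit membership change using hashicorp/raft (the raft library Jocko builds on). The helper functions, IDs and timeouts are mine, purely for illustration:

```go
package main

import (
	"time"

	"github.com/hashicorp/raft"
)

// addVoter asks the current leader to add a new raft replica. The change is
// committed through the raft log itself, not learned via gossip.
func addVoter(r *raft.Raft, id raft.ServerID, addr raft.ServerAddress) error {
	return r.AddVoter(id, addr, 0, 10*time.Second).Error()
}

// removeServer permanently removes a replica, which shrinks the quorum. This
// is distinct from the node merely failing, which leaves the quorum unchanged.
func removeServer(r *raft.Raft, id raft.ServerID) error {
	return r.RemoveServer(id, 0, 10*time.Second).Error()
}
```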

The section of that article headed "How Jocko handles state changing requests" is fine. It shows topic state updates going via Raft, which is the right thing to do. But that process doesn't require any gossip.
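That flow is straightforward to sketch with hashicorp/raft: the request is serialized and committed through the log, and every replica's FSM applies it, so all raft nodes converge on the same topic state. The command type and encoding below are purely illustrative, not Jocko's wire format:

```go
package main

import (
	"encoding/json"
	"time"

	"github.com/hashicorp/raft"
)

// topicCommand is a made-up payload for a state-changing request.
type topicCommand struct {
	Op         string `json:"op"` // e.g. "create_topic"
	Topic      string `json:"topic"`
	Partitions int    `json:"partitions"`
}

// createTopic commits the command to the raft log on the leader; each
// replica's FSM then applies it, so topic state never depends on gossip.
func createTopic(r *raft.Raft, topic string, partitions int) error {
	cmd, err := json.Marshal(topicCommand{Op: "create_topic", Topic: topic, Partitions: partitions})
	if err != nil {
		return err
	}
	return r.Apply(cmd, 10*time.Second).Error()
}
```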

I suppose the fundamental question is this: in the long run, is it intended that in a Jocko cluster, ALL brokers are Raft servers, or A SUBSET of brokers are Raft servers?

If the answer is "ALL brokers": then there is no need for gossip, since every broker is a raft node, and it can learn all it needs to know through raft. But this setup won't scale: a 20-node broker cluster also implies a 20-node Raft cluster. Plus, there will be problems growing and shrinking the cluster, as you have to change the raft membership at the same time as you change the broker topology.

If the answer is "A SUBSET of brokers": then the gossip protocol does have value (to locate at least one Raft server node), and it will scale. But then you effectively have a separate pool of brokers and a separate pool of raft servers. You might as well just run a separate consul or etcd, which is much simpler to understand and manage.

hashgupta commented 6 years ago

@candlerb so if I am understanding you correctly, serf's role in service discovery and configuration is redundant because raft can store all the config that's needed. Therefore we can remove serf as a dependency?

candlerb commented 6 years ago

IFF every broker will also be a raft node/replica, then there's no need for service discovery.

Consul separates the concepts for scalability. If you have 1000 nodes in a data centre, you definitely do not want 1000 raft replicas (with a quorum of 501) as it would be terribly inefficient. What you want is 3 or 5 stable raft replicas, plus a way for the other 997 or 995 nodes to find them.
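With hashicorp/serf (the gossip library Jocko builds in), that lookup could look roughly like the sketch below: every node joins the gossip pool, the raft replicas advertise themselves via a tag, and everyone else uses the membership list only to find them. The tag names and helper functions are illustrative, not Jocko's actual scheme:

```go
package main

import (
	"fmt"

	"github.com/hashicorp/serf/serf"
)

// startSerf joins the gossip pool and tags this node so that others can tell
// whether it participates in raft. Tag names here are made up for the sketch.
func startSerf(isRaftNode bool, existing []string) (*serf.Serf, error) {
	conf := serf.DefaultConfig()
	conf.Tags = map[string]string{"role": "broker"}
	if isRaftNode {
		conf.Tags["raft"] = "true"
	}

	s, err := serf.Create(conf)
	if err != nil {
		return nil, err
	}
	if _, err := s.Join(existing, true); err != nil {
		return nil, err
	}
	return s, nil
}

// raftPeers scans the gossip membership for nodes advertising the raft tag,
// giving the other nodes at least one raft server to talk to.
func raftPeers(s *serf.Serf) []string {
	var peers []string
	for _, m := range s.Members() {
		if m.Tags["raft"] == "true" && m.Status == serf.StatusAlive {
			peers = append(peers, fmt.Sprintf("%s:%d", m.Addr, m.Port))
		}
	}
	return peers
}
```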

travisjeffery commented 6 years ago

Not all brokers will be raft nodes/replicas because it's not necessary. The plan is to make only a subset of brokers have to run raft, all run serf for service discovery.

travisjeffery commented 6 years ago

The point of building in serf/raft is that it isn't that much work to do, and I don't have to rely on another service. There are also advantages in terms of control and more hooks to tie into.

candlerb commented 6 years ago

The plan is to make only a subset of brokers have to run raft, all run serf for service discovery.

IMO that is the right answer; the one which scales anyway.

But looking at the command line options, there doesn't seem to be any way at the moment to create a cluster with a mixture of raft and non-raft brokers; add or remove a raft broker; add or remove a non-raft broker; and so on. These are non-trivial operations, especially changing the number of raft nodes. You might also want nodes which run raft but are not brokers at all (e.g. a set of reliable, well-connected machines without much disk space).

There is still much complexity to add before this works. However, if you use a separate service like consul or etcd, all these aspects are taken care of and documented. To me, that's the right way to do it: follow the Unix model. Each component does one thing and does it well. Connect together the components you need.

eugene-bright commented 3 years ago

Are there any plans to support etcd in addition to the current raft+serf scheme?