The clustering megathread

coffeemug commented 10 years ago

Our clustering implementation has a lot of limitations, both in performance/scalability, and operations. We have lots of open issues about various problems. Many of them are related on the product level, many are related on the code/architecture level, many are unrelated. We'll have to do significant reworking and I'd like to start a discussion about the overall refactor/rearchitecture here.

Here are the issues we'll need to fix:

1900 -- improve blueprint suggester to make sure automatic node allocations are (more) balanced in various scenarios.
1876, #223 -- automatic failover.
1774 -- sharding without losing availability.
1861 -- directory scalability.
364, #1679 -- consistent hash sharding.
1862 -- custom sharding functions. That is, I want primaries for my US users to be in datacenter A, and primaries for my EU users to be in datacenter B.
1858, #1859 -- more flexible outdated reads routing. E.g. "run outdated, but make sure it's not older than token X", "run outdated and route to datacenter A", "run outdated if there is no primary in the DC I'm connected to, otherwise run up-to-date".
1792, #1392 -- programmatic access to cluster/statistics via ReQL.
1906 -- make networking more datacenter-aware (e.g. can servers within one datacenter communicate only through private IPs/internal network interfaces?)
83, #301, #1905 -- cluster setup flow. Specifically duplicate database "test" issues, and connecting two disjoint clusters by accident.

EDIT:

Better issue resolution. A problem with one table shouldn't make administration of the whole cluster impossible until the issue is resolved. A machine down that doesn't result in unsatisfiable goals shouldn't be treated as an emergency. A netsplit between datacenters shouldn't inundate the web ui in a hundred issues. We need to treat machine failure and nesplits like regular events rather than something out of the ordinary.

EDIT2:

2119 -- better/different presentation of acks/replicas, and possibly some latency minimization query routing strategies.

We could make this a bazaar or a cathedral or anything in between. Pinging @Tryneus and @timmaxw. I'd like your thoughts and proposal on how to go about this.

EDIT3:

2083 -- our query routing algorithm (basically we pick a random node) is too simple and causes latency issues.

timmaxw commented 10 years ago

A lot of these suggestions involve adding features to the cluster administration code. We should consider making it possible for external programs to bypass/replace the C++ cluster administration code. I made a separate issue to keep the discussion organized. See #1913.

Tryneus commented 10 years ago

So I went through and identified the biggest problems, prioritized them, and have a rough outline of a proposal in 6 phases:

Phase 1: Directory and Blueprint optimization

Remove the NOTHING role from the Blueprint
- Will drastically reduce the size of the blueprint in large clusters
Remove the nothing role from the Directory
- Will drastically reduce the size of the directory in large clusters
- Leave the nothing_when_safe and nothing_when_done_erasing roles
Need to know when to spawn nothing roles to perform backfilling and deletion

Phase 2: Auto-Failover

Integrate a Paxos or Raft (or some other consensus) library as a new service in the connectivity layer
- As opposed to a daemon, which would add more dependencies and more complicated cluster management for users
- BSD or otherwise free library we can build straight in or staticly link would be preferred
- libpaxos3 (https://bitbucket.org/sciascid/libpaxos) - should be able to work this into our architecture without much trouble
- CRaft (https://github.com/willemt/CRaft) - simpler interface, but missing some key features, we could contribute
- Alternative is to implement a consensus algorithm ourselves, but that is a last resort
Consensus library will be used to synchronize a Master Failover Table
- When the master failover table changes, reactor roles need to be reevaluated
There appears to be a window for data divergence in how writes are committed
- When recovering from a failover, we need the new master to eclipse the old master, if divergence is detected.
- This should not be an error from the user's perspective, as any rollbacked writes will not have been acked to any client

Process to elect a new master: On peer disconnect

For each shard the peer was master of (known by checking the blueprint and master failover table):

If the table has a number of write acks greater than half the number of replicas (rounded up), we can attempt failover without worrying about data divergence

Propose a new master from the list of connected peers hosting a replica for that shard

Leader polls each replica for the shard to ensure that contact to the peer is gone

If at least half the replicas (rounded up) cannot contact the peer, commit the change to the master failover table

On peer reconnect

For each shard the peer should be master of, but isn't (known by checking the blueprint and master failover table):

Propose that the peer becomes master

Leader polls each replica for the shard to ensure that contact to the peer is restored

If at least half the replicas (rounded up) can contact the peer, commit the change to the master failover table

Phase 3: ReQL Cluster API

r.cluster() operations to change semilattice metadata
Ideally, treat the semilattice metadata as a virtual table or tables
Change the admin CLI and web UI to use this interface

Phase 4: Minimal downtime on blueprint change

Change reactor shutdown such that a machine will continue serving queries until a new master is available
Could probably reuse a lot of the auto failover logic

Phase 5: Blueprint generation without full connectivity

Not sure if this is possible, but we need it

Phase 6: More Directory and Blueprint optimization

Use hash maps instead of trees for table and mailbox lookup
- Important if we want to scale to hundreds of thousands of tables
Consider using a single mailbox per core for all tables, rather than one for each table
- This would make the directory much smaller, at the cost of complicating reactor transitions
Deterministic blueprints
- Move the blueprint out of the cluster semilattice metadata
- Save time distributing blueprints, which could feasibly be dozens of MB in size
- Instead, pass around blueprint hashes, and only send the entire blueprint if there is a desync

As you can see, there are a number of open questions. The biggest are: Phase 2: Which consensus library to use and how to integrate it into our clustering system Phase 3: ReQL Cluster API description Phase 3: How to interface a virtual table with the rest of the ReQL protocol Phase 5: Feasibility and architecture design

Phases 1 and 6 are optimization changes, but phase 1 is there because it should be relatively easy and hits one of the biggest bottlenecks in the current state of things. Phases 2 and 3 are necessary for a production-ready product. I wouldn't say phases 4 and 5 are necessary for production-readiness, but I would strongly recommend them.

danielmewes commented 10 years ago

@Tryneus This sounds like a great plan.

Depending on how difficult phase 5 is, we should probably do it earlier because it impacts users a lot.

timmaxw commented 10 years ago

Phase 1 sounds like a really good optimization. When removing "nothing" from the directory, there is a small complication to watch out for. When the current system is switching between two roles, it temporarily has no directory entry. For example, when it is switching from "primary" to "secondary", it first removes the "primary" entry and then adds the "secondary" entry, so it briefly has no entry at all. It's important to distinguish between this case and a true "nothing". This should be easy to fix by inserting a placeholder during the switch-over, or perhaps by leaving the old entry until the new entry is ready.

Your description for Phase 2 is not very clear, and a lot of details are missing. Would you mind writing it up in more detail? Perhaps it should be a separate issue, to keep the discussion organized.

You propose to move the web UI into a separate process (presumably Python) in Phase 3. I suggest you consider moving the suggester and auto-failover logic into the separate process as well. Then, if you do Phase 3 before Phase 2, the consensus system could be integrated in Python rather than C++, which might be nicer.

coffeemug commented 10 years ago

You propose to move the web UI into a separate process (presumably Python) in Phase 3

I don't think @Tryneus proposed that. I think he meant implementing a ReQL API to conveniently manage the cluster via client drivers, and then switching the WebUI and the Admin UI to use this API instead of writing to semilattices directly. That doesn't require moving the WebUI out.

timmaxw commented 10 years ago

Oh! I thought he was proposing to partially implement the idea from #1913. Oops.

Tryneus commented 10 years ago

I think #1913 is a good idea, and it could even fit into one of the phases I proposed above, but it is also orthogonal to a lot of these features. It would be really nice if we could use some higher-level constructs for dealing with the goals, blueprints, and even consensus (the libraries available for C/C++ are pretty sparse), but at the same time, it wouldn't necessarily affect the user experience for some time.

timmaxw commented 10 years ago

I agree. The advantage of doing it soon is that the longer we wait, the more code we have to write in C++ and later port to Python. The disadvantage is that it will take a lot of time, and there is no direct payoff in terms of user experience or making any particular issue easier to solve. So the correct answer depends on the development schedule, which I don't know very much about. I just wanted to suggest that you keep it in mind. :smiley:

mlucy commented 10 years ago

I'm extremely skeptical of outside paxos libraries.

libpaxos3 looks like it was written by some dude (http://atelier.inf.usi.ch/~sciascid/), has like 4 simple tests (https://bitbucket.org/sciascid/libpaxos/src/20414d195443e9fe82973f0a0be8c5a3bd24e954/unit/?at=master), doesn't appear to have a bug tracker, had a super-low-traffic mailing list (http://sourceforge.net/mailarchive/forum.php?forum_name=libpaxos-general), and has bug reports in said mailing list that don't seem to all be resolved.

mlucy commented 10 years ago

If we are forced to choose an external library, we should also look at https://github.com/logcabin/logcabin . It claims it isn't ready for production use, which I honestly take as a positive signal in this case, and it's written by Diego Ongaro, who's one of the authors on https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf .

josephglanville commented 10 years ago

@mlucy I would also vote for Logcabin. It's the reference implementation of Raft. You wouldn't need to use the entire implementation, it would be feasible to just extract the Raft algorithm (1200LOC or so) and implement your own log + snapshotting.

coffeemug commented 10 years ago

Also note #2083 is a part of this.

neumino commented 10 years ago

I talked a little with @timmaxw last Monday and asked him a few questions. I don't remember all of them, but here are a few (feel free to open an issue if they are relevant)

How are we going to relieve the master of a shard (now it takes all the writes+read) -- slightly related to https://github.com/rethinkdb/rethinkdb/issues/2119
Can we remove the state of each hash shard and instead send back one value (whether the shard is ready or not). That should decreaze the size of the directory roughly by 8.
Can we clean the content of ajax/stat. They are pretty big, and the web interface filter them with a regex, but from what I know, the whole thing is sent between servers, which probably create a lot of intra-cluster traffic. And actually, same question for the content in ajax

I'm pretty sure there was other stuff, but I somehow can't remember. I'll add another comment if something comes up.

danielmewes commented 9 years ago

The ideas in here have pretty much been translated into the ReQL admin interface that we shipped with 1.16 and the Raft rework that @timmaxw has been working on for the past months.

There are some separate issues (query routing, hash sharding etc.) that are already tracked elsewhere.

I think this thread has outlived its usefulness.

rethinkdb / rethinkdb

The clustering megathread #1911

1900 -- improve blueprint suggester to make sure automatic node allocations are (more) balanced in various scenarios.

1876, #223 -- automatic failover.

1774 -- sharding without losing availability.

1861 -- directory scalability.

364, #1679 -- consistent hash sharding.

1862 -- custom sharding functions. That is, I want primaries for my US users to be in datacenter A, and primaries for my EU users to be in datacenter B.

1858, #1859 -- more flexible outdated reads routing. E.g. "run outdated, but make sure it's not older than token X", "run outdated and route to datacenter A", "run outdated if there is no primary in the DC I'm connected to, otherwise run up-to-date".

1792, #1392 -- programmatic access to cluster/statistics via ReQL.

1906 -- make networking more datacenter-aware (e.g. can servers within one datacenter communicate only through private IPs/internal network interfaces?)

83, #301, #1905 -- cluster setup flow. Specifically duplicate database "test" issues, and connecting two disjoint clusters by accident.

2119 -- better/different presentation of acks/replicas, and possibly some latency minimization query routing strategies.

2083 -- our query routing algorithm (basically we pick a random node) is too simple and causes latency issues.