rethinkdb / rethinkdb

The open-source database for the realtime web.
https://rethinkdb.com

The clustering megathread #1911

Closed. coffeemug closed this issue 9 years ago.

coffeemug commented 10 years ago

Our clustering implementation has a lot of limitations, both in performance/scalability and in operations. We have lots of open issues about various problems. Many of them are related at the product level, many are related at the code/architecture level, and some are unrelated. We'll have to do significant reworking, and I'd like to start a discussion about the overall refactor/rearchitecture here.

Here are the issues we'll need to fix:

EDIT:

EDIT2:

We could make this a bazaar or a cathedral or anything in between. Pinging @Tryneus and @timmaxw. I'd like your thoughts and proposal on how to go about this.

EDIT3:

timmaxw commented 10 years ago

A lot of these suggestions involve adding features to the cluster administration code. We should consider making it possible for external programs to bypass/replace the C++ cluster administration code. I made a separate issue to keep the discussion organized. See #1913.

Tryneus commented 10 years ago

So I went through and identified the biggest problems, prioritized them, and put together a rough outline of a proposal in six phases:


Phase 1: Directory and Blueprint optimization


Phase 2: Auto-Failover

Process to elect a new master (the majority checks used by both procedures are sketched in code after the reconnect steps below):

On peer disconnect

For each shard the peer was master of (known by checking the blueprint and master failover table):

  1. If the table has a number of write acks greater than half the number of replicas (rounded up), we can attempt failover without worrying about data divergence
  2. Propose a new master from the list of connected peers hosting a replica for that shard
  3. Leader polls each replica for the shard to ensure that contact to the peer is gone
  4. If at least half the replicas (rounded up) cannot contact the peer, commit the change to the master failover table

On peer reconnect

For each shard the peer should be master of, but isn't (known by checking the blueprint and master failover table):

  1. Propose that the peer becomes master
  2. Leader polls each replica for the shard to ensure that contact to the peer is restored
  3. If at least half the replicas (rounded up) can contact the peer, commit the change to the master failover table
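
To make the arithmetic above concrete, here is a minimal sketch under assumed names (none of this is actual RethinkDB code). The ack check is written as a strict majority, since "greater than half the number of replicas (rounded up)" is ambiguous for even replica counts:

```python
import math

def half_rounded_up(n):
    """'At least half the replicas (rounded up)', per disconnect step 4
    and reconnect step 3 above."""
    return math.ceil(n / 2)

def can_attempt_failover(num_replicas, num_write_acks):
    # Disconnect step 1: failover cannot diverge only if every acked
    # write reached a strict majority of replicas, so any new master's
    # majority overlaps the set that acknowledged each write.
    return 2 * num_write_acks > num_replicas

def should_commit(poll_results):
    # Disconnect step 4 / reconnect step 3: one boolean per polled
    # replica, True if it agrees with the proposed connectivity change.
    return sum(poll_results) >= half_rounded_up(len(poll_results))
```

For example, with three replicas and two write acks, `can_attempt_failover(3, 2)` holds, and `should_commit([True, True, False])` is enough to commit the change to the master failover table.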

Phase 3: ReQL Cluster API


Phase 4: Minimal downtime on blueprint change


Phase 5: Blueprint generation without full connectivity


Phase 6: More Directory and Blueprint optimization


As you can see, there are a number of open questions. The biggest are:

- Phase 2: Which consensus library to use and how to integrate it into our clustering system
- Phase 3: ReQL Cluster API description
- Phase 3: How to interface a virtual table with the rest of the ReQL protocol
- Phase 5: Feasibility and architecture design

Phases 1 and 6 are both optimizations, but phase 1 comes first because it should be relatively easy and it addresses one of the biggest bottlenecks in the current system. Phases 2 and 3 are necessary for a production-ready product. I wouldn't say phases 4 and 5 are necessary for production-readiness, but I would strongly recommend them.

danielmewes commented 10 years ago

@Tryneus This sounds like a great plan.

Depending on how difficult phase 5 is, we should probably do it earlier because it impacts users a lot.

timmaxw commented 10 years ago

Phase 1 sounds like a really good optimization. When removing "nothing" from the directory, there is a small complication to watch out for. When the current system is switching between two roles, it temporarily has no directory entry. For example, when it is switching from "primary" to "secondary", it first removes the "primary" entry and then adds the "secondary" entry, so it briefly has no entry at all. It's important to distinguish between this case and a true "nothing". This should be easy to fix by inserting a placeholder during the switch-over, or perhaps by leaving the old entry until the new entry is ready.
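
A minimal sketch of the placeholder idea, with invented names (the real directory code is C++ and structured differently); the point is that an observer can always tell a role switch apart from a true "nothing":

```python
from enum import Enum

class Role(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"
    TRANSITIONING = "transitioning"  # placeholder during a role switch
    NOTHING = "nothing"              # genuinely hosting nothing

def switch_role(directory, shard, old_role, new_role):
    # Publish the placeholder before tearing down the old role, so the
    # shard's entry never disappears and a watcher never mistakes the
    # switch-over for a true NOTHING.
    assert directory[shard] is old_role
    directory[shard] = Role.TRANSITIONING
    # ... tear down old_role, set up new_role ...
    directory[shard] = new_role
```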

Your description of Phase 2 is not very clear, and a lot of details are missing. Would you mind writing it up in more detail? Perhaps it should be a separate issue, to keep the discussion organized.

You propose to move the web UI into a separate process (presumably Python) in Phase 3. I suggest you consider moving the suggester and auto-failover logic into the separate process as well. Then, if you do Phase 3 before Phase 2, the consensus system could be integrated in Python rather than C++, which might be nicer.

coffeemug commented 10 years ago

You propose to move the web UI into a separate process (presumably Python) in Phase 3

I don't think @Tryneus proposed that. I think he meant implementing a ReQL API to conveniently manage the cluster via client drivers, and then switching the WebUI and the Admin UI to use this API instead of writing to semilattices directly. That doesn't require moving the WebUI out.
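
For reference, the ReQL admin API that eventually shipped in 1.16 (see the closing comment below) took roughly this shape; this example uses the released Python driver, not anything that existed at the time of this discussion:

```python
import rethinkdb as r

conn = r.connect("localhost", 28015)

# Cluster state is exposed as ordinary ReQL tables in the `rethinkdb`
# database, so drivers and the web UI can query it through the normal
# protocol instead of writing to semilattices directly.
servers = list(r.db("rethinkdb").table("server_status").run(conn))

# Configuration changes are ordinary writes and commands:
r.db("test").table("users").reconfigure(shards=2, replicas=3).run(conn)
```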

timmaxw commented 10 years ago

Oh! I thought he was proposing to partially implement the idea from #1913. Oops.

Tryneus commented 10 years ago

I think #1913 is a good idea, and it could even fit into one of the phases I proposed above, but it is also orthogonal to a lot of these features. It would be really nice if we could use some higher-level constructs for dealing with the goals, blueprints, and even consensus (the libraries available for C/C++ are pretty sparse), but at the same time, it wouldn't necessarily affect the user experience for some time.

timmaxw commented 10 years ago

I agree. The advantage of doing it soon is that the longer we wait, the more code we have to write in C++ and later port to Python. The disadvantage is that it will take a lot of time, and there is no direct payoff in terms of user experience or making any particular issue easier to solve. So the correct answer depends on the development schedule, which I don't know very much about. I just wanted to suggest that you keep it in mind. :smiley:

mlucy commented 10 years ago

I'm extremely skeptical of outside paxos libraries.

libpaxos3 looks like it was written by some dude (http://atelier.inf.usi.ch/~sciascid/), has like 4 simple tests (https://bitbucket.org/sciascid/libpaxos/src/20414d195443e9fe82973f0a0be8c5a3bd24e954/unit/?at=master), doesn't appear to have a bug tracker, has a super-low-traffic mailing list (http://sourceforge.net/mailarchive/forum.php?forum_name=libpaxos-general), and has bug reports in that mailing list that don't all seem to be resolved.

mlucy commented 10 years ago

If we are forced to choose an external library, we should also look at https://github.com/logcabin/logcabin . It claims it isn't ready for production use, which I honestly take as a positive signal in this case, and it's written by Diego Ongaro, who's one of the authors on https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf .

josephglanville commented 10 years ago

@mlucy I would also vote for LogCabin. It's the reference implementation of Raft. You wouldn't need to use the entire implementation; it would be feasible to extract just the Raft algorithm (1200 LOC or so) and implement your own log + snapshotting.
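
Roughly, that split would leave the extracted Raft core calling into storage that we supply. A hypothetical sketch of that surface in Python (class and method names are invented for illustration; this is not LogCabin's API):

```python
class InMemoryRaftLog:
    """Stand-in for the persistent log a Raft core drives. A real
    implementation would fsync entries to disk before acknowledging."""

    def __init__(self):
        self.entries = []  # (term, payload) tuples; Raft indexes from 1

    def append(self, term, payload):
        self.entries.append((term, payload))
        return len(self.entries)  # index of the appended entry

    def term_at(self, index):
        return self.entries[index - 1][0]

    def truncate_from(self, index):
        # A newly elected leader may overwrite a conflicting log suffix.
        del self.entries[index - 1:]

class SnapshotStore:
    """Stand-in for snapshotting: compacts the log prefix up to
    last_included_index into a single state blob."""

    def __init__(self):
        self.snapshot = None

    def save(self, last_included_index, last_included_term, state):
        self.snapshot = (last_included_index, last_included_term, state)

    def load(self):
        return self.snapshot
```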

coffeemug commented 10 years ago

Also note #2083 is a part of this.

neumino commented 10 years ago

I talked a little with @timmaxw last Monday and asked him a few questions. I don't remember all of them, but here are a few (feel free to open an issue if they are relevant)

I'm pretty sure there was other stuff, but I somehow can't remember. I'll add another comment if something comes up.

danielmewes commented 9 years ago

The ideas in here have pretty much been translated into the ReQL admin interface that we shipped with 1.16 and the Raft rework that @timmaxw has been working on for the past few months.

There are some separate issues (query routing, hash sharding etc.) that are already tracked elsewhere.

I think this thread has outlived its usefulness.