jepsen-io / maelstrom

A workbench for writing toy implementations of distributed systems.
Eclipse Public License 1.0

Specialized nodes #8

Open Vanlightly opened 3 years ago

Vanlightly commented 3 years ago

I was interested in modelling Apache BookKeeper, to compare the experience with modelling it in TLA+. One slight difficulty with BK is that there are 3 node types: metadata_store (zk), bookie (storage node) and bk_client (serving layer where all the consensus logic is). The bk_clients are the only externally visible nodes of the protocol; all requests go to the bk_clients, which in turn interact with the metadata_store and bookies.

Currently all nodes are treated the same by Maelstrom, meaning I will need to proxy messages between nodes and use node IDs to determine node type. When a node initializes, I could ensure that the first is a metadata_store, the next two are bk_clients, and the rest are bookies. Metadata_store and bookie nodes would then need to proxy Maelstrom messages to bk_client nodes. But this would mean adding communication between nodes that does not exist in the real protocol, so that each node is always able to forward messages to the right node.
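A minimal sketch of that assignment scheme, assuming roles are derived from a node's position in the node_ids list that Maelstrom's init message provides (the role names and the 1-metadata/2-client split are from the description above, not anything Maelstrom defines):

```python
# Derive a BookKeeper-style role from a node's position in the node_ids
# list delivered in Maelstrom's init message. The role names and the
# split (1 metadata_store, 2 bk_clients, rest bookies) are assumptions
# from the scheme sketched above, not part of Maelstrom itself.

def role_for(node_id, node_ids):
    """Assign a role deterministically from the shared node ID list."""
    idx = node_ids.index(node_id)
    if idx == 0:
        return "metadata_store"
    elif idx in (1, 2):
        return "bk_client"
    return "bookie"
```

Because every node receives the same node_ids list, each node can compute every other node's role locally, with no extra coordination messages.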

An alternative is to allow node specialization, or at least to mark which nodes form the public API surface of the protocol, in order to avoid this proxying. For my purposes, simply restricting the public API calls to particular nodes would be enough. Using node IDs to determine node types is not an issue for me.

aphyr commented 3 years ago

That's a good question, and it's not something I have a great answer for--there are lots of easy ways we could build node specialization, but I'm not sure which way would be easiest and most general for folks going forward. I also know that later on we're going to want some notion of "what datacenter does this node belong to", and that might be communicated in the same way as node roles. One option is to add datacenter and roles fields to the init message, or add specialized init_datacenters, init_roles messages... but before going there, we have to think about what node roles mean, and how they'll be used.

Node roles are tricky because they have semantic meaning not only for the nodes themselves but also for the workload! In your bookkeeper test, you want to have clients only send requests to specific roles. But different algorithms are going to have different ideas about what kinds of roles there are. It'd be weird to, say, hardcode bookkeeper-specific roles into a general-purpose ledger workload.

We might consider having two separate-but-overlapping concepts of roles, where maelstrom has some concept of, say, "nodes which can write", "nodes which can read", "nodes which accept no requests", and then the implementation is free to extend that structure from there, but it's sort of... tricky, right? How many of each should you assign? How should they be distributed across datacenters? We need some way to specify that sort of thing, and I'm not sure what the right shape is.

So... in light of this uncertainty, I have two possible paths that you might want to follow.

One is that if Maelstrom's workloads are working for you as-is, you just... do the proxying. It's like ten lines of code, and that frees you to assign node roles internally, as opposed to having to inform Maelstrom of how to assign and interpret those roles. Note that you can reply to a client's message from ANY node, so it's perfectly valid to wrap up the whole request you receive in a type: proxy body, send it to whatever other node you like, and have it evaluate the proxied message and reply directly to the Maelstrom client. I think this is your best bet.
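A sketch of that proxying idea, under stated assumptions: the Node class here is a minimal stand-in for a real Maelstrom node (which would read and write JSON lines on stdin/stdout), and the "proxy" message type, CLIENT_RPCS set, and helper names are our own illustration, not Maelstrom's protocol. Only the routing logic is the point:

```python
# Sketch of wrapping a client request in a type: "proxy" body and
# forwarding it, as described above. Because any node may reply to a
# Maelstrom client, the node that unwraps the proxy can answer the
# original src directly -- no response routing back through the proxy.

CLIENT_RPCS = {"read", "write", "cas"}  # whatever the workload sends

class Node:
    def __init__(self, node_id, role, bk_clients, network):
        self.node_id, self.role = node_id, role
        self.bk_clients = bk_clients  # IDs of the bk_client nodes
        self.network = network        # node_id -> Node; in-process test stub
        self.handled = None           # last RPC this node actually served

    def send(self, dest, body):
        # A real node would print a JSON line; here we deliver in-process.
        self.network[dest].handle(
            {"src": self.node_id, "dest": dest, "body": body})

    def handle(self, msg):
        body = msg["body"]
        if body["type"] == "proxy":
            # Unwrap and process as if the client had contacted us; the
            # reply goes straight to the original msg["src"].
            self.handle(body["original"])
        elif body["type"] in CLIENT_RPCS and self.role != "bk_client":
            # Wrong node type for this RPC: wrap the whole request
            # untouched and forward it to a bk_client.
            self.send(self.bk_clients[0],
                      {"type": "proxy", "original": msg})
        else:
            self.handled = msg  # stand-in for real RPC handling + reply
```

With this in place, a bookie that receives a client read simply wraps and forwards it, and the bk_client that unwraps it still sees the original client as src.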

The other option is that you modify Maelstrom to define a workload specifically for testing Bookkeeper, and follow the broadcast workload's approach. You compute the bookkeeper-specific roles you'd like at the start of the test, and send a message to each node informing it of the roles you computed, and then generate operations based on those roles. That strongly couples the workload to bookkeeper itself, so it's not necessarily a workload you can reuse for other algorithms, but it might be advantageous for other reasons--you might want to, say, design a checker that verifies particular role-related invariants, and knowing what the bookkeeper roles are in the workload would let you do that.
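The role-assignment message in that second approach might be patterned after the broadcast workload's topology message. A sketch, assuming a hypothetical "assign_roles" message type (Maelstrom does not define this; a custom BookKeeper workload would have to send it itself):

```python
# Hypothetical "assign_roles" message a custom workload could send each
# node at test start, patterned after the broadcast workload's
# "topology" message. The type name and body shape are illustrative
# assumptions, not part of Maelstrom's protocol.
import json

def roles_message(src, dest, msg_id, roles):
    """Build the JSON line the workload would send to one node."""
    return json.dumps({"src": src, "dest": dest,
                       "body": {"type": "assign_roles",
                                "msg_id": msg_id,
                                "roles": roles}})

line = roles_message("c0", "n1", 1,
                     {"n0": "metadata_store",
                      "n1": "bk_client",
                      "n2": "bookie"})
```

Since the workload computes the role map itself, a matching checker could use that same map to verify role-specific invariants, as suggested above.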

Vanlightly commented 3 years ago

Indeed, generalising node roles is not an easy thing to design well. I think for now I will play with the existing workloads and proxying. BookKeeper is, after all, log storage, and I can project anything on top of a log, like a KV store. In terms of checkers, there are things like checking that metadata, bookies and clients do not diverge, but any divergence I care about (data loss and ordering) would be visible to the existing checkers.

aphyr commented 3 years ago

I've thought a little more about this--I think that for model-checking purposes, it might be nice to be able to require maelstrom itself as a library, and hook in your own custom workload somewhere. I'm not totally sure what that's going to look like, yet, but I'm keeping it on the back burner.