node_id ? - Githubissues

dominictarr commented 11 years ago

Suggest including the id of the node that this update originated on.

If this were included, you could implement master-master replication, although you'd have to relax the restriction that seq must be strictly increasing.

You could not guarantee that all seqs are increasing, but you could guarantee that seqs are strictly increasing per node_id.
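A minimal sketch of that relaxed rule, assuming updates are plain dicts and that seqs start at 1. The `PerNodeLog` class and its field names are hypothetical and not part of SLEEP:

```python
class PerNodeLog:
    """Hypothetical log that enforces increasing seqs per node_id only."""

    def __init__(self):
        self.latest = {}   # node_id -> highest seq accepted so far
        self.updates = []  # accepted updates, in arrival order

    def append(self, node_id, seq, data):
        """Store the update unless its seq is not newer for that node."""
        if seq <= self.latest.get(node_id, 0):
            return False  # duplicate or stale update from this node
        self.latest[node_id] = seq
        self.updates.append({"node_id": node_id, "seq": seq, "data": data})
        return True
```

Two different nodes can both contribute an update with seq 1 here, so the seqs in `updates` are not globally increasing, only increasing within each node_id.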

The node_id could be the hash of the public key of that node or something like that.

Adding a node_id would allow a scuttlebutt protocol to be implemented as a special case of this protocol.

mikeal commented 11 years ago

could we get the same functionality by simply allowing an array for the id?

then the id could be [nodeName, uuid] and whoever exposes the SLEEP endpoint for an aggregation of nodes is responsible for linearly increasing the aggregate sequence of all the nodes.

i think ensuring a linearly increasing seq is important. if that isn't ensured then you don't have a consistent way to pull with since= and get a consistent response. it would also make writing clients a lot harder.
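To illustrate why a single increasing seq makes since= simple, here is a rough sketch. The `changes_since` and `pull` helpers and the in-memory `feed` list are assumptions for illustration, not the SLEEP API:

```python
def changes_since(feed, since=0):
    """Return every entry whose seq is greater than `since`."""
    return [entry for entry in feed if entry["seq"] > since]


def pull(feed, last_seen):
    """Client side: fetch new entries and advance the remembered seq."""
    new_entries = changes_since(feed, last_seen)
    if new_entries:
        last_seen = max(entry["seq"] for entry in new_entries)
    return new_entries, last_seen


# Hypothetical feed with one global, increasing seq.
feed = [
    {"seq": 1, "id": "a", "data": "first"},
    {"seq": 2, "id": "b", "data": "second"},
]
entries, last_seen = pull(feed, 0)
```

The client only has to remember one number. If seqs could repeat or go backwards, `seq > since` would no longer identify exactly the entries the client has not seen.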

dominictarr commented 11 years ago

Oh, hmm. Just remembered the CouchDB way here.

The couch way is that every node has a seq, and when it receives updates, it stores them with its own sequence number. (i.e. your 'seq = 1023' is not necessarily the same update as my 'seq = 1023')

Scuttlebutt is different: each update has a seq that is associated with the node that created it, and that is preserved even after it's been copied to another node.

The couch way is simple to work with when you're not replicating, but the scuttlebutt method makes the handshake very simple (when two nodes meet, they can find out what the other one doesn't know with a single exchange).
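A toy illustration of the local-seq idea, not CouchDB's actual implementation. `LocalSeqStore` is a made-up name, and the store just renumbers whatever it receives:

```python
class LocalSeqStore:
    """Hypothetical node that stamps each received update with its own seq."""

    def __init__(self):
        self.seq = 0   # this node's local counter
        self.log = []  # updates in the order this node saw them

    def receive(self, doc_id, data):
        """Store the update under the next local sequence number."""
        self.seq += 1
        self.log.append({"seq": self.seq, "id": doc_id, "data": data})
        return self.seq


mine, yours = LocalSeqStore(), LocalSeqStore()
# The same two updates arrive in a different order on each node.
mine.receive("x", "update x")
mine.receive("y", "update y")
yours.receive("y", "update y")
yours.receive("x", "update x")
```

Both nodes now have a seq 1, but it points at a different update on each of them.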
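A simplified sketch of that single-exchange handshake, assuming each node keeps a map of the highest seq it has seen per originating node. The `Node` class and `handshake` function are illustrative only and leave out everything else a real scuttlebutt implementation does:

```python
class Node:
    """Hypothetical node that keeps the originating node's id and seq."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.updates = []  # (origin_id, seq, data), in the order stored
        self.clock = {}    # origin_id -> highest seq known from that origin

    def write(self, data):
        """Create a local update with the next seq for this node."""
        seq = self.clock.get(self.node_id, 0) + 1
        self.store((self.node_id, seq, data))

    def store(self, update):
        """Keep the update only if it is newer than what we already know."""
        origin, seq, _ = update
        if seq > self.clock.get(origin, 0):
            self.updates.append(update)
            self.clock[origin] = seq

    def missing_for(self, peer_clock):
        """Updates the peer has not seen, judged by the peer's clock."""
        return [u for u in self.updates if u[1] > peer_clock.get(u[0], 0)]


def handshake(a, b):
    """Each side sends its clock once and receives what it lacks."""
    a_clock, b_clock = dict(a.clock), dict(b.clock)
    for update in b.missing_for(a_clock):
        a.store(update)
    for update in a.missing_for(b_clock):
        b.store(update)
```

Because the (origin, seq) pair travels with the update, comparing two small clocks is enough to work out what each side is missing.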

I don't completely understand how the couchdb handshake works...

Hmm, I think I should write up what I know about replication protocols.

mikeal commented 11 years ago

so, I think there are two different replications we're talking about and while they are different I think we can find a way to do both without them conflicting. the first is inter-node replication and the second is full database replication (cloning the entire dataset shared among nodes).

so, let's say that 4 nodes all have part of a larger set of data. when nodes talk to each other the regular sequence index is fine because they know when connecting to each node's SLEEP endpoint that it is unique and that they will have their own individual sequence indexes.

now, say we want to expose the entire set of data, shared by 4 nodes, to someone like @maxogden to replicate and play around with. if we make the since= option optional for server implementers then this is as easy as collating all 4 nodes' SLEEP endpoints, prefixing the ids using an array [nodeName, uuid], and incrementing the exposed sequence index atomically for every entry returned.
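A rough sketch of that collation step, assuming each node's feed is already available as a list of entries. The `aggregate` function and the entry fields are hypothetical, and a real endpoint would fetch the feeds over HTTP:

```python
from itertools import count


def aggregate(node_feeds):
    """Collate per-node feeds into one feed with a fresh increasing seq.

    node_feeds maps a node name to that node's list of entries.
    """
    next_seq = count(1)  # the exposed sequence index for the aggregate
    combined = []
    for node_name, entries in node_feeds.items():
        for entry in entries:
            combined.append({
                "id": [node_name, entry["id"]],  # prefix id with the node
                "seq": next(next_seq),
                "data": entry["data"],
            })
    return combined


feeds = {
    "node1": [{"id": "u1", "seq": 7, "data": "a"}],
    "node2": [{"id": "u1", "seq": 3, "data": "b"},
              {"id": "u2", "seq": 4, "data": "c"}],
}
combined = aggregate(feeds)
```

The per-node seqs are dropped from the output and the array id keeps entries from different nodes distinct even when their uuids collide.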

this aggregate endpoint could support since= if it stored a small record of the per-node sequences for each aggregate sequence, but this is an optimization. the most important thing is that someone can clone the whole dataset with SLEEP. we can write extensions later that allow for more fine grained node and sharding information.
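One possible reading of that optimization, sketched under the assumption that the aggregator can keep a small persistent index mapping each aggregate seq to a (node name, per-node seq) pair. The `Aggregate` class is hypothetical and uses in-memory lists in place of real SLEEP endpoints:

```python
class Aggregate:
    """Hypothetical aggregate endpoint that can answer since= requests."""

    def __init__(self, feeds):
        self.feeds = feeds  # node name -> list of entries with per-node seq
        self.index = []     # index[i] = (node name, node seq) for seq i + 1
        self.cursors = {name: 0 for name in feeds}  # last seq seen per node

    def sync(self):
        """Record any entries the nodes have added since the last sync."""
        for name, entries in self.feeds.items():
            for entry in entries:
                if entry["seq"] > self.cursors[name]:
                    self.index.append((name, entry["seq"]))
                    self.cursors[name] = entry["seq"]

    def changes(self, since=0):
        """Return aggregate entries with seq greater than `since`."""
        self.sync()
        out = []
        pointers = enumerate(self.index[since:], start=since + 1)
        for agg_seq, (name, node_seq) in pointers:
            entry = next(e for e in self.feeds[name] if e["seq"] == node_seq)
            out.append({"id": [name, entry["id"]], "seq": agg_seq,
                        "data": entry["data"]})
        return out
```

Only the small pointer records are stored by the aggregator. The data itself stays on the nodes, and an aggregate seq keeps its meaning once it has been handed out.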

SLEEP should enable easier "up to date" semantics when possible but for the smallest and simplest use cases it should scale down to something as simple as "serialize an object and return it with SLEEP" which means that strictly storing a sequence index should not be required.

mikeal / SLEEP

node_id ? #2