thewesch / ba-roadster-doc

Bachelor Thesis Documentation

design persistence synchronization protocol #10

Closed mmschuler closed 8 years ago

mmschuler commented 8 years ago
paddor commented 8 years ago

One of the possible variants would use PUB/SUB to bubble TC updates up to the supernode(s).
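The PUB/SUB variant can be sketched roughly as follows. This is an illustrative example, not Roadster code: the `inproc://updates` endpoint, the `tc.update` topic, and the message layout are made up for the sketch. The retry loop covers ZeroMQ's "slow joiner" window, during which a PUB socket drops messages for subscriptions that have not yet propagated.

```python
# Sketch: a node PUBlishes a TC update, a supernode SUBscribes to it.
# Endpoint name and topic are hypothetical; inproc keeps it in one process.
import zmq

ctx = zmq.Context.instance()

supernode = ctx.socket(zmq.SUB)
supernode.bind("inproc://updates")
supernode.setsockopt(zmq.SUBSCRIBE, b"tc.update")
supernode.rcvtimeo = 100  # ms; don't block forever in this demo

node = ctx.socket(zmq.PUB)
node.connect("inproc://updates")

# PUB/SUB is fire-and-forget: resend until the subscription has propagated
# (the "slow joiner" symptom), then the update arrives on the supernode.
msg = None
for _ in range(50):
    node.send_multipart([b"tc.update", b"key=42", b"value=3.14"])
    try:
        msg = supernode.recv_multipart()
        break
    except zmq.Again:
        continue

print(msg)  # e.g. [b'tc.update', b'key=42', b'value=3.14']
```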

@arohr What kind of hardware will Roadster run on? What are the resources on the smallest of nodes? How much main memory will there be available?

This is to define a reasonable default for ZMQ_RECOVERY_IVL (multicast recovery interval), which is 10 seconds by default. To make an accurate estimate, we'd like to know the expected memory pressure and the typical traffic volume. If set too high, available main memory might turn out to be an issue in case a client is disconnected for too long.
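To make the memory concern concrete: for multicast transports, the sender must buffer everything it might have to retransmit within the recovery interval, so the recovery buffer grows roughly with data rate times interval. A small sketch with assumed numbers (not Roadster's actual rates):

```python
# Sketch of the sizing consideration behind ZMQ_RECOVERY_IVL.
# The rate and interval values here are assumptions for illustration.
import zmq

ctx = zmq.Context.instance()
pub = ctx.socket(zmq.PUB)

pub.setsockopt(zmq.RATE, 1000)           # multicast rate cap, kbit/s
pub.setsockopt(zmq.RECOVERY_IVL, 10000)  # recovery interval, ms (the default)

rate_kbps = pub.getsockopt(zmq.RATE)
ivl_ms = pub.getsockopt(zmq.RECOVERY_IVL)

# Rough upper bound on the recovery buffer:
# rate (kbit/s) * interval (s) / 8 -> kilobytes per multicast publisher
buffer_kb = rate_kbps * (ivl_ms / 1000) / 8
print(buffer_kb)  # 1250.0
```

At these (assumed) settings, each multicast publisher would need on the order of 1.25 MB of recovery buffer; a longer interval or higher rate scales this linearly, which is why memory on the smallest nodes matters.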

Even if we won't actually set a non-default value for this option, it'd be nice to know what kind of nodes Roadster will be running on.

paddor commented 8 years ago

@arohr How important would you classify real-time synchronization (bubbling updates up towards the root)? On a scale of 0 to 10, where 0 means "don't care", 5 is "nice to have", and 10 is "mandatory", what would it be?

arohr commented 8 years ago

Hardware: Typically this is standard entry-level server hardware (e.g. http://www.fujitsu.com/fts/products/computing/servers/primergy/rack/rx1330m2/) or industrial box PCs (e.g. http://nexcom.eu/Products/industrial-computing-solutions/industrial-fanless-computer/core-i-performance/fanless-pc-fanless-computer-nise-3600e) with 4-8 GB RAM.

For smaller systems Atom-CPU based box PCs are used (e.g. http://nexcom.eu/Products/industrial-computing-solutions/industrial-fanless-computer/atom-compact/fanless-computer-nise-105-105a).

For box PCs we typically use one or two industrial grade SSDs (two for software-RAID level 1), so we can build reliable systems without any moving parts (e.g. https://www.syslogic.com/eng/cactus-910s-series-2-5-drive-41285.shtml?parentPageId=58532).

arohr commented 8 years ago

Real-time sync: real-time synchronization of persistent data (time series, event journals) is mandatory (give it a 10 ;-)), but the question is what "real-time" means here. I would say it is sufficient if it takes up to 30 seconds for new data to make it to the root node.

paddor commented 8 years ago

Thanks for the info. By real-time, we mean "no artificial pauses like sleeps", so a simple, periodic polling mechanism is out of the question. We aim for a solution that publishes updates immediately. However, how much time the update messages spend in the sender's/receiver's queue, especially if there is a lot of traffic on the link, is undefined, so it can still take a few seconds until an update is persisted on the supernode. Taking an educated guess, given that there is usually low traffic for persisted data, I'd say it should be almost instant. Of course, we'll have to measure this.

arohr commented 8 years ago

Hm. Real-time means to me "a guaranteed reaction of a system within a defined time".

I believe designing a protocol the way you propose is much harder than a simple polling mechanism. Since very low latency is not really a requirement, I would think twice about going this way. But of course, I may just not have THE clever idea in mind...

paddor commented 8 years ago

Good point. That could be the correct meaning. What I described is maybe just called "event-driven". But I don't think we can make a true guarantee on a hard limit as long as a garbage collector and standard networking equipment are involved.

Our idea was this: Very similar to the Clone State Pattern used for DIM synchronization (which uses DEALER-ROUTER, then PUB-SUB), we plan to use DEALER-ROUTER to get the initial delta, and then PUB-SUB for live updates (possibly to two supernodes). This is described as variant 3 for persistence synchronization on/around page 26 in the current version of the document.
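A minimal single-process sketch of that two-phase idea, with assumed endpoint names and message formats (the real protocol, message framing, and which side plays which role are defined in the document, not here): the catching-up side fetches the initial delta over DEALER-ROUTER, then stays current via PUB-SUB.

```python
# Sketch: DEALER-ROUTER for the initial delta, then PUB-SUB for live
# updates. Endpoints and payloads are hypothetical; inproc keeps it
# in one process.
import zmq

ctx = zmq.Context.instance()

# Side holding the data
snapshot_srv = ctx.socket(zmq.ROUTER)
snapshot_srv.bind("inproc://snapshot")
publisher = ctx.socket(zmq.PUB)
publisher.bind("inproc://updates")

# Side catching up
snapshot_cli = ctx.socket(zmq.DEALER)
snapshot_cli.connect("inproc://snapshot")
subscriber = ctx.socket(zmq.SUB)
subscriber.connect("inproc://updates")
subscriber.setsockopt(zmq.SUBSCRIBE, b"")
subscriber.rcvtimeo = 100  # ms

# 1) Request the delta since the last known state.
snapshot_cli.send(b"DELTA-SINCE 1000")
identity, request = snapshot_srv.recv_multipart()
snapshot_srv.send_multipart([identity, b"delta: 3 records"])
snapshot = snapshot_cli.recv()

# 2) From then on, updates arrive live via PUB-SUB (retry loop to
# cover the slow-joiner window).
update = None
for _ in range(50):
    publisher.send(b"update: record 1001")
    try:
        update = subscriber.recv()
        break
    except zmq.Again:
        continue

print(snapshot, update)
```

Note that the DEALER adds no empty delimiter frame (unlike REQ), so the ROUTER sees exactly two frames: the peer identity and the payload.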

Another drawback of the simple polling mechanism is that repeatedly calculating the diff could be computation-heavy, depending on how well TC handles this kind of scenario (iterating through all keys, skipping those older than X?). Of course, this could be considered premature optimization. If we go for polling, we'd have to use an interval of maybe 15 seconds, much shorter than I originally thought.
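The cost concern is that, absent an index on write time, every polling cycle is a full scan. A sketch in plain Python with a hypothetical record layout (how TC actually iterates and filters is exactly the open question above):

```python
# Sketch of what the polling variant would do each cycle. The record
# layout ("written_at" field, "ts:" key prefix) is made up for this
# example; TC's real iteration cost is the unknown here.
records = {
    "ts:1001": {"written_at": 100, "value": 1.0},
    "ts:1002": {"written_at": 160, "value": 2.0},
    "ts:1003": {"written_at": 170, "value": 3.0},
}

def poll_diff(store, last_sync):
    """Full scan: every key is visited, even when nothing changed."""
    return {k: v for k, v in store.items() if v["written_at"] > last_sync}

delta = poll_diff(records, last_sync=150)
print(sorted(delta))  # ['ts:1002', 'ts:1003']
```

With a 15-second interval this scan runs continuously in the background, which is why the diff cost matters even if each individual scan is cheap.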

Also, in case the supernode is an HA pair, this causes duplicate traffic. This is true for the PUB-SUB mechanism as well, but there we at least have the possibility to use multicast (and compression) in case we hit a bottleneck.