yuer1727 opened 5 years ago
No, not planned.
Though if you're interested, I'm still open to contributions, provided you can work it out in a simple way.
Here's a simple idea for clustering; correct me if I am missing something:
Implement a "proxy" in front of 2+ Sonic nodes which would do two things: mirror every write command (PUSH, POP, FLUSH*) to all nodes, and load-balance read commands (QUERY, SUGGEST) across them (a rough sketch follows below).
Bonus badass points for every node being able to act as such a proxy, elected by a consensus algorithm, instead of a proxy appointed once (which would become a single point of failure).
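To make that concrete, here is a minimal sketch of such a proxy, assuming plain Sonic Channel TCP connections; the addresses, the round-robin read balancing and the omission of the START handshake are simplifications for illustration, not an existing implementation:

```rust
use std::io::Write;
use std::net::TcpStream;

// Illustrative only: mirror write commands to every node, send reads to one.
struct Proxy {
    nodes: Vec<String>, // e.g. "10.0.0.1:1491", "10.0.0.2:1491" (hypothetical)
    next_read: usize,
}

impl Proxy {
    // Write commands (PUSH, POP, FLUSH*) are fanned out to all nodes.
    fn forward_write(&self, command: &str) -> std::io::Result<()> {
        for node in &self.nodes {
            let mut stream = TcpStream::connect(node)?;
            stream.write_all(command.as_bytes())?;
            stream.write_all(b"\r\n")?;
        }

        Ok(())
    }

    // Read commands (QUERY, SUGGEST) go to a single node, round-robin.
    fn forward_read(&mut self, command: &str) -> std::io::Result<()> {
        let index = self.next_read % self.nodes.len();
        self.next_read += 1;

        let mut stream = TcpStream::connect(&self.nodes[index])?;
        stream.write_all(command.as_bytes())?;
        stream.write_all(b"\r\n")?;

        // Reading and relaying the node's response is omitted for brevity.
        Ok(())
    }
}
```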
That would be a pragmatic solution, but I'd be more open to a built-in replication system in Sonic, one that could use the Sonic Channel protocol to 'SYNC' between nodes.
With the proxy solution, how would you implement synchronization of 'lost' commands when a given node goes offline for a few seconds? Same as for new nodes, with a full replay from the point it lost synchronization? A replay method would also imply that the proxy stores all executed commands over time, and ideally periodically 'compacts' them, e.g. when a PUSH is followed by a FLUSHB or FLUSHO that cancels out the earlier PUSH.
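For illustration, here is a rough sketch (not actual Sonic code) of that compaction idea, simplified to bucket-level PUSH and FLUSHB commands:

```rust
// Hypothetical replay-log compaction: a FLUSHB makes any earlier command on
// the same bucket redundant, so the log can be shrunk before a replay.
#[derive(Clone, Debug, PartialEq)]
enum LoggedCommand {
    Push { bucket: String, object: String, text: String },
    FlushB { bucket: String },
}

fn compact(log: &[LoggedCommand]) -> Vec<LoggedCommand> {
    let mut compacted: Vec<LoggedCommand> = Vec::new();

    for command in log {
        if let LoggedCommand::FlushB { bucket } = command {
            // Drop every earlier command that targeted the flushed bucket.
            compacted.retain(|earlier| match earlier {
                LoggedCommand::Push { bucket: b, .. } => b != bucket,
                LoggedCommand::FlushB { bucket: b } => b != bucket,
            });
        }

        compacted.push(command.clone());
    }

    compacted
}

fn main() {
    let log = vec![
        LoggedCommand::Push {
            bucket: "user:1".into(),
            object: "conversation:1".into(),
            text: "hello world".into(),
        },
        LoggedCommand::FlushB { bucket: "user:1".into() },
    ];

    // The PUSH is cancelled out by the later FLUSHB; only the FLUSHB remains.
    assert_eq!(
        compact(&log),
        vec![LoggedCommand::FlushB { bucket: "user:1".into() }]
    );
}
```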
A built-in Sonic clustering protocol could use database 'snapshots' to handle recoveries (I think RocksDB natively supports this), and only re-synchronize the latest part that's been lost on a given node due to downtime. I'll read more on how Redis and others do this as they have native clustering.
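For the snapshot part, recent versions of the rust-rocksdb crate expose RocksDB's checkpoint API, which could serve as that primitive. A minimal sketch, with placeholder paths:

```rust
use rocksdb::{checkpoint::Checkpoint, DB};

// Create a consistent on-disk copy of the KV store (mostly hard links when on
// the same filesystem), which a recovering node could be re-seeded from before
// replaying only the commands it missed while offline.
fn snapshot_kv_store(db_path: &str, snapshot_path: &str) -> Result<(), rocksdb::Error> {
    let db = DB::open_default(db_path)?;

    let checkpoint = Checkpoint::new(&db)?;
    checkpoint.create_checkpoint(snapshot_path)
}
```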
A nice side effect of clustering is that one could build a large Sonic-based infrastructure, with several "ingestor" nodes and several "search" nodes.
I like the things you are proposing!
In any case, clustering for fault tolerance is a requirement for production use, at least for us.
May I suggest re-opening this issue, even just as a roadmap item?
Sure, leaving it open now.
I think the Raft consensus algorithm would be a great addition to Sonic.
Actix Raft could be integrated with Sonic to add clustering capability.
The distributed search idea could also be borrowed from Bleve, which is used by Couchbase.
We see many products built on RocksDB use Raft successfully (CockroachDB, ArangoDB, etc.).
For simplicity's sake, the replication protocol will be inspired by the Redis replication protocol: a primary write master, followed by N read slaves (i.e. read-only). If the master goes down, reads remain available on the slaves but writes are rejected. When the master is recovered, the slaves catch up to the master's binlog and writes can be resumed by the connected libraries (possibly from a queue).
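Roughly, the behavior described above could boil down to something like this (types are made up for the example, not an existing Sonic API):

```rust
// Illustrative sketch of the Redis-style role handling described above.
#[derive(Clone, Copy, PartialEq)]
enum NodeRole {
    Master,
    Slave,
}

#[derive(Debug)]
enum WriteError {
    ReadOnlyReplica,
    MasterUnavailable,
}

struct Node {
    role: NodeRole,
    master_reachable: bool,
}

impl Node {
    // Writes are only accepted on the master; slaves reject them so the
    // connected library can queue and retry once the master is back up.
    fn accept_write(&self) -> Result<(), WriteError> {
        match self.role {
            NodeRole::Master => Ok(()),
            NodeRole::Slave if !self.master_reachable => Err(WriteError::MasterUnavailable),
            NodeRole::Slave => Err(WriteError::ReadOnlyReplica),
        }
    }

    // Reads stay available on slaves, even while the master is down.
    fn accept_read(&self) -> bool {
        true
    }
}

fn main() {
    let slave = Node { role: NodeRole::Slave, master_reachable: false };

    assert!(slave.accept_read());
    assert!(slave.accept_write().is_err());
}
```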
My two cents: you may consider the CRDT model used by Redis replication. 1) Active-active database replication ensures writes are never rejected. 2) All the servers are used, so you make the most of them. 3) Since there are no transactions and it is only search data, there is no risk of data loss.
As a stop-gap workaround for missing clustering functionality, would it be possible to use a distributed file system for synchronizing changes between multiple Sonic instances and then (via a setting, somewhere) configure only one of the Sonic instances to have write access, leaving the other ones to be read-only?
I don't know if it would work to have multiple Sonic instances reading from the same files on disk, or what would happen to one Sonic instance when another instance writes to the files, but if that part works somewhat, it should "only" be a matter of implementing a read-only setting for this to work, I think.
That would unfortunately not work, as RocksDB holds a LOCK file on the file system. Unsure if RocksDB can be started in read-only mode, ignoring any LOCK file. Can you provide more reference on that? If it's easily done, then it's a quick win.
However, properly-done clustering needs full-data replication, meaning that a second independent server holds a full copy of the data on a different file system. That way, if the master fails, the slave can be elected as master, and the old master can be resynchronized from the old slave in case of total data loss on one replica.
It looks like RocksDB can be opened in read-only mode. There's a page here that specifically talks about this: https://github.com/facebook/rocksdb/wiki/Read-only-and-Secondary-instances.
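For what it's worth, recent versions of the rust-rocksdb crate expose both modes described on that page. A minimal sketch of the 'secondary instance' variant, which ignores the primary's LOCK file and can catch up with its WAL (paths are placeholders):

```rust
use rocksdb::{Options, DB};

// Open a read-only follower of an existing RocksDB database. The secondary
// keeps its own directory for logs/metadata and never writes to the primary's
// data files.
fn open_follower(primary_path: &str, secondary_path: &str) -> Result<DB, rocksdb::Error> {
    let opts = Options::default();
    let db = DB::open_as_secondary(&opts, primary_path, secondary_path)?;

    // Call this periodically to replay the primary's WAL and observe new writes.
    db.try_catch_up_with_primary()?;

    Ok(db)
}
```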
Excellent, that can work then.
Maybe support databases like TiDB (built on RocksDB) and extend the features of TiDB's cluster/replication.
Would it be worth it to move this to the 1.5.0 milestone? I am open to taking a look at this and sending a PR based on this comment.
Sure! Totally open to moving this to an earlier milestone. Open to PRs on this, thank you so much!
I looked at the code a bit this evening. As I understand it, all the DB references (whether KV or FST) are opened lazily using the acquire functions in StoreGenericPool,
which the specific implementations call when there is an executor action requiring the use of the store. Tasker also accesses the stores lazily when doing its tasks.
I am assuming it's straightforward enough to utilize the RocksDB secondary replicas, as they allow catch-up. When I say read-only below, I mean the Sonic replica, not a RocksDB read-only replica.
My main question for now is about syncing the FST store: that doesn't come for free like RocksDB syncing does. The executor currently keeps the KV and FST stores in sync by performing pushes/pops/other operations on both stores in tandem. The writer would continue to do so, but for readers, should there be an evented system to broadcast all successfully completed write operations to them? On one hand, this approach sounds really complicated to implement, and I wonder if constructing the FST by looking up all the keys in the synced KV store is an option on the reader's side instead.
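For the second option, a minimal sketch of rebuilding a suggestion set with the fst crate (assuming the fst 0.4 API and that the reader side can enumerate the indexed words from its synced KV store):

```rust
use fst::{Set, SetBuilder};

// Rebuild an FST set from words enumerated out of the synced KV store.
fn rebuild_fst(mut words: Vec<String>) -> Result<Set<Vec<u8>>, fst::Error> {
    // The builder requires keys in lexicographic order, without duplicates.
    words.sort();
    words.dedup();

    let mut builder = SetBuilder::memory();

    for word in &words {
        builder.insert(word)?;
    }

    Ok(builder.into_set())
}
```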
Anyone else here who has thoughts about this? (I am a newbie 😄 in both Rust and Sonic).
Hello there,
On the FST thing, as they are guaranteed to be limited to a max size per FST, we could think about having a lightweight replication protocol work out insertions/deletions in the FST. If the replication loses sync, we could stream the modified FST files from the master disk to the slave disk, using checksums for comparison. Note however that, to guarantee integrity, this would also require sending the not-yet-consolidated FST operations to the slave, so that the next CONSOLIDATE on master and slave results in the same FST file being updated with the same pending operations.
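To make that more concrete, the messages of such a lightweight protocol could look roughly like this (purely illustrative shapes, not an existing Sonic protocol):

```rust
// Illustrative only: message shapes for a lightweight FST replication protocol.
enum FstReplicationMessage {
    // Incremental path: replay word insertions/removals on the slave.
    InsertWord { bucket: String, word: String },
    RemoveWord { bucket: String, word: String },

    // Recovery path: stream the consolidated FST file from the master's disk,
    // with a checksum so the slave can verify (or skip) the transfer.
    StreamFst { bucket: String, checksum: [u8; 32], bytes: Vec<u8> },

    // Also ship the not-yet-consolidated operations, so that the next
    // CONSOLIDATE produces the exact same FST file on both sides.
    PendingOps { bucket: String, insertions: Vec<String>, removals: Vec<String> },
}
```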
Does that make sense?
Hello valeriansaliou: do you have any plan to develop a multi-node version? I mean that high availability and scalability are important for Sonic to become popular and be used in production environments.