huahaiy opened 4 years ago
Hi, have you considered FoundationDB as a storage backend to achieve this?
We would like to retain the lightweight nature of this library even when we go distributed, so we will probably keep LMDB for local data storage. FoundationDB could be leveraged for cluster coordination. Combining the two has been done before, e.g. https://forums.foundationdb.org/t/success-story-foundationdb-at-skuvault/336/4
So we will definitely consider FoundationDB as a possible option for implementing distributed mode.
FoundationDB seems like a big dependency to add.
How about using a Clojure/Java Raft library implementation? https://raft.github.io/#implementations
The original plan is to use a Java Raft library, but I am open to other possibilities.
The existing Raft implementations all have some weaknesses, so I will probably end up implementing my own Raft in Clojure, tailored to the needs of Datalevin. Another option is to borrow some ideas from FoundationDB.
The current implementation is a standalone mode; we will add a distributed mode to allow data replication across multiple nodes.
Design criteria:
Strong consistency. CP in terms of the CAP theorem: transactions have a consistent total order; support linearizable reads; support dynamic cluster membership. Pass Jepsen tests.
We choose CP because a fast-failing (unavailable) system is simpler to program around than a system that sometimes produces wrong results. Simplicity for users is the main design objective for Datalevin. All of our design choices (Datalog, mutable DB, and CP) are consistent with this goal.
The implementation will use the Raft consensus algorithm:
Any node can read/write.
Writes happen in total order: a write first goes to the leader; the leader appends it to a transaction log, sends it to the followers, waits for quorum confirmation, then commits the write and reports success. Writes are unavailable if a quorum cannot be reached (see the write-path sketch after this list).
Users can choose one of three read consistency levels (see the read sketch after this list):
By default, a read goes to the leader to check whether this node has the latest data; the leader asks the followers to obtain quorum confirmation and then replies to the read requester. If confirmed, the requester reads from its local LMDB. This provides linearizable reads.
If leader lease is enabled, the leader doesn't ask the followers but replies directly. This level requires the nodes to have clock synchronization. It skips the quorum confirmation, so it is a somewhat faster way of achieving linearizable reads.
Optionally, if the user doesn't mind reading stale data, she can choose to bypass the leader and read from the local LMDB directly. This has the same read speed as standalone mode, but the data may be outdated.
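To make the write path concrete, here is a minimal Clojure sketch, assuming an in-memory node representation and a local stand-in for network replication; none of the names here come from Datalevin's codebase.

```clojure
;; Illustrative sketch only, not Datalevin's implementation. Each node is
;; an atom holding {:log [] :committed []}; "replication" is a local call
;; standing in for a network round trip that could fail.

(defn make-node [] (atom {:log [] :committed []}))

(defn append-log! [node tx] (swap! node update :log conj tx) true)

(defn commit! [node tx] (swap! node update :committed conj tx))

(defn replicate-to
  "Send tx to a follower; returns true when the follower acknowledges."
  [follower tx]
  (append-log! follower tx))

(defn quorum?
  "True when acknowledgements come from a majority of the cluster."
  [acks cluster-size]
  (> acks (quot cluster-size 2)))

(defn handle-write
  "Leader-side write path: append to the leader's log, replicate to the
  followers, and commit only after a quorum confirms. Returns :committed,
  or :unavailable when no quorum can be reached."
  [leader followers tx]
  (append-log! leader tx)
  (let [acks (inc (count (filter true? (map #(replicate-to % tx) followers))))]
    (if (quorum? acks (inc (count followers)))
      (do (commit! leader tx) :committed)
      :unavailable)))

(comment
  (def leader    (make-node))
  (def followers [(make-node) (make-node)])
  (handle-write leader followers [:db/add 1 :name "a"]) ;=> :committed
  )
```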
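And a sketch of the three read levels, building on the node representation above; still-leader? and lease-valid? are hypothetical placeholder predicates for the quorum check and the clock-based lease check, not Datalevin's API.

```clojure
;; Illustrative sketch of the three read consistency levels described above.

(defn read-local
  "Read from the node's local committed state (standing in for local LMDB)."
  [node]
  (:committed @node))

(defn handle-read
  [node level {:keys [still-leader? lease-valid?]}]
  (case level
    :linearizable (when (still-leader?) (read-local node)) ; quorum-confirmed read
    :leader-lease (when (lease-valid?) (read-local node))  ; skip the quorum round trip
    :local        (read-local node)))                      ; fastest, possibly stale

(comment
  (handle-read leader :local {}) ;=> committed txs, possibly stale on a follower
  )
```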
Raft is a much better solution than the designated transactor concept of Datomic. In Raft, the leader is elected, not fixed. With Raft, the same transaction total order is achieved without the cost and complexity of operating designated transactors. Even with a standby, transactors are still a single point of failure.
Also, Datomic doesn't seem to have a mechanism to ensure linearizable reads. "Database as a value" doesn't say the value is the latest version; it could well be an outdated version. The main supported storage backend of Datomic, DynamoDB, is AP only, so consistency is not guaranteed.
We will apply Raft globally in the cluster, which means that the cluster size is bounded. Since sharding should be handled at the application level, where the application has more context, the database should not automate sharding. So if unbounded scaling is needed, instead of "just adding more nodes", run more clusters (a sketch of application-level cluster routing follows).
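A minimal sketch of what "run more clusters" could look like on the application side; the cluster names and routing key are hypothetical and the routing policy is entirely up to the application.

```clojure
;; Illustrative sketch only. The application, not the database, decides
;; how data is partitioned across independent clusters.

(def clusters ["cluster-a" "cluster-b" "cluster-c"])

(defn cluster-for
  "Route an entity to one of several independent clusters by hashing a
  routing key (e.g. a customer id)."
  [routing-key]
  (nth clusters (mod (hash routing-key) (count clusters))))

(comment
  (cluster-for "customer-42") ;=> "cluster-b", for example
  )
```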