[Use Help] Is it reasonable to use raft to replicate data in a ultra-low latency storage system?

rpbear88 commented 3 years ago

Hi dragonboat,

We are building a ultra-low latency storage system which aims to make the whole IO path latency under 50us for 4KB workloads(the network is RDMA and disk is NVMe level SSD or optane SSD).

We benchmarked many raft implementations which is written with C++ and the performance is not good, some of them even cost more than 1ms for QD=1(very low workloads pressure). While some guys pointed out that maybe raft is not a good way to replicate data with 4KB or more big size, the sequencial logging period(writing WAL) can't leveage the fastest storage media since the media like NVMe SSD supports very high IO parallism and we just submit logs(even in batch) one after another.

I think there is same ways to overcome that issue, one of them I call it parallel WAL support. if we allowed that logs can be written into WAL not in a sequencial way, then WAL can leveage the underlie storage hardware.But if the logs comes to followers out-of-order, the submit will be hung since it can't figure out which WAL offset to record the later logs, if we want to overcome this problem, fixed log size maybe a good solution.

How can we make the log to be fixed size? I think one way is that we only record metadata in log without recording data, which means data and metadata shoule be replicated in different ways. We can replicate data in a very simple way like the leader just send request to followers, and followers write the data into final places(which requires the storage engine to do write in an out-of-place strategy).At the same time, leader try to replicate the metadata which describes the data with raft, the write is considered as successfull if and only if data and metadata are all replicated successful.

Do you think this is a right way to replicate data in an ultra-low latency storage system while still achieve data consistency?

lni commented 3 years ago

While some guys pointed that maybe raft is not a good way to replicate data with 4KB or more big size, the sequencial logging period(writing WAL) can't leveage the fastest storage media since the media like NVMe SSD supports very high IO parallism and we just submit logs(even in batch) one after another.

I don't think it is a Raft problem. For a single raft group, you don't have to complete the first Raft log write before starting the next Raft log write. Different implementations including Dragonboat may put restrictions for various reasons, but such implementation decisions have really nothing to do with the Raft protocol itself. Secondly, the high IO parallelism of your storage devices can be fully leveraged by running many Raft groups in parallel, that is actually how most Raft/Paxos based systems are built.

rpbear88 commented 3 years ago

I don't think it is a Raft problem. For a single raft group, you don't have to complete the first Raft log write before starting the next Raft log write

Yes, this is correct, the worst case is that if the log entry is not in fixed size, when the log are arrived at followers out-of-order, the logs which are arrived earlier can't be written into WAL since their offset in the WAL can't be calculated with missing entries. While I still believe this is useful since all the arrived entries can be logged without waiting previous WAL write operations.

Different implementations including Dragonboat may put restrictions for various reasons

I'm really curious about the reasons which bring those restrictions, would you please list some of them?

Secondly, the high IO parallelism of your storage devices can be fully leveraged by running many Raft groups in parallel, that is actually how most Raft/Paxos based systems are built.

Thanks, this is a very good point. It seems that if we take that way, when the underlie hardware becoming more and more powerful, it's highly likely that one node will have a huge number of raft groups. Since dragonboat has the multi-raft groups support, the performance issue should not become a big problem, I still wonder if there is any other major issue when the business system facing so many groups like the statistics which is helpful for workload balance? Any suggestions/guidances about this? thanks in advance.

lni / dragonboat

[Use Help] Is it reasonable to use raft to replicate data in a ultra-low latency storage system? #185