sofastack / sofa-jraft

A production-grade java implementation of RAFT consensus algorithm.
https://www.sofastack.tech/projects/sofa-jraft/
Apache License 2.0
3.57k stars 1.14k forks source link

Question about split command of RheaKV #851

Open alievmirza opened 2 years ago

alievmirza commented 2 years ago

Your question

I have a question about RheaKV architecture.

This is unclear to me, how the following situation is handled:

Imagine that we have several puts (let's call them [put1, put2]) to some range A (to be more precise, these puts are related to the right part of range A, let's say that this range is called A2), but these commands haven't been committed to the corresponding raft group of a Region.

At that moment we receive split command, so range A will be splitted to A1 and A2, which means that a new raft group will be created for A2.

At this time we have new puts ([put3, put4]) that are related to the range A2, and to a new raft group for that range. How will the uncommitted puts [put1, put2] be handled? To be more precise, in which order put1, put2, put3 and put4 will be applied and saved to RocksRawKVStore?

I was expected, that before we update the info about new ranges in PD, we somehow ensure that all uncommitted changes before the split command will be applied to SM of replicas, but I couldn't find a place in the code with such logic.

Seems, that I miss some key ideas of replication layer of RheaKV, I would be glad if you could help me to understand this case. Thank you in advance!

fengjiachun commented 2 years ago

Hi @alievmirza

The split implementation is here. rheakv is splitting by sending and executing a raft log, and we know that raft logs are executed sequentially, so whether put [put1, put2] is executed or not, their relative order in the raft log and the split command is the same in all nodes.

alievmirza commented 2 years ago

@fengjiachun Thank you for your answer!

Let me a bit change my example. Imagine, that we have [put1, put2] right after we applied split task to raft log, and we also have [put3, put4] right after we updated RegionRouteTable in PD (here) means that we have already created new raft group for new region, and those [put3, put4] will be handled by new raft group. But what will happen with [put1, put2]? As far as I can understand, they will be handled by the old raft group of the whole region, hence we don't now the order of applying [put1, put2, put3, put4]? What am I missing?

Also, could you please explain the purpose of com.alipay.sofa.jraft.rhea.storage.RocksRawKVStore#initFencingToken?

fengjiachun commented 2 years ago

You mean to execute in the following order?

  1. do split
  2. [put1, put2]
  3. [put3, put4]

If I got that, you are asking how else can we guarantee such an execution order under such conditions: [put1,put2,put3,put4]

My answer is that it is not guaranteed, because the sequential nature is only valid for the same raft group,and we should not rely on this sequential nature.

FencingToken is for DistributedLock, and the value(token) of the newly region must be larger than its father's value for the distributed lock to work fine. Some details here

alievmirza commented 2 years ago

@fengjiachun Thank you for your answer! You've got my idea right.

Correct me if I'm wrong, does it mean, that from a client's perspective, it is possible that its consecutive puts for the one key could be reordered, if PD decides to make split of region?

I was wondering if you know whether or not TiKV has the same behaviour when split happens?

fengjiachun commented 2 years ago

@alievmirza

Now I really understand what you mean.

What I mean is: [put1, put2, put3, put4] -> [put(key1), put(key2), put(key3), put(key4)]

If [put1, put2, put3, put4] is on the same key, then the order must be guaranteed and the splitting does not affect the apply order of the same key

So, that is a bug, [put1, put2] will be written to the old raft group.

alievmirza commented 2 years ago

@fengjiachun

So, to sum up, does my case with several puts on the same key after split can break consistency in the current implementation in RheaKV? Should I create a separate issue for that?

fengjiachun commented 2 years ago

Please create a separate issue for that, Thank you!