Loss of conflicting value on race condition

Let's assume we run with N=3, R=W=2. Let's fix some arbitrary key K and assume 
K is stored on nodes 1, 2 and 3, where 1 is primary master and 2 secondary 
master.

Now, let's assume the following race condition:

- Client A wants to PUT value A for K. Initial vector clock is *empty*. It 
chooses 1 as master and hence A is sent to 1 with vector clock (1:1). Node 1 
gracefully accepts A and persists it.
- Client B wants to PUT value B for K. Initial vector clock is *empty*. For 
some reason, node 1 is not reachable, so it chooses 2 as master and hence B is 
sent to 2 with vector clock (2:1). Node 2 gracefully accepts B and persists it.

Clearly A and B form a conflict and by principle A and B should be saved until 
"manually" resolved by some client (on read-repair).

Client A and B now start to persist the replicas on the other nodes. However, A 
is a bit slow, so client C comes along BEFORE A is persisted on nodes 2 or 3, 
whereas B is already persisted on nodes 1, 2, 3.

- Client C sends a GET request to node 2 and 3 and retrieves value B with 
vector clock (2:1) from both nodes. It does not notice the existence of 
conflicting A (because A is not yet visible on 2 or 3)

Just now, even A manages to store it's value on all replicas, so by now, all 
nodes store conflicting A and B with vector clocks (1:1) resp. (2:1). That's 
fine, the all conflicting values are persisted for future reconciliation.

- Client A and B gracefully finish their PUT requests on all replicas with no 
error message.

However, C now wants to update A, so:

- Client C wants to PUT value C for K with vector clock (2:1). It chooses 1 as 
master and hence C is sent to 1 with vector clock (1:1, 2:1). Node 1 gracefully 
accepts C and persists it -- it does not fire an OVE because C's vector clock 
is AFTER A's and B's because (1:1, 2:1) is AFTER  (1:1) and (2:1).

Hence, node 1, 2, 3 all persist C and delete the clearly obsolete B, but the 
wrongly assumed-to-be obsolete A.

- Client A and B gracefully finish their PUT request on all replicas with no 
error message.

So, the final configuration is that C is persisted at all nodes with vector 
clock (1:1, 2:1), where as A and B disappeared.

It's okay that B disappeared because C had the knowledge of B when writing up 
C. 

But, A just disappeared without any error message and is hence lost.

Original issue reported on code.google.com by derol...@googlemail.com on 19 Jul 2011 at 12:32

The root cause for the problem is that the "client" increments the vector clock 
on behalf of the "master" BEFORE the record is sent to the "master". This way, 
the "master" is unable to check whether the "original" vector clock was coming 
from a conflicting path:

- Conflcting clocks: (1:1) vs (2:1)
- Incrementing (1:1) clock on behalf of master 2, results into (1:1, 2:1), 
which is NOT conflicting with (2:1), but is AFTER (2:1)

Solution:
- Send original clock to master, PUT C with org. VC  OVC=(1:1)
- Master calculates new VC NVC=(1:1, 2:1).
- Master should raise some ConflictWouldBeLost exception IF there is an 
existing value with VC EVC which fulfills: OVC is CONCURRENT to EVC && NVC is 
AFTER EVC

OVC is CONCURRENT to EVC && NVC is AFTER EVC means that that C is actually in 
conflict with some existing value, but it would the existing value would be 
overwritten because NVC is after EVC.

Original comment by derol...@googlemail.com on 21 Jul 2011 at 9:18

nassim-git / project-voldemort

Loss of conflicting value on race condition #354