streamnative / bookkeeper-achieved

Apache Bookkeeper
https://bookkeeper.apache.org
Apache License 2.0
3 stars 2 forks source link

ISSUE-2615: Ledger recovery can fail due to invalid ensemble creation #337

Closed sijie closed 3 years ago

sijie commented 3 years ago

Original Issue: apache/bookkeeper#2615


I have found a sequence of events that could lead to repeated failure of a ledger recovery process that would lead to data unavailability. I have verified this with both my TLA+ specification (https://github.com/Vanlightly/bookkeeper-tlaplus) of the replication protocol and new unit tests.

BUG REPORT

When a second writer performs recovery, it can end up trying to create an invalid ensemble which will cause recovery to fail.

The following example involves no timeouts on LAC reads, but an empty ensemble returning -1 for LAC reads. However, the same result would follow if LAC reads timed out, causing the value of -1 to be used as the default.

participant w1
participant w2
participant b1
participant b2
participant b3
participant b4
participant b5
participant b6
participant zk

note over w1,zk:Ensembles: 0:{b1,b2}, 1000:{b2,b3}
w1->b1: e1999
w1->b2: e1999
b1->w1: ack e1999
b2->w1: ack e1999
w1--xb1: e2000
w1--xb2: e2000
w1->zk:Add ensemble 2000:{b4,b5}
w2->zk:Start recovery
w2->b4: Read LAC
w2->b5: Read LAC
b4->w2: LAC -1
b5->w2: LAC -1
w2->b1: Read e0
w2->b2: Read e0
b1->w2:e0
b2->w2:e0
w2->b4:e0
w2--xb5:e0
b4->w2:ack e0
w2->zk:Add ensemble 0:{b4,b6}
note over w2,zk:INVALID ENSEMBLE

Any ledger recovery is vulnerable to this situation if the ledger has at least 1 existing ensemble, a default of -1 is used for the LAC and an error occurs during the write-back phase causing a new ensemble to be created (from entry 0).

THE FIX

The fix is to not use -1 (or 0 in the TLA+ spec) as the minimum but to take the first entry id of the current ensemble - 1 as the minimum. This ensures we only try to recover the current ensemble. Previous ensembles, if any, have already been committed and so recovery reads/writes of those ensembles is unnecessary. If a failure occurs during recovery of the current ensemble, then updating that ensemble is a legal operation.

The TLA+ specification uses this new minimum value and so the spec will not currently reach this illegal state.

I will shortly submit a PR with the code fix and new unit tests.