I have found a sequence of events that can cause ledger recovery to fail repeatedly, leaving the ledger's data unavailable. I have verified this with both my TLA+ specification (https://github.com/Vanlightly/bookkeeper-tlaplus) of the replication protocol and new unit tests.
BUG REPORT
When a second writer performs recovery, it can end up trying to create an invalid ensemble which will cause recovery to fail.
The following example involves no timeouts on LAC reads, but an empty ensemble returning -1 for LAC reads. However, the same result would follow if LAC reads timed out, causing the value of -1 to be used as the default.
Any ledger recovery is vulnerable to this situation if the ledger has at least one existing ensemble, a default of -1 is used for the LAC, and an error occurs during the write-back phase, causing a new ensemble to be created (from entry 0).
THE FIX
The fix is to not use -1 (or 0 in the TLA+ spec) as the minimum, but instead to use the first entry id of the current ensemble, minus 1. This ensures we only try to recover the current ensemble. Previous ensembles, if any, have already been committed, so recovery reads/writes of those ensembles are unnecessary. If a failure occurs during recovery of the current ensemble, then updating that ensemble is a legal operation.
The TLA+ specification uses this new minimum value and so the spec will not currently reach this illegal state.
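As a rough illustration of the minimum described above, here is a minimal Java sketch. It assumes the ledger's ensembles are held in a map keyed by each ensemble's first entry id. The class name, method names, and the use of plain strings for bookie addresses are hypothetical and are not taken from the BookKeeper code base or the forthcoming PR.

```java
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch, not BookKeeper's actual recovery code.
public class RecoveryLacFloor {

    // Lowest LAC that recovery should assume, given the ledger's ensembles
    // keyed by the first entry id of each ensemble.
    static long recoveryLacFloor(NavigableMap<Long, List<String>> ensembles) {
        if (ensembles.isEmpty()) {
            // Assumption: with no ensembles at all, fall back to -1.
            return -1L;
        }
        // lastKey() is the first entry id of the current (latest) ensemble.
        return ensembles.lastKey() - 1;
    }

    // Combine the LAC reported by bookies (-1 on empty ensemble or timeout)
    // with the floor, so recovery never starts before the current ensemble.
    static long effectiveLac(long lacFromBookies,
                             NavigableMap<Long, List<String>> ensembles) {
        return Math.max(lacFromBookies, recoveryLacFloor(ensembles));
    }

    public static void main(String[] args) {
        NavigableMap<Long, List<String>> ensembles = new TreeMap<>();
        ensembles.put(0L, List.of("bookie1", "bookie2", "bookie3"));
        ensembles.put(100L, List.of("bookie1", "bookie2", "bookie4"));
        // LAC reads returned -1, but recovery starts from the current ensemble.
        System.out.println(effectiveLac(-1L, ensembles));
    }
}
```

With this floor, a write-back failure during recovery can only replace the current ensemble, never create a new one starting at entry 0.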
I will shortly submit a PR with the code fix and new unit tests.
Original Issue: apache/bookkeeper#2615