More efficient support for linearizable reads

heidihoward commented 1 year ago

I believe the only way to achieve linearizable reads at the moment is for the app logic to create a phantom write to the kv to ensure that the primary replicates and reaches consensus on the associated txn. Without this, the read might be stale, even if forwarding is set to always and the client waits until the read txn ID is committed. This is because the primary who served the read may no longer be the current primary.

This is related to but somewhat separate from the discussion around wait for commit.

Here's some early ideas about how to make linearizable reads more efficient:

Allow txns with empty write sets - this probably the simplest approach. The app logic would still need to create a transaction for the read but at least with no write set, the transaction will have few conflicts and be smaller. It's a cleaner solution but I'm not sure how much is gained by this approach
Return the last sequence number +1 to the client for linearizable reads. This means that the client will not consider the read committed unless the primary commits another transaction (in the same term). The key issue here is that if the previous txn is a signature and no other write txns arrive then the client could be waiting indefinitely. This could be worked around by forcing another signature or using (1)
Implement PreVote or some type of leader stickiness such that if we trust clocks then the primary can serve linearizable reads without a round trip to a majority
Implement wait for commit and then delay responses to linearizable reads until after the next successful AppendEntries. This is closest to existing systems like etcd

_Aside: This could be an interesting to model in TLA+ using a separate high-level specification such as https://github.com/Azure/azure-cosmos-tla/blob/master/general-model/cosmos_client.tla_

lemmy commented 1 year ago

_Aside: This could be an interesting to model in TLA+ using a separate high-level specification such as https://github.com/Azure/azure-cosmos-tla/blob/master/general-model/cosmos_client.tla_

The latest versions of the Cosmos specs are hosted at https://github.com/tlaplus/azure-cosmos-tla/ (including a completely new spec discussed in Understanding Inconsistency in Azure Cosmos DB with TLA+). Lorin Hochstein's Linearizability spec could also prove useful.

eddyashton commented 1 year ago

This is related to but somewhat separate from the discussion around wait for commit.

I think we'd need a phantom write and wait-for-commit behaviour to produce a linearizable read? If we only add a write to a read operation, then it could still be executed over stale state on an isolated (no longer current) primary. It would at least get a fresh transaction ID, that could later be recognised as invalid/uncommittable, but I don't see how that helps the initial response linearizability, since that read result may report stale state.

heidihoward commented 1 year ago

Agreed, the client must wait for commit for a linearizable read (or write for that matter). This comment was referring to the discussion around CCF withholding client responses until the txn has been committed.

achamayou commented 1 year ago

I don't think we need a phantom write to detect the next instance of a successful AppendEntries quorum. If we maintained in a counter next to the CheckQuorum response time for example, we could just check for that going up following commit?

heidihoward commented 1 year ago

Yes, a successful appendEntries to a majority, even if just a heartbeat, is sufficient for the leader to know that a read is linearizable

eddyashton commented 1 year ago

To use the AppendEntries roundtrip, do we need to include a new counter value in AEs (and responses)? I'm specifically worried about delayed arrival of ACKs. You can't currently distinguish a heartbeat-ACK emitted before a user's request arrived, from one emitted after. These are misleading but non-fatal for CheckQuorum, but if a node believes an ACK is fresh (since user request) when it is not, that could break linearizability?

heidihoward commented 1 year ago

To use the AppendEntries roundtrip, do we need to include a new counter value in AEs (and responses)? I'm specifically worried about delayed arrival of ACKs. You can't currently distinguish a heartbeat-ACK emitted before a user's request arrived, from one emitted after. These are misleading but non-fatal for CheckQuorum, but if a node believes an ACK is fresh (since user request) when it is not, that could break linearizability?

Yes you're right. The leader will need to ensure that the AE (heartbeat or otherwise) was requested after the read-only transaction was received

heidihoward commented 1 year ago

An example of the non-linearizability of read-only transactions is now provided in the TLA+ CI https://github.com/microsoft/CCF/actions/runs/6381743027/job/17318854604

microsoft / CCF

More efficient support for linearizable reads #5636