treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data
https://docs.lakefs.io

[CosmosDB] Consistency level is too low #6642

Open arielshaqed opened 11 months ago

arielshaqed commented 11 months ago

The CosmosDB KV code uses "bounded staleness" consistency (or a bit lower in tests using a local emulator). This was already discussed in the review of #5915, so we need to read carefully here...
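For reference, here is a minimal sketch of how a per-request consistency level is typically selected with the Go azcosmos SDK. This is only an illustration of the SDK mechanism, not our actual CosmosDB KV driver code; the connection string, database and container names are made up, and I'm assuming `ItemOptions` exposes a `ConsistencyLevel` field as documented. Note also that, as far as I can tell, a per-request level can only relax the account's default consistency, never strengthen it, so moving to strong consistency would have to happen at the account level.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos"
)

func main() {
	// Hypothetical connection string, database and container names.
	client, err := azcosmos.NewClientFromConnectionString(
		"AccountEndpoint=https://example.documents.azure.com:443/;AccountKey=...;", nil)
	if err != nil {
		log.Fatal(err)
	}
	container, err := client.NewContainer("lakefs-kv", "kv-store")
	if err != nil {
		log.Fatal(err)
	}

	// Ask for bounded staleness on this read. The account's default
	// consistency level still caps how strong a per-request level may be.
	level := azcosmos.ConsistencyLevelBoundedStaleness
	pk := azcosmos.NewPartitionKeyString("some-partition")
	resp, err := container.ReadItem(context.Background(), pk, "some-key",
		&azcosmos.ItemOptions{ConsistencyLevel: &level})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("read %d bytes\n", len(resp.Value))
}
```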

I could find nothing about scan-after-write; this is potentially important for our list-after-write consistency in staging.

Really, all these consistency levels talk about cross-region consistency, except for serializability. Bounded staleness seems to be too low in extreme circumstances. It guarantees "correct" reads only in a MongoDB sense, i.e. a single session gives correct results. In the single-region case this is the worrying paragraph:

For a single-region account, Bounded Staleness provides the same write consistency guarantees as Session and Eventual Consistency. With Bounded Staleness, data is replicated to a local majority (three replicas in a four replica set) in the single region.

For session consistency:

In session consistency, within a single client session, reads are guaranteed to honor the read-your-writes, and write-follows-reads guarantees. This guarantee assumes a single “writer” session or sharing the session token for multiple writers.

But we have multiple client sessions on different cluster members. I see nothing about this avoiding stale reads. Indeed, this StackOverflow answer says that stale reads can happen with bounded staleness.

So in practice the lakeFS application sees session consistency. But that means different cluster members can see different data. I don't see anywhere that says anything useful about consistency across different sessions.

I suspect that we need strong consistency (unfortunately).

arielshaqed commented 11 months ago

Assignees: please go over it and let me know that I'm wrong :-/

nopcoder commented 11 months ago

A single-region write will be replicated to at least 3 of 4 replicas in the same region, because of the following:

Reads when using Bounded Staleness returns the latest data available in that region by reading from two available replicas in that region. Since writes within a region always replicate to a local majority (three out of four replicas), consulting two replicas return the most updated data available in that region.

This means we get read-after-write - we always get the latest.

(Also based on the notes and the following: https://learn.microsoft.com/en-us/answers/questions/116045/azure-cosmosdb-consistency-understanding-bounded-s)
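Put differently, if I read the quoted answer right, this is a quorum-intersection argument: with N = 4 replicas per region, a write commits on a local majority of W = 3 and a bounded-staleness read consults R = 2, so W + R = 5 > N = 4 and every read quorum must overlap every write quorum. That holds as long as the write and the read both land in the same region; across regions bounded staleness only bounds the lag.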

itaiad200 commented 11 months ago

As the one who wrote the code ;) I agree with @nopcoder .

But the easiest way to settle this is with our kvtest suite. The CosmosDB emulator is off by default since Linux support is horrible, but it's just a matter of replacing creds to make it work against a real CosmosDB database. It would be easy to prove inconsistencies using the suite, with the existing tests or new ones if necessary.
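For a quick sanity check outside the kvtest suite, something like the following rough sketch could probe for stale reads directly against a real CosmosDB account: write a value with one client and immediately read it back with a second, independent client, simulating two cluster members. The connection string, database, container, partition key path ("/pk") and key names here are all hypothetical, and this uses the raw azcosmos SDK rather than our CosmosDB driver.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	"github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos"
)

// Hypothetical connection string; replace with real credentials.
const connStr = "AccountEndpoint=https://example.documents.azure.com:443/;AccountKey=...;"

// newContainer builds an independent client, simulating a separate lakeFS server.
// Database name, container name and partition key path ("/pk") are made up.
func newContainer() *azcosmos.ContainerClient {
	client, err := azcosmos.NewClientFromConnectionString(connStr, nil)
	if err != nil {
		log.Fatal(err)
	}
	container, err := client.NewContainer("lakefs-kv", "kv-store")
	if err != nil {
		log.Fatal(err)
	}
	return container
}

func main() {
	ctx := context.Background()
	writer := newContainer()
	reader := newContainer()

	pk := azcosmos.NewPartitionKeyString("probe")
	for i := 0; i < 1000; i++ {
		// Write a new version of the same document.
		doc := []byte(fmt.Sprintf(`{"id": "probe-key", "pk": "probe", "value": %d}`, i))
		if _, err := writer.UpsertItem(ctx, pk, doc, nil); err != nil {
			log.Fatal(err)
		}

		// Immediately read it back from the other client; a stale read would
		// surface as an old "value".
		resp, err := reader.ReadItem(ctx, pk, "probe-key", nil)
		if err != nil {
			log.Fatal(err)
		}
		var got struct {
			Value int `json:"value"`
		}
		if err := json.Unmarshal(resp.Value, &got); err != nil {
			log.Fatal(err)
		}
		if got.Value != i {
			log.Printf("possible stale read at iteration %d: got %d", i, got.Value)
		}
	}
}
```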

arielshaqed commented 11 months ago

I've been thinking about it some more, and both you and I are correct.

But if single-object consistency is enough, we should be able to use a much lower consistency level. The docs do guarantee consistency equivalent to session consistency!

github-actions[bot] commented 8 months ago

This issue is now marked as stale after 90 days of inactivity, and will be closed soon. To keep it, mark it with the "no stale" label.

arielshaqed commented 8 months ago

Keeping open.