ory / keto

Open Source (Go) implementation of "Zanzibar: Google's Consistent, Global Authorization System". Ships gRPC, REST APIs, newSQL, and an easy and granular permission language. Supports ACL, RBAC, and other access models.
Apache License 2.0

Provide Consistency Guarantees using Snapshot Tokens #517

Open zepatrik opened 3 years ago

zepatrik commented 3 years ago

The Zanzibar paper describes the so-called zookies to be used for evaluating checks at object versions. They rely on Google's TrueTime for globally consistent timestamps, which will probably not be available everywhere. The tokens are opaque and could encode arbitrary data that Keto can use to guarantee that it considers all relation tuples "at least as old as the object version". On an object update, Keto issues a new token that the application then stores together with the corresponding object version. It might be possible to use some form of logical vector clock (e.g. a bloom clock) as a token, or to rely on CockroachDB to check whether the system time is within a specific window on each instance and get the time from there.
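
For illustration, the opaque token could be as simple as a serialized clock value that clients never inspect. A minimal sketch in Go, assuming the token only wraps a commit timestamp; the names (`SnapshotToken`, `Encode`, `Decode`) are hypothetical and not part of Keto:

```go
package zookie

import (
	"encoding/base64"
	"encoding/json"
	"time"
)

// SnapshotToken is the data hidden behind the opaque string handed to clients.
// It could later grow additional fields (e.g. a logical clock) without
// breaking clients, since they only ever pass the string back unchanged.
type SnapshotToken struct {
	CommitTime time.Time `json:"ts"`
}

// Encode serializes the token into an opaque base64 string.
func Encode(t SnapshotToken) (string, error) {
	raw, err := json.Marshal(t)
	if err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(raw), nil
}

// Decode parses an opaque token back into its structured form.
func Decode(s string) (SnapshotToken, error) {
	var t SnapshotToken
	raw, err := base64.RawURLEncoding.DecodeString(s)
	if err != nil {
		return t, err
	}
	err = json.Unmarshal(raw, &t)
	return t, err
}
```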

zepatrik commented 3 years ago

Also found an interesting clock algorithm that is based on a logical ring of nodes, like the one we might use anyway for consistent hashing @jon-whit: https://www.researchgate.net/publication/221134159_An_Efficient_Implementation_of_Vector_Clocks_in_Dynamic_Systems
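
For reference, a token based on a logical clock would essentially be a summary like the plain vector clock sketched below; this is not the ring-based variant from the linked paper, and the type and method names are made up:

```go
package clock

// VectorClock maps a node ID to the number of events observed from that node.
type VectorClock map[string]uint64

// Tick increments the local node's component before a local event.
func (vc VectorClock) Tick(node string) { vc[node]++ }

// Merge takes the element-wise maximum of two clocks, e.g. on message receipt.
func (vc VectorClock) Merge(other VectorClock) {
	for node, t := range other {
		if t > vc[node] {
			vc[node] = t
		}
	}
}

// AtLeastAsNewAs reports whether vc has seen everything other has seen, i.e.
// whether a snapshot tagged with vc covers the events summarized by other.
func (vc VectorClock) AtLeastAsNewAs(other VectorClock) bool {
	for node, t := range other {
		if vc[node] < t {
			return false
		}
	}
	return true
}
```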

jon-whit commented 3 years ago

@zepatrik interesting! I did a quick pass over the paper and it looks quite promising. I need to read it in much more detail to fully understand it, though.

zepatrik commented 3 years ago

On a special type of check request, the content-change check, Zanzibar issues a zookie for the latest snapshot.

The zookie encodes the evaluation snapshot and captures any possible causality from ACL changes to content changes, because the zookie’s timestamp will be greater than that of the ACL updates that protect the new content

All it tells us is that the data we need for ACL evaluation have already propagated, so we can proceed, right? If you would store, let's say, a UUIDv4 together with the relation tuples affected by an ACL update, you can determine whether those data have propagated based on whether the lookup succeeds. For delete events this would mean adding a tuple that represents the deleted tuple with a new UUID (let's call that a ghost tuple). By looking up the UUID from the zookie, we can determine whether the data were already updated. For causally related events this would also work, I think, because you can only apply an event if the previous event's UUID lookup succeeded (i.e. the previous event has propagated). This way we implicitly build a chain of related events. Issuing events with no zookie, on the other hand, would start a new event chain, i.e. be concurrent in terms of distributed systems. The client always has to make sure that requests include the right zookie.

Example:

  1. new object is tracked in ACL, its initial relation tuples are added to Keto without a zookie
    1. assign a new UUID1 to all the new relation tuples (or a ghost tuple) and return it as zookie1
  2. a check request is made against the object (includes zookie1)
    1. look up whether there is any match for UUID1, if yes handle request, if no forward to other nodes
  3. a content-change check request is made against the object without a change to the ACL (includes zookie1)
    1. look up whether there is any match for UUID1, if yes handle request, if no forward to other nodes
    2. create new UUID2 and update all tuples with UUID1 to UUID2
    3. add an entry to a lookup table where UUID1 is marked as obsoleted by UUID2 (to be determined whether needed)
    4. respond with UUID2 as zookie2
  4. a content-change check request is made against the object with a change to the ACL (includes zookie2)
    1. look up whether there is any match for UUID2, if yes handle request, if no forward to other nodes
    2. create new UUID3 and assign it to the new tuples (or ghost tuple if only deletion)
    3. update all tuples with UUID2 to UUID3
    4. add an entry to a lookup table where UUID2 is marked as obsoleted by UUID3 (to be determined whether needed)
    5. respond with UUID3 as zookie3
  5. read/expand requests are made (includes zookie3)
    1. look up whether there is any match for UUID3, if yes handle request, if no forward to other nodes
  6. write requests will probably only be possible through content-change checks with ACL changes

All steps are executed as one transaction.
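
A rough sketch of how the content-change transaction from steps 3 and 4 could look, assuming CockroachDB/PostgreSQL via database/sql; the table and column names (keto_relation_tuples.event_uuid, keto_obsoleted_events) are made up for illustration and are not Keto's actual schema:

```go
package check

import (
	"context"
	"database/sql"
	"errors"

	"github.com/google/uuid"
)

// ErrNotPropagated signals that the zookie's event is not visible on this
// node yet, so the request should be forwarded to another node.
var ErrNotPropagated = errors.New("zookie not propagated yet")

// contentChangeCheck rotates the event UUID on an object's tuples and returns
// the new UUID, which becomes the new zookie handed back to the client.
func contentChangeCheck(ctx context.Context, db *sql.DB, object, oldUUID string) (string, error) {
	newUUID := uuid.New().String()

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return "", err
	}
	defer tx.Rollback()

	// Steps 3.1/4.1: check that the event behind the client's zookie is visible here.
	var known bool
	if err := tx.QueryRowContext(ctx,
		`SELECT EXISTS (SELECT 1 FROM keto_relation_tuples WHERE event_uuid = $1)`,
		oldUUID).Scan(&known); err != nil {
		return "", err
	}
	if !known {
		return "", ErrNotPropagated
	}

	// Steps 3.2/4.3: rotate the event UUID on all tuples of the object.
	if _, err := tx.ExecContext(ctx,
		`UPDATE keto_relation_tuples SET event_uuid = $1 WHERE object = $2 AND event_uuid = $3`,
		newUUID, object, oldUUID); err != nil {
		return "", err
	}

	// Steps 3.3/4.4: record that the old event is obsoleted by the new one.
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO keto_obsoleted_events (old_uuid, new_uuid) VALUES ($1, $2)`,
		oldUUID, newUUID); err != nil {
		return "", err
	}

	if err := tx.Commit(); err != nil {
		return "", err
	}
	return newUUID, nil
}
```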

Concurrent transactions should fail because they are updating the same tuples and therefore result in a conflict.

The lookup table for old UUIDs is required to determine whether an "old" zookie has been obsoleted by another one. This table can probably be kept small, because zookies so old that they are fully propagated can be handled by simply skipping the check step.
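
The corresponding lookup on the read path could then look roughly like this, using the same hypothetical tables as in the sketch above:

```go
package check

import (
	"context"
	"database/sql"
)

// zookieIsKnown reports whether the event encoded in a zookie has reached this
// node: either some tuple still carries the UUID, or the UUID has already been
// obsoleted by a newer event recorded in the lookup table. If neither is true,
// the request would be forwarded to another node.
func zookieIsKnown(ctx context.Context, db *sql.DB, eventUUID string) (bool, error) {
	var known bool
	err := db.QueryRowContext(ctx, `
		SELECT EXISTS (SELECT 1 FROM keto_relation_tuples WHERE event_uuid = $1)
		    OR EXISTS (SELECT 1 FROM keto_obsoleted_events WHERE old_uuid = $1)`,
		eventUUID).Scan(&known)
	return known, err
}
```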

Maybe we can also have a separate table for events instead of having the event IDs in the relation tuple table(s).

Is there anything I missed?

jon-whit commented 3 years ago

Wouldn't this strategy require that, when writing a relation tuple or executing a content-change check, a UUID would have to be generated and written for every relation tuple involved in the graph-of-relations traversal, including those reached through userset rewrites? If so, this seems like it would be quite a high-latency approach.

Consider the scenario where a check is dictated by a relation tuple in another namespace through a userset rewrite rule (tupleToUserset), or where a check's outcome depends on a computedUserset within the same namespace. In these scenarios you'd have to traverse the graph of relations and write the UUID for all tuples involved in the traversal. These writes would also have to be replicated across the majority of replicas in the database cluster, which would further increase the latency of the check request.

This is why Spanner's snapshot reads with bounded staleness alleviate the issue. You can read all of the tuples of the namespaces involved in the graph traversal at a snapshot no staler than the provided zookie timestamp. The read might incur a delay because the majority of replicas have to agree on the read snapshot, but there are no majority-replica write delays. The write latency is incurred only when writing a new relation tuple.
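
For comparison, this is roughly what such a bounded-staleness read looks like with the Cloud Spanner Go client; Keto does not target Spanner, so this only illustrates the semantics described above, and the table name is made up:

```go
package snapshot

import (
	"context"
	"time"

	"cloud.google.com/go/spanner"
)

// readTuplesNoStalerThan reads relation tuples at a snapshot that is at least
// as fresh as the timestamp carried in the zookie, without paying for the
// strongly consistent latest snapshot.
func readTuplesNoStalerThan(ctx context.Context, client *spanner.Client, zookieTS time.Time) *spanner.RowIterator {
	ro := client.Single().WithTimestampBound(spanner.MinReadTimestamp(zookieTS))
	return ro.Query(ctx, spanner.NewStatement(
		`SELECT namespace, object, relation, subject FROM relation_tuples`))
}
```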

I fear this approach would ultimately dominate the tail latency of Keto, and for a multi-region cluster it probably wouldn't be uncommon to see no less than 200ms for DB replication.

zepatrik commented 3 years ago

Summary of yesterday's synchronous discussion:

  1. CRDB does not offer, and will likely never be able to offer, global ordering of transactions. Its serializable isolation level ensures ordering of overlapping transactions, but not of non-overlapping ones. This is only possible in Spanner because of the atomic clocks and GPS hardware attached to it.
  2. Solutions for similar problems are:
    1. Assume the synchronization time of writes to CRDB is sufficient and either don't handle it at all, or wait for enough time to pass before handling a request. Because nodes with a clock skew of more than 500ms are not considered healthy, this is effectively the time we have to wait. CRDB gives us the transaction timestamp that we can use for this approach.
    2. Create a "global table" and write to it in each transaction. This is effectively the UUID approach proposed above (https://github.com/ory/keto/issues/517#issuecomment-881370496). It will capture updates to the object's ACL, but not updates to indirect permissions (think removing someone from a group). It can also be used as an optimization of the previous point for some cases.
    3. Use an external time source that is accurate enough and don't rely on CRDB (i.e. require atomic clocks/GPS attached to Keto nodes). This is likely out of scope and will never be implemented.

So the next steps would be to investigate further how the individual approaches look in practice and to come up with a POC. It will probably make the most sense to go with the first approach for now and add optimizations (like the second approach) later on, once we have concrete cases of failure or unacceptable latency.
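
A sketch of what the first approach could look like, assuming the zookie simply carries the CockroachDB transaction timestamp (obtained via cluster_logical_timestamp()) and checks wait out the maximum tolerated clock offset; the names and numbers are illustrative only:

```go
package snapshot

import (
	"context"
	"database/sql"
	"strconv"
	"time"
)

// maxClockOffset mirrors CockroachDB's default --max-offset of 500ms; nodes
// with a larger clock skew are not considered healthy.
const maxClockOffset = 500 * time.Millisecond

// zookieTimestamp reads the current transaction's HLC timestamp, which would
// be encoded into the zookie returned for a write.
func zookieTimestamp(ctx context.Context, tx *sql.Tx) (time.Time, error) {
	var dec string
	if err := tx.QueryRowContext(ctx, `SELECT cluster_logical_timestamp()::STRING`).Scan(&dec); err != nil {
		return time.Time{}, err
	}
	// The decimal encodes nanoseconds since epoch plus a logical counter;
	// wall-clock precision is enough for this purpose.
	nanos, err := strconv.ParseFloat(dec, 64)
	if err != nil {
		return time.Time{}, err
	}
	return time.Unix(0, int64(nanos)), nil
}

// waitForSnapshot blocks until the zookie's timestamp is at least
// maxClockOffset in the past, after which every healthy node should have
// caught up with the write that issued the zookie.
func waitForSnapshot(ctx context.Context, zookieTS time.Time) error {
	wait := time.Until(zookieTS.Add(maxClockOffset))
	if wait <= 0 {
		return nil
	}
	select {
	case <-time.After(wait):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```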

github-actions[bot] commented 2 years ago

Hello contributors!

I am marking this issue as stale as it has not received any engagement from the community or maintainers for a year. That does not imply that the issue has no merit! If you feel strongly about this issue, please do participate in the conversation.

Throughout its lifetime, Ory has received over 10,000 issues and PRs. To sustain that growth, we need to prioritize and focus on issues that are important to the community. A good indication of importance, and thus priority, is activity on a topic.

Unfortunately, burnout has become a topic of concern amongst open-source projects.

It can lead to severe personal and health issues as well as opening catastrophic attack vectors.

The motivation for this automation is to help prioritize issues in the backlog and not ignore, reject, or belittle anyone.

If this issue was marked as stale erroneously, you can exempt it by adding the backlog label, assigning someone, or setting a milestone for it.

Thank you for your understanding and to anyone who participated in the conversation! And as written above, please do participate in the conversation if this topic is important to you!

Thank you 🙏✌️

aeneasr commented 1 year ago

Unless we add interim storage such as Redis, I think it would make sense to rely on database features such as follower reads to perform fast queries on (slightly stale) data. Please note that follower reads are an enterprise feature in CockroachDB, but queries will still work in the open-source version (they will just be slower).

For the time being I don't think it makes sense to add a cache such as Redis. If you need inspiration, SpiceDB has implemented snapshot tokens using follower reads.
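
For reference, a follower read in CockroachDB boils down to something like the following sketch; the table name is illustrative, and whether the read is actually served by a follower depends on the CockroachDB version and license:

```go
package read

import (
	"context"
	"database/sql"
)

// subjectHasRelation answers a check against a slightly stale but fast
// snapshot, allowing the query to be served by the nearest replica.
func subjectHasRelation(ctx context.Context, db *sql.DB, namespace, object, relation, subject string) (bool, error) {
	var count int
	err := db.QueryRowContext(ctx, `
		SELECT count(1) FROM keto_relation_tuples
		AS OF SYSTEM TIME follower_read_timestamp()
		WHERE namespace = $1 AND object = $2 AND relation = $3 AND subject = $4`,
		namespace, object, relation, subject).Scan(&count)
	return count > 0, err
}
```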

github-actions[bot] commented 8 months ago

Hello contributors!

I am marking this issue as stale as it has not received any engagement from the community or maintainers for a year. That does not imply that the issue has no merit! If you feel strongly about this issue, please do participate in the conversation.

Throughout its lifetime, Ory has received over 10,000 issues and PRs. To sustain that growth, we need to prioritize and focus on issues that are important to the community. A good indication of importance, and thus priority, is activity on a topic.

Unfortunately, burnout has become a topic of concern amongst open-source projects.

It can lead to severe personal and health issues as well as opening catastrophic attack vectors.

The motivation for this automation is to help prioritize issues in the backlog and not ignore, reject, or belittle anyone.

If this issue was marked as stale erroneously you can exempt it by adding the backlog label, assigning someone, or setting a milestone for it.

Thank you for your understanding and to anyone who participated in the conversation! And as written above, please do participate in the conversation if this topic is important to you!

Thank you 🙏✌️