oxidecomputer / omicron

Omicron: Oxide control plane

[nexus] Audit transactions which may have incorrect serialization assumptions #6694

Open smklein opened 1 month ago

smklein commented 1 month ago

As highlighted by https://github.com/oxidecomputer/omicron/pull/6229#discussion_r1777381312 , the following situation may be a problem with CockroachDB:

In this scenario, a row was read by a transaction, and modified before the transaction committed. According to CockroachDB's documentation, this is legal:

“CockroachDB doesn’t allow stale reads”... “No stale reads” means that, once a write transaction committed, every read transaction starting afterwards will see it.

Note that in this situation, the read transaction started before the subsequent write transaction, so the read was not considered stale. That said, the value read by the transaction was modified before the reading transaction committed.

This issue tracks taking a look at our usage of interactive transactions, for the following pattern:

  1. A SELECT statement is issued without a corresponding FOR UPDATE lock
  2. The result of that SELECT statement influences the rest of the transaction, in some way. Examples could be: Influencing a conditional in Rust code, using the result to populate a subsequent UPDATE or INSERT, etc.
  3. The original rows being SELECT-ed are not themselves modified by the transaction.
  4. If the transaction makes the assumption that the SELECT-ed rows must not be modified before the transaction calls COMMIT, then this is a potential problem.
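To make the pattern concrete, here's a minimal sketch (hypothetical widgets / widget_events tables, not our real schema), alongside the FOR UPDATE variant that locks the row the decision depends on:

-- Risky shape: the row read here is never locked, re-read, or re-written,
-- so a concurrent writer can change it without this transaction noticing.
BEGIN;
SELECT state FROM widgets WHERE id = 1;      -- result drives a decision in Rust code
INSERT INTO widget_events (widget_id, note) VALUES (1, 'acting on possibly-stale state');
COMMIT;

-- One remedy: lock the row the decision depends on, so a concurrent writer
-- must wait for (or conflict with) this transaction.
BEGIN;
SELECT state FROM widgets WHERE id = 1 FOR UPDATE;
INSERT INTO widget_events (widget_id, note) VALUES (1, 'state cannot change underneath us');
COMMIT;

Note that FOR UPDATE only locks rows that already exist and were returned by the SELECT; it doesn't, by itself, guard against new rows being inserted into the range the decision depended on, which is the shape several of the examples below take.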

An example of this issue was that highlighted within https://github.com/oxidecomputer/omicron/pull/6229, where we had roughly the following problem:

  • In a Transaction: SELECT the latest blueprint target, confirms it is equal to value X. Perform a database modification, assuming that the SELECT-ed blueprint target has not been modified.
  • Concurrently, another operation may occur, which modified the "latest target blueprint".

In this example, the following may occur:

  • Txn: SELECT the blueprint target, sees value "X"
  • Another operation modified the blueprint target to "X + 1"
  • Txn: Modify the DB, issue COMMIT. No error is observed, it is not known that the target changed mid-operation.

smklein commented 1 month ago

Address Lot deletion uses a transaction to read from address_lot_rsvd_block before updating other tables. It seems possible that another row could be inserted into address_lot_rsvd_block while the transaction is running, which could result in address_lot_rsvd_block entries existing while the address_lot entry is soft-deleted and the address_lot_block entries are hard-deleted.

smklein commented 1 month ago

BGP configuration deletion uses a transaction to check for entries in the sps_bgp_peer_config table before deciding to soft delete an entry in the bgp_config table.

If it's possible for entries to be added to sps_bgp_peer_config concurrently with this transaction, then an old value could be read (indicating it's not in use, when it actually is).
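As a hedged sketch of that shape (the table names are from the description above, but the columns here are illustrative, not the real schema):

-- Illustrative schema only; the real tables differ.
CREATE TABLE bgp_config (id INT PRIMARY KEY, time_deleted TIMESTAMPTZ);
CREATE TABLE sps_bgp_peer_config (id INT PRIMARY KEY, bgp_config_id INT);
INSERT INTO bgp_config (id) VALUES (1);

BEGIN;
-- "Is config 1 still referenced?"  Suppose this returns no rows.
SELECT id FROM sps_bgp_peer_config WHERE bgp_config_id = 1 LIMIT 1;
-- Based on that, soft delete the config.  This transaction never writes to
-- sps_bgp_peer_config, so a concurrent INSERT into that table does not
-- necessarily conflict with it, and the config can end up soft-deleted
-- while still referenced.
UPDATE bgp_config SET time_deleted = now() WHERE id = 1;
COMMIT;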

smklein commented 1 month ago

BGP announce set creation and update both perform the following sequence of updates in a transaction:

In the case where bgp_announce_set already exists, this transaction will read that row, but will not attempt to lock or modify it. It seems plausible that a concurrent writer could delete bgp_announce_set before these transactions commit, which would result in bgp_announcement being populated for a bgp_announce_set that has been removed.
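A sketch of that risk, again with made-up columns (the real schema differs):

-- Illustrative schema only.
CREATE TABLE bgp_announce_set (id INT PRIMARY KEY);
CREATE TABLE bgp_announcement (id INT PRIMARY KEY, announce_set_id INT);
INSERT INTO bgp_announce_set (id) VALUES (1);

BEGIN;
-- Read (but neither lock nor modify) the parent announce set.
SELECT id FROM bgp_announce_set WHERE id = 1;
-- A concurrent "DELETE FROM bgp_announce_set WHERE id = 1" can commit while
-- this transaction is open; since we never write the parent row, the two
-- transactions don't necessarily conflict, and these children can end up
-- referencing an announce set that no longer exists.
INSERT INTO bgp_announcement (id, announce_set_id) VALUES (1, 1);
COMMIT;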

smklein commented 1 month ago

Blueprint deletion performs the following operations:

It seems possible that a blueprint is made the active blueprint (by updating bp_target) after our initial SELECT. We read a stale value and rely on the assumption that blueprint_id is not the current target; that assumption may no longer be valid, but we'll proceed with deletion anyway.

smklein commented 1 month ago

Updating DNS seems to use a pattern of:

There's even a comment about a TOCTTOU not being an issue on "read before write" because of the SERIALIZABLE level:

       // This looks like a time-of-check-to-time-of-use race, but this
       // approach works because we're inside a transaction and the
       // isolation level is SERIALIZABLE.

I've tried to repro this with two CockroachDB shells:

# Shell 1:
CREATE TABLE versions (id UUID PRIMARY KEY, n integer, name TEXT);
INSERT INTO VERSIONS (id, n, name) values (gen_random_uuid(), 1, 'first');
INSERT INTO VERSIONS (id, n, name) values (gen_random_uuid(), 2, 'second');

# Shell 2:
BEGIN;
SELECT * FROM VERSIONS ORDER BY n DESC LIMIT 1;

# Shell 1:
INSERT INTO VERSIONS (id, n, name) values (gen_random_uuid(), 3, 'outside txn');

# Shell 2:
INSERT INTO VERSIONS (id, n, name) values (gen_random_uuid(), 3, 'inside txn');
COMMIT;

I am seeing this result in a serialization error, so I think this transaction is actually working as-is. I think this may be due to us INSERT-ing into the same table we SELECT-ed from?

EDIT: I also tried writing a unit test for this case, and am not observing a serialization error there, which suggests this TOCTTOU could be possible. But I'm not sure why the unit test and the two CRDB shells are exhibiting different behavior here.

smklein commented 1 month ago

Instance Update Network Interface seems to do the following operations:

I think in this case, the instance record returned by the instance_query may change arbitrarily in the middle of this transaction. For example, I think it's possible that this could actually update the primary interface of an instance that is not stopped, because no write lock is taken on the instance_query.

EDIT: This also applies to the other variant where a primary is not being updated -- there's still a check on the result of SELECT FROM instance, even though that might be stale: https://github.com/oxidecomputer/omicron/blob/9fb967ded78f188c2386ee5644fb1b5c799b6e11/nexus/db-queries/src/db/datastore/network_interface.rs#L799

smklein commented 1 month ago

Physical Disk Adoption checks that a Sled is "in service" before inserting a disk and zpool. The guard here is intended to verify that the Sled hasn't been expunged, but it's possible that this sled becomes expunged mid-transaction, and the physical disk adoption happens anyway.

smklein commented 1 month ago

Silo Group Deletion does the following:

Auditing this, I was concerned that a concurrent writer could add an entry to silo_group_membership and deletion would proceed anyway:

CREATE TABLE membership (id UUID PRIMARY KEY, group_id UUID);
CREATE TABLE groups (id UUID PRIMARY KEY, name TEXT);
INSERT INTO groups (id, name) VALUES ('e7cb922d-fab0-4d05-9132-9c38386dc4e0', 'hello');

# Shell 1
BEGIN;
SELECT * FROM membership WHERE group_id='e7cb922d-fab0-4d05-9132-9c38386dc4e0';

# Shell 2
INSERT INTO membership (id, group_id) VALUES (gen_random_uuid(), 'e7cb922d-fab0-4d05-9132-9c38386dc4e0');

# Shell 1
DELETE FROM groups WHERE id='e7cb922d-fab0-4d05-9132-9c38386dc4e0';
COMMIT;

However, this does seem to throw a serialization error, as expected.

smklein commented 1 month ago

Loopback Address Creation does the following:

This problem might appear if the read from address_lot_block becomes stale before the subsequent writes.

davepacheco commented 1 month ago

After spending some quality time with various docs and blog posts, I think we misunderstood the behavior here. Going back to the example that kicked all this off:

An example of this issue was that highlighted within #6229, where we had roughly the following problem:

  • In a Transaction: SELECT the latest blueprint target, confirms it is equal to value X. Perform a database modification, assuming that the SELECT-ed blueprint target has not been modified.
  • Concurrently, another operation may occur, which modified the "latest target blueprint".

In this example, the following may occur:

  • Txn: SELECT the blueprint target, sees value "X"
  • Another operation modified the blueprint target to "X + 1"
  • Txn: Modify the DB, issue COMMIT. No error is observed, it is not known that the target changed mid-operation.

Let's call txn1 the transaction that checks the current target blueprint and writes more data iff it matches what's expected. Let's call txn2 the transaction that changes the target blueprint.

I believe we were wrong in inferring that the behavior described above violates our application-level invariant.

We've been operating from the idea that: "we only want to write these records if the current target blueprint is what we think it is". Equivalently, "we want to avoid writing these records if the current target blueprint has changed from what we think it is". But this isn't sufficiently precise because in an MVCC world (like pretty much any modern relational database), there isn't really an objective truth about which blueprint is the target at a given point in time -- there's only the question of what value a particular transaction would see. Instead of talking about whether "the blueprint has changed during this transaction", we need to phrase the invariant in terms of what another observer might see.

The behavior we're really trying to avoid is that some other transaction txn3 sees the new target blueprint (txn2's write) and not txn1's writes, and then subsequently txn3 does see txn1's writes. The behavior observed above does not show that happening. As far as I can tell, CockroachDB does promise that can't happen. I wrote a tool to help play around with this and I tried a few different interleavings. I saw that:

All of this is consistent with the application-level invariant we're trying to uphold: at no point was an observer able to see blueprint "X + 1" and not the data that txn1 wrote and then later see the data that txn1 wrote.

Another way to phrase all this: what we observed here was CockroachDB re-ordering things so that the transaction that set the target blueprint, although it completed first in wall-clock time, was logically after the transaction that checked the blueprint state and wrote the data. And there's nothing wrong with that.


Using sqldance, I also went through the example in #6712. That example is trying to show a case where the DNS update code would be broken by another transaction arriving and changing the DNS generation in the middle of the transaction that normally tries to update it. The end result in the test does violate the assumed invariant that there can only be one version. But that's because of the transaction that inserts a new version without checking the current one first. If I instead change the sqldance example to do what the real DNS update code does, which is to read the current DNS version and then try to update it, then one of the two attempted updates always fails. So I think that in the end the DNS update code is correct, and for the reason mentioned in the comment about the transactions being serializable.


I know we all (myself included) have been confused at various points in the last few days. I'd love for someone to verify this reasoning, and it'd also be good to work through any other examples that we've found confusing. But from the cases I've looked at, the code we have seems correct and the concurrency model is as strong as we've been assuming.

andrewjstone commented 1 month ago

Dave, I spent a while this weekend going through this as well and I came up with essentially the same answer as you. I'm going to write my thinking down, not because I don't think you realize this but to confirm we are on the same page and to provide some additional notes for others.

I think you made a mistake in this line here though, which you correct in prose below that:

If txn3 arrives after txn2 completes and after txn1 has written the data, then txn3 blocks until txn1 completes and in the end all three transactions complete successfully with the logical ordering being: txn2, txn1, txn3.

The above states the physical time of transaction completion, but the logical serialization schedule is txn1, txn2, txn3. This is because txn1 read X first and txn2 wrote X+1, therefore txn1 must be ordered logically before txn2. The other data being written by txn1 is unrelated to any data in txn2, which is what allows these two transactions to run concurrently and commit independently while still having a valid logical serial order.

I think the introduction of txn3 shows what a conflict looks like and when txn1 would abort, but it's not strictly necessary to infer the ordering of txn1 and txn2 based on serializability. CRDB runs txn1 and txn2 in parallel (via MVCC) for performance reasons, and txn1 reads X before txn2 commits. MVCC importantly allows the "writer to not block the reader". Therefore, for concurrent transactions, as long as there are no other conflicts, a read and write of the same value (X in this case) allows both transactions to commit independently. This is because the logical serialization order is not the same as the commit time. There is no "real-time" component to serializability in principle, although in practice CRDB limits how long transactions can run without a retry error.

But you could imagine a hypothetical database where txn1 and txn2 are the only two transactions that ever run and they start concurrently. They operate such that txn1 reads val=X and txn2 writes val=X+1. txn2 commits. Then txn1 performs a computation that takes 2 years and computes Y. Eventually, it writes the output to val2 = Y and commits. txn1 is still logically ordered before txn2, even though it committed 2 years later. This is allowed by serializability, whereas it wouldn't be allowed by linearizability or strict serializability.

Now, on the other hand, if 2 years later txn1 tried to write val=Y (to our original variable), it would abort: it read that value, and another transaction wrote to the same location in the meantime, before txn1 tried to write. There's no serial order that would allow that.
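A two-shell sketch of that distinction, in the style of the repros above (kv is a made-up table; the outcomes noted are what serializability requires, though as the unit-test vs. shell discrepancy above shows, the exact behavior of slow interactive sessions can vary):

# Setup:
CREATE TABLE kv (k TEXT PRIMARY KEY, v INT);
INSERT INTO kv VALUES ('val', 1), ('val2', 0);

# Shell 1 (txn1):
BEGIN;
SELECT v FROM kv WHERE k = 'val';              -- reads X

# Shell 2 (txn2):
UPDATE kv SET v = v + 1 WHERE k = 'val';       -- writes X+1 and commits

# Shell 1, ending (a) -- write a different key ('val2'):
UPDATE kv SET v = 99 WHERE k = 'val2';
COMMIT;    -- both transactions can commit; the logical order is txn1 before txn2

# Shell 1, ending (b) -- instead, write the same key it read ('val'):
UPDATE kv SET v = 99 WHERE k = 'val';
COMMIT;    -- no serial order allows this; CockroachDB raises a transaction
           -- retry error (at the UPDATE or the COMMIT)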

Now coming back to your example with txn3. Let's assume that both txn1 and txn2 commit (for the reasons above), and then we start txn3. txn1 will indeed still write the extra data. Then when txn3 starts, it will read X+1 and the extra data written when the target was still X. This is actually the same behavior as txn1 committing, then txn2 starting and committing, and then txn3 running. If it is ok for txn3 to see X+1 and the extra data (as long as there is no point where txn3 doesn't see the extra data and then it suddenly becomes available), then I agree with your conclusion.

FWIW, the CMU intro to db lectures are very good and cover concurrency control well. Starting here and continuing for the next few lectures as needed may help us all.

davepacheco commented 1 month ago

I think you made a mistake in this line here though, which you correct in prose below that:

If txn3 arrives after txn2 completes and after txn1 has written the data, then txn3 blocks until txn1 completes and in the end all three transactions complete successfully with the logical ordering being: txn2, txn1, txn3.

The above states the physical time of transaction completion, but the logical serialization schedule is txn1, txn2, txn3.

Agreed!

smklein commented 1 month ago

Just to hammer this home, I figured I'd draw out the serialization conflicts via a graph for the blueprint case within sqldance's recent-exmaples.txt

Looking specifically for "RW, WR, or WW" conflicts, using the lens @andrewjstone mentioned from the notes on DB concurrency control:

c1: BEGIN;
c1: SELECT * from bp_target ORDER BY gen DESC LIMIT 1;
c2:                       BEGIN;
c2:                       SELECT * from bp_target ORDER BY gen DESC LIMIT 1;
c2:                       INSERT INTO bp_target (gen) VALUES (2);
c2:                       COMMIT;
c1: INSERT INTO data VALUES (1);
c1: COMMIT;
c1              c2
R(bp_target)
                R(bp_target)
                W(bp_target)
W(data)

[image: dependency graph for c1 and c2]

There is a dependency between the two transactions, but no cycles, so according to conflict serializability, it's valid for both of these transactions to commit.

c1: BEGIN;
c1: SELECT * from bp_target ORDER BY gen DESC LIMIT 1;
c2:                       BEGIN;
c2:                       SELECT * from bp_target ORDER BY gen DESC LIMIT 1;
c2:                       INSERT INTO bp_target (gen) VALUES (2);
c2:                       COMMIT;
# This is the point where we could conceivably see generation 2 but missing data
# from generation 1.
c3: SELECT * from bp_target;
c3: SELECT * from data WHERE x = 2;
c1: INSERT INTO data VALUES (1);
c1: COMMIT;
c1              c2              c3
R(bp_target)
                R(bp_target)
                W(bp_target)
                                R(bp_target)
                                R(data)
W(data)

[image: dependency graph for c1, c2, and c3, showing a cycle]

A cycle exists here, which explains the serializability error -- there is no ordering of transactions that is valid where we don't violate RW/WR semantics.

Finally, for the blocking case:

c1: BEGIN;
c1: SELECT * from bp_target ORDER BY gen DESC LIMIT 1;
c2:                       BEGIN;
c2:                       SELECT * from bp_target ORDER BY gen DESC LIMIT 1;
c2:                       INSERT INTO bp_target (gen) VALUES (2);
c2:                       COMMIT;
c1: INSERT INTO data VALUES (1);
c3: SELECT * from bp_target;
# This operation blocks
c3: SELECT * from data WHERE x = 2;
c1              c2              c3
R(bp_target)
                R(bp_target)
                W(bp_target)
W(data)
                                R(bp_target)
                                R(data)

[image: dependency graph for c1, c2, and c3 in the blocking case]

From the graph (dotted line showing a dependency that only emerges if we actually try the c3: SELECT * FROM DATA line) we can see that the ordering must be C1 -> C2 -> C3.

From a data-dependency point of view, it's clear that C3's read of data is dependent on C1's transaction completing -- it has a pending write to data -- but it's interesting that CockroachDB chooses to block here, rather than force a retry of C3.

smklein commented 1 month ago

Thanks a ton for the notes on this, @davepacheco and @andrewjstone - I read through these materials, and found them immensely useful.

I believe that the question of "can you find a circular dependency in a graph of the nodes" is kinda the crux of the question "does CRDB consider these transactions conflicting or not?"

I bring this up because I think there are cases where the "third observer matters", and cases where the scenario can be collapsed into two transactions.

@davepacheco, for example, here is your sqldance snippet which introduces a "third party" to observe the state of the world in the blueprint example and triggers a serialization error:

c1: BEGIN;
c1: SELECT * from bp_target ORDER BY gen DESC LIMIT 1;
c2:                       BEGIN;
c2:                       SELECT * from bp_target ORDER BY gen DESC LIMIT 1;
c2:                       INSERT INTO bp_target (gen) VALUES (2);
c2:                       COMMIT;
# This is the point where we could conceivably see generation 2 but missing data
# from generation 1.
c3: SELECT * from bp_target;
c3: SELECT * from data WHERE x = 2;
c1: INSERT INTO data VALUES (1);
c1: COMMIT;

I don't believe it's actually necessary to have three observers to cause this situation to happen. It can also be triggered with two observers, as long as a circular dependency is forced between RW/WR/WW operations:

c1: BEGIN;
c1: SELECT * from bp_target ORDER BY gen DESC LIMIT 1;
c2:                       BEGIN;
c2:                       SELECT * from bp_target ORDER BY gen DESC LIMIT 1;
c2:                       INSERT INTO bp_target (gen) VALUES (2);
# This is new, and load-bearing! But we don't need a third party to do it, we can
# cause the circular dependency ourselves.
c2:                       SELECT * from data WHERE x = 2;
c2:                       COMMIT;
c1: INSERT INTO data VALUES (1);
c1: COMMIT;

(This also results in a serialization error, as there is no valid ordering of C1 and C2 if they're both operating optimistically)

smklein commented 1 month ago

(Addressing some of the earlier examples I dug up)

A lot of the prior examples had the form:

Where a concurrent transaction does the following:

What can go wrong: It's possible for the check in txn 1 to pass, and then immediately become invalidated by a concurrent transaction (e.g., a "pointer row" exists to another row that is actively getting deleted).

For example, this is the case for blueprint deletion.

I came up with an example of this behavior using sqldance -- note that this does not fail with a retry error:

# Clean up from last time.
setup: DROP TABLE IF EXISTS bp_target;
setup: DROP TABLE IF EXISTS bp;
setup: CREATE TABLE bp_target (id INT PRIMARY KEY, gen INT);
setup: CREATE TABLE bp (id INT PRIMARY KEY);

setup: INSERT INTO bp VALUES (1);
setup: INSERT INTO bp VALUES (2);

# The target is blueprint 1 when we start
setup: INSERT INTO bp_target VALUES (1, 1);

c1: BEGIN;
c1: SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
c2:         BEGIN;
c2:         SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
c2:         INSERT INTO bp_target VALUES (2, 2);
c2:         COMMIT;
c1: DELETE FROM bp WHERE id = 2;
c1: COMMIT;

# Report the final results.
setup: SELECT * from bp_target;
setup: SELECT * from bp;

This case ends up with a bp_target entry pointing to a blueprint with id = 2, but that blueprint was fully deleted from the bp table. This is bad state, regardless of whether or not someone else observed it!

c1              c2
R(bp_target)
                R(bp_target)
                W(bp_target)
W(bp)

In this scenario c1 -> c2 because of the R(bp_target) -> W(bp_target) ordering.

I believe that this case can be prevented, but the prevention must happen in c2: it's important to include a read of bp before we do the INSERT:

c1: BEGIN;
c1: SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
c2:         BEGIN;
c2:         SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
# This is important!
c2:         SELECT * FROM bp WHERE id = 2;
c2:         INSERT INTO bp_target VALUES (2, 2);
c2:         COMMIT;
c1: DELETE FROM bp WHERE id = 2;
c1: COMMIT;

By including this line, we have the following orderings:

  • c1 reads bp_target before c2 writes it, so c1 must come before c2.
  • c2 reads bp before c1 deletes from it, so c2 must come before c1.

These two orderings form a cycle, so there is no valid serial order. Therefore, the latter transaction (c1 in this case) is forced to retry if there is true concurrent access.

andrewjstone commented 1 month ago

From a data-dependency point of view, it's clear that C3's read of data is dependent on C1's transaction completing -- it has a pending write to data -- but it's interesting that CockroachDB chooses to block here, rather than force a retry of C3.

There is actually a 2020 SIGMOD paper about CRDB that I didn't know existed until yesterday, and which I have yet to read. However, Murat summarizes it well as usual. The important bits for us are the concurrency control section. Interestingly, it appears that the only time CRDB blocks is in the write-read conflict case, which is exactly what we have in your example.

Write-read conflicts. A read running into an uncommitted intent with a lower timestamp will wait for the earlier transaction to finalize. Waiting is implemented using in-memory queue structures. A read running into an uncommitted intent with a higher timestamp ignores the intent and does not need to wait.

This is fairly odd in terms of optimistic concurrency control, but CRDB is also an odd, and very original beast when it comes to MVCC. They only use a single timestamp due to the reliance on hybrid logical clocks, which gives them some real time capability across nodes. So sometimes they re-order transactions or restart them, sometimes they wait apparently!

davepacheco commented 1 month ago

This case ends up with a bp_target entry pointing to a blueprint with id = 2, but that blueprint was fully deleted from the bp table. This is bad state, regardless of whether or not someone else observed it!

c1              c2
R(bp_target)
                R(bp_target)
                W(bp_target)
W(bp)

In this scenario c1 -> c2 because of the R(bp_target) -> W(bp_target) ordering.

I believe that prevention of this case is possible, but must be taken by c2: It's important to include a read of bp, before we do the INSERT:

Agreed -- I think at one point we talked about having it look like INSERT INTO bp_target (blueprint_id, ...) VALUES ((SELECT from bp_blueprint WHERE id = my_id)) -- i.e., so that you're selecting the id out of the bp_blueprint table (even though you know it) as you try to insert it.
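Using the toy bp / bp_target schema from the sqldance examples above, one way to write that shape (a sketch, not the real query):

-- Derive the id from bp itself so the inserting transaction reads the row it
-- depends on; if blueprint 2 has been deleted, nothing is inserted (and the
-- application can detect that zero rows were inserted).
INSERT INTO bp_target (id, gen)
SELECT id, 2 FROM bp WHERE id = 2;

Because the insert now reads bp, a concurrent DELETE FROM bp WHERE id = 2 creates the same read/write relationship as the explicit SELECT in the "Good" cases below.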

davepacheco commented 1 month ago

Write-read conflicts. A read running into an uncommitted intent with a lower timestamp will wait for the earlier transaction to finalize. Waiting is implemented using in-memory queue structures. A read running into an uncommitted intent with a higher timestamp ignores the intent and does not need to wait.

This is fairly odd in terms of optimistic concurrency control, but CRDB is also an odd, and very original beast when it comes to MVCC. They only use a single timestamp due to the reliance on hybrid logical clocks, which gives them some real time capability across nodes. So sometimes they re-order transactions or restart them, sometimes they wait apparently!

The CockroachDB architecture "Transaction Layer" docs do talk about this some, culminating in the "Transaction conflicts" section, though I had to go back and read the earlier sections about write intents, timestamps, etc. to make sense of it. (Reading this made me want to go find a talk that walks through how writes work in CockroachDB.)

Here are some excerpts I found helpful:

When the transaction layer executes write operations, it doesn't directly write values to disk. Instead, it creates several things that help it mediate a distributed transaction:
...
Replicated Locks (also known as write intents) are replicated via Raft, and act as a combination of a provisional value and an exclusive lock. They are essentially the same as standard multi-version concurrency control (MVCC) values but also contain a pointer to the transaction record stored on the cluster.
...
As write intents are created, CockroachDB checks for newer committed values. If newer committed values exist, the transaction may be restarted. If existing write intents for the same keys exist, it is resolved as a transaction conflict.
...
Reading: If the transaction has not been aborted, the transaction layer begins executing read operations. If a read only encounters standard MVCC values, everything is fine. However, if it encounters any write intents, the operation must be resolved as a transaction conflict.

Then under "Transaction conflicts":

To make this simpler to understand, we'll call the first transaction TxnA and the transaction that encounters its write intents TxnB.

It goes on to say that in the cases we care about (i.e., where transaction priority is not set and where the writer is still alive), "TxnB enters the TxnWaitQueue to wait for TxnA to complete." I think the summary of all of this is that if SERIALIZABLE transactions can possibly complete successfully by blocking, they do.

The docs for the latest stable (24.2.3) talk a lot more about the READ COMMITTED isolation level. They must have added this in a version after 22.1. This section is an incredibly helpful summary of both the data visibility and blocking behavior of transactions at different isolation levels. Highlights:

While writes in a SERIALIZABLE transaction can block reads in concurrent SERIALIZABLE transactions, they will not block reads in concurrent READ COMMITTED transactions. Writes in a READ COMMITTED transaction will never block reads in concurrent transactions, regardless of their isolation levels.

If a SERIALIZABLE transaction S1 writes but does not commit before a SERIALIZABLE transaction S2, the first statement in S2 that would observe an unwritten row from S1 will be blocked until S1 commits or aborts.

I'm assuming the bits about SERIALIZABLE transactions are valid in 22.1 because it's consistent with our understanding and seems like it wouldn't be something you could just change in a software update without breaking people's applications... but I did not check the release notes to verify that.

smklein commented 1 month ago

I'm still working through the question of "When is there a Read-Write dependency from Cockroach's point-of-view". I'm finding this surprisingly subtle.

Re-using my example from the delete case, I'll walk through some examples. All of these have the setup:

# Clean up from last time.
setup: DROP TABLE IF EXISTS bp_target;
setup: DROP TABLE IF EXISTS bp;
setup: CREATE TABLE bp_target (id INT PRIMARY KEY, gen INT);
setup: CREATE TABLE bp (id INT PRIMARY KEY);

setup: INSERT INTO bp VALUES (1);
setup: INSERT INTO bp VALUES (2);

# The target is blueprint 1 when we start
setup: INSERT INTO bp_target VALUES (1, 1);

Bad: No Circular dependency

c1: BEGIN;
c1: SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
c2:         BEGIN;
c2:         SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
c2:         INSERT INTO bp_target VALUES (2, 2);
c2:         COMMIT;
c1: DELETE FROM bp WHERE id = 2;
c1: COMMIT;

(No transaction retry)

To recap, in this case, we do the wrong thing, and "successfully delete the blueprint that's the current target", because c2 does not have any ordering dependency on bp.

All the follow-up queries I'm making try to answer: "what is considered an overlap of bp?"

Good: Read exactly the row we delete

c1: BEGIN;
c1: SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
c2:         BEGIN;
c2:         SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
# This is the important SELECT
c2:         SELECT * FROM bp WHERE id = 2;
c2:         INSERT INTO bp_target VALUES (2, 2);
c2:         COMMIT;
c1: DELETE FROM bp WHERE id = 2;
c1: COMMIT;

In this example, we see a transaction retry error, presumably because there is a R(bp) -> W(bp) ordering, by accessing exactly the same row of bp in both transactions.

Bad: Read exactly a different row

c1: BEGIN;
c1: SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
c2:         BEGIN;
c2:         SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
# This is the important SELECT
c2:         SELECT * FROM bp WHERE id = 1;
c2:         INSERT INTO bp_target VALUES (2, 2);
c2:         COMMIT;
c1: DELETE FROM bp WHERE id = 2;
c1: COMMIT;

In this example, we do not see a transaction retry error, and we again delete the current target blueprint. This happens because there is no R(bp) -> W(bp) ordering -- the read and write to bp operate on totally distinct keys.

Good: Read exactly the same row, but through a more generic query

c1: BEGIN;
c1: SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
c2:         BEGIN;
c2:         SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
# This is the important SELECT
c2:         SELECT * FROM bp ORDER BY id DESC LIMIT 1;
c2:         INSERT INTO bp_target VALUES (2, 2);
c2:         COMMIT;
c1: DELETE FROM bp WHERE id = 2;
c1: COMMIT;

In this example, we see a transaction retry error: this is similar to directly SELECT-ing the row from bp WHERE id = 2, which causes a Read->Write ordering. (See also: "Read exactly the row we delete").

Good: Read exactly a different row, but through a more generic query?

c1: BEGIN;
c1: SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
c2:         BEGIN;
c2:         SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
# This is the important SELECT
c2:         SELECT * FROM bp ORDER BY id ASC LIMIT 1;
c2:         INSERT INTO bp_target VALUES (2, 2);
c2:         COMMIT;
c1: DELETE FROM bp WHERE id = 2;
c1: COMMIT;

In this example, we do see a transaction retry error, even though we read the row where bp has id = 1. This is worth comparing to the "Read exactly a different row" case -- even though we read a non-overlapping row from bp, from a transaction conflict point-of-view, it was still considered overlapping!

EDIT: Kudos to @davepacheco -- this last query only conditionally causes a retry error, depending on the version of CockroachDB being used.

On version v22.1.9, I saw this error. On v23.2.4, @davepacheco reported no transaction retry error.

davepacheco commented 1 month ago

I'm still working through the question of "When is there a Read-Write dependency from Cockroach's point-of-view". I'm finding this surprisingly subtle.

Re-using my example from the delete case, I'll walk through some examples. All of these have the setup:

# Clean up from last time.
setup: DROP TABLE IF EXISTS bp_target;
setup: DROP TABLE IF EXISTS bp;
setup: CREATE TABLE bp_target (id INT PRIMARY KEY, gen INT);
setup: CREATE TABLE bp (id INT PRIMARY KEY);

setup: INSERT INTO bp VALUES (1);
setup: INSERT INTO bp VALUES (2);

# The target is blueprint 1 when we start
setup: INSERT INTO bp_target VALUES (1, 1);

Bad: No Circular dependency

c1: BEGIN;
c1: SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
c2:         BEGIN;
c2:         SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
c2:         INSERT INTO bp_target VALUES (2, 2);
c2:         COMMIT;
c1: DELETE FROM bp WHERE id = 2;
c1: COMMIT;

(No transaction retry)

To recap, in this case, we do the wrong thing, and "successfully delete the blueprint that's the current target", because c2 does not have any ordering dependency on bp.

This makes sense to me. The ordering should result in c1 < c2 so you'll wind up with the invariant violated (target blueprint is not present).

It's worth noting that c2's transaction is broken even without any concurrency because it never checked that blueprint 2 was present.

All the follow-up queries I'm making try to answer: "what is considered an overlap of bp?"

Good: Read exactly the row we delete

c1: BEGIN;
c1: SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
c2:         BEGIN;
c2:         SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
# This is the important SELECT
c2:         SELECT * FROM bp WHERE id = 2;
c2:         INSERT INTO bp_target VALUES (2, 2);
c2:         COMMIT;
c1: DELETE FROM bp WHERE id = 2;
c1: COMMIT;

In this example, we see a transaction retry error, presumably because there is a R(bp) -> W(bp) ordering, by accessing exactly the same row of bp in both transactions.

Makes sense. This seems like applying the straightforward "fix" of having c2 check that the blueprint is present.

Bad: Read exactly a different row

c1: BEGIN;
c1: SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
c2:         BEGIN;
c2:         SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
# This is the important SELECT
c2:         SELECT * FROM bp WHERE id = 1;
c2:         INSERT INTO bp_target VALUES (2, 2);
c2:         COMMIT;
c1: DELETE FROM bp WHERE id = 2;
c1: COMMIT;

In this example, we do not see a transaction retry error, and we again delete the current target blueprint. This happens because there is no R(bp) -> W(bp) ordering -- the read and write to bp operate on totally distinct keys.

It makes sense to me that this has the same behavior as the one above since it hasn't changed anything about the relationship between the two transactions. (This version still has the problem that it can set a blueprint to the current target even if it doesn't exist, even without any concurrency.)

Good: Read exactly the same row, but through a more generic query

c1: BEGIN;
c1: SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
c2:         BEGIN;
c2:         SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
# This is the important SELECT
c2:         SELECT * FROM bp ORDER BY id DESC LIMIT 1;
c2:         INSERT INTO bp_target VALUES (2, 2);
c2:         COMMIT;
c1: DELETE FROM bp WHERE id = 2;
c1: COMMIT;

In this example, we see a transaction retry error: this is similar to directly SELECT-ing the row from bp WHERE id = 2, which causes a Read->Write ordering. (See also: "Read exactly the row we delete").

I think this is working (arguably) by accident. It happens to read the same row, and so the transactions end up related, but I think that's basically a coincidence here right? Also this one will still set a blueprint to the target even if it doesn't exist, even without concurrency, unless the one we're setting to the target happens to be also the latest one by id.

Good: Read exactly a different row, but through a more generic query?

c1: BEGIN;
c1: SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
c2:         BEGIN;
c2:         SELECT * FROM bp_target ORDER BY id DESC LIMIT 1;
# This is the important SELECT
c2:         SELECT * FROM bp ORDER BY id ASC LIMIT 1;
c2:         INSERT INTO bp_target VALUES (2, 2);
c2:         COMMIT;
c1: DELETE FROM bp WHERE id = 2;
c1: COMMIT;

In this example, we do see a transaction retry error, even though we read the row where bp has id = 1. This is worth comparing to the "Read exactly a different row" case -- even though we read a non-overlapping row from bp, from a transaction conflict point-of-view, it was still considered overlapping!

It surprises me that this produces an error. I discovered by accident that it doesn't produce an error in 23.2.4 on my Mac.

Looking at it more closely:

I think these are related in that if c1 < c2, then c1 reads bp_target 1, then deletes bp 2; and then c2 reads bp_target 1 and then inserts bp_target 2. As in the above examples, c2's transaction is wrong even without concurrency since it doesn't check if bp2 still exists.

If c2 < c1, things would be different: c2 sees bp_target 1 and sets a new target of 2; but this time c1 sees bp_target 2 and then deletes it [anyway, which the application presumably wouldn't actually do]. However, I don't think this ordering is possible if the transactions are interactive like this because Cockroach would have already reported bp_target 1 to c1. So I think that rules out c2 < c1.

So I'm back to wondering why this ever produces an error. Offline, @smklein sent me the output:

conn "c1": executing: "COMMIT;"
conn "c1": error: db error: ERROR: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict: committed value on key /Table/122/1/2/0): "sql txn" meta={id=d05e7503 key=/Table/123/1/2 pri=0.00349525 epo=0 ts=1727825054.678770998,1 min=1727825054.677760352,0 seq=1} lock=true stat=PENDING rts=1727825054.677760352,0 wto=false gul=1727825055.177760352,0
HINT: See: https://www.cockroachlabs.com/docs/v22.1/transaction-retry-error-reference.html#retry_serializable: ERROR: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict: committed value on key /Table/122/1/2/0): "sql txn" meta={id=d05e7503 key=/Table/123/1/2 pri=0.00349525 epo=0 ts=1727825054.678770998,1 min=1727825054.677760352,0 seq=1} lock=true stat=PENDING rts=1727825054.677760352,0 wto=false gul=1727825055.177760352,0
HINT: See: https://www.cockroachlabs.com/docs/v22.1/transaction-retry-error-reference.html#retry_serializable

I'm curious what /Table/122/1/2/0 denotes. Based on the book docs, I expected a row in our table to be called something like /$table_id/$key or even /bp/$key (bp being the table name) but it could be that 122 is the table id and it's just nested under /Table and that got left out of the docs. Decoding this should explain what it was conflicting on I think.
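One way to check (assuming the crdb_internal.tables virtual table, which these versions expose) is to map the ids from the error back to table names:

-- If 122 and 123 are table ids, this shows which tables they are.
SELECT table_id, name FROM crdb_internal.tables WHERE table_id IN (122, 123);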