oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

versions_have_idempotent_up test fails with "Failed to ROLLBACK failed transaction...there is no transaction in progress" #4409

Open sunshowers opened 1 year ago

sunshowers commented 1 year ago

Test omicron-nexus::test_all integration_tests::schema::versions_have_idempotent_up failed with

thread 'integration_tests::schema::versions_have_idempotent_up' panicked at 'Failed to ROLLBACK failed transaction: Error { kind: Db, cause: Some(DbError { severity: "ERROR", parsed_severity: Some(Error), code: SqlState(Other("XXUUU")), message: "there is no transaction in progress", detail: None, hint: None, position: None, where_: None, schema: None, table: None, column: None, datatype: None, constraint: None, file: Some("exec_util.go"), line: Some(741), routine: Some("init") }) }', nexus/tests/integration_tests/schema.rs:96:22

Buildomat log

From @davepacheco:

It's not clear to me why we'd ever expect to get a serializable error here or what the logic here is trying to do in terms of retrying (or why it's issuing a rollback, since the only thing that can have failed here is the commit).

davepacheco commented 1 year ago

Specifically:

versions_have_idempotent_up: Failed to apply update as transaction
    err = db error: ERROR: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE): "sql txn" meta={id=322afc5a key=/Table/15/1/913563555658432513/0 pri=0.00044531 epo=0 ts=1698867872.271914203,1 min=1698867871.782975389,0 seq=21} lock=true stat=PENDING rts=1698867872.253356795,1 wto=false gul=1698867872.282975389,0\nHINT: See: https://www.cockroachlabs.com/docs/v22.1/transaction-retry-error-reference.html#retry_serializable
    target version = 1.0.0

Now, apply_update_as_transaction_inner() only executes three SQL statements: BEGIN, some caller-provided SQL, and COMMIT. It panics if either of the first two fails. That means it was the COMMIT that failed -- and further, it seems like that's the only way to reach L93, where we issue the ROLLBACK. But according to https://www.cockroachlabs.com/docs/stable/commit-transaction:

For non-retryable transactions, if statements in the transaction generated any errors, COMMIT is equivalent to ROLLBACK, which aborts the transaction and discards all updates made by its statements.

so if the COMMIT fails, it seems like the session will never be in a transaction at that point, so the subsequent rollback would always fail in this way.

I don't understand:

- why we ever hit a serialization error here -- aren't we the only one operating on the database?
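For reference, here is a condensed sketch of the code path described above, reconstructed from the test code that appears in jgallagher's diff later in this thread (signatures are abbreviated and the comments are added here; this is not the verbatim source):

```rust
// Only the COMMIT can return an error from this helper: the BEGIN and the
// caller-provided SQL both panic on failure.
async fn apply_update_as_transaction_inner(
    client: &omicron_test_utils::dev::db::Client,
    sql: &str,
) -> Result<(), tokio_postgres::Error> {
    client.batch_execute("BEGIN;").await.expect("Failed to BEGIN transaction");
    client.batch_execute(&sql).await.expect("Failed to execute update");
    client.batch_execute("COMMIT;").await?;
    Ok(())
}

// The caller retries on serialization failures, but first issues an
// unconditional ROLLBACK. If the error above came from the COMMIT,
// CockroachDB has already aborted the transaction, so this ROLLBACK runs with
// no transaction in progress and its expect() produces the panic in this
// issue's title.
loop {
    match apply_update_as_transaction_inner(client, sql).await {
        Ok(()) => break,
        Err(err) => {
            warn!(log, "Failed to apply update as transaction"; "err" => err.to_string());
            client
                .batch_execute("ROLLBACK;")
                .await
                .expect("Failed to ROLLBACK failed transaction");
            // ...retry only if err is a T_R_SERIALIZATION_FAILURE...
        }
    }
}
```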

bnaecker commented 1 year ago

Just noting I also saw this on one of my PR workflows here, with the failing log file here

rcgoodfellow commented 12 months ago

Hit this here

jgallagher commented 12 months ago

I hit this here; did a tiny bit of poking around docs and have a couple comments/questions:

so if the COMMIT fails, it seems like the session will never be in a transaction at that point, so the subsequent rollback would always fail in this way.

Agreed

why we ever hit a serialization error here -- aren't we the only one operating on the database?

Strongly agree - all the bullet points in the docs for RETRY_SERIALIZABLE involve at least two different transactions interacting with each other, which doesn't seem possible in this case, unless some other test is reusing the same CRDB instance? I am moderately worried that if we try to fix or work around this, we'd be masking some underlying CRDB issue we don't understand, one that might only show up on illumos. (We only have 4 data points in this issue, but all 4 of those were on the helios test job.)

--

Following one of the links from the RETRY_SERIALIZABLE docs for batching queries, I landed here, which has an example of a single batch string bracketed by BEGIN/COMMIT followed by this note:

The enable_implicit_transaction_for_batch_statements session variable defaults to true. This means that any batch of statements is treated as an implicit transaction, so the BEGIN/COMMIT commands are not needed to group all the statements in one transaction.

Based on that, do we even need the explicit and separate BEGIN / COMMIT statements in apply_update_as_transaction_inner? If we understood why we were getting serialization errors and just wanted to fix them, I think I'd propose something like this diff:

@@ -64,19 +64,6 @@ async fn test_setup<'a>(
     builder
 }

-// Attempts to apply an update as a transaction.
-//
-// Only returns an error if the transaction failed to commit.
-async fn apply_update_as_transaction_inner(
-    client: &omicron_test_utils::dev::db::Client,
-    sql: &str,
-) -> Result<(), tokio_postgres::Error> {
-    client.batch_execute("BEGIN;").await.expect("Failed to BEGIN transaction");
-    client.batch_execute(&sql).await.expect("Failed to execute update");
-    client.batch_execute("COMMIT;").await?;
-    Ok(())
-}
-
 // Applies an update as a transaction.
 //
 // Automatically retries transactions that can be retried client-side.
@@ -86,14 +73,12 @@ async fn apply_update_as_transaction(
     sql: &str,
 ) {
     loop {
-        match apply_update_as_transaction_inner(client, sql).await {
+        // Batch statements are implicitly wrapped in a transaction per
+        // https://www.cockroachlabs.com/docs/v23.1/transactions#batched-statements.
+        match client.batch_execute(sql).await {
             Ok(()) => break,
             Err(err) => {
                 warn!(log, "Failed to apply update as transaction"; "err" => err.to_string());
-                client
-                    .batch_execute("ROLLBACK;")
-                    .await
-                    .expect("Failed to ROLLBACK failed transaction");
                 if let Some(code) = err.code() {
                     if code == &tokio_postgres::error::SqlState::T_R_SERIALIZATION_FAILURE {
                         warn!(log, "Transaction retrying");

but I'm very hesitant to actually propose that before understanding whether batch_execute does indeed get an implicit transaction and why we're seeing serialization failures at all. When I apply this diff locally, the versions_have_idempotent_up test gets much slower (~90 seconds, but it does pass), so maybe I have fundamentally misunderstood something.
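One way to check the implicit-transaction question empirically would be a small standalone probe along the lines of the sketch below: run a batch whose second statement always fails and see whether the first statement's effect was rolled back. This uses tokio_postgres directly; the connection string, table name, and overall setup are hypothetical and not part of the test suite.

```rust
use tokio_postgres::NoTls;

#[tokio::main]
async fn main() -> Result<(), tokio_postgres::Error> {
    // Hypothetical connection parameters; the schema test normally gets its
    // client from omicron_test_utils::dev::db instead.
    let (client, conn) =
        tokio_postgres::connect("host=127.0.0.1 port=26257 user=root", NoTls).await?;
    tokio::spawn(conn);

    client
        .batch_execute("CREATE TABLE IF NOT EXISTS batch_probe (x INT PRIMARY KEY);")
        .await?;

    // A batch whose second statement always fails. If the batch runs as one
    // implicit transaction, the INSERT should be rolled back along with it.
    let res = client
        .batch_execute(
            "INSERT INTO batch_probe VALUES (1); SELECT no_such_column FROM batch_probe;",
        )
        .await;
    assert!(res.is_err());

    // If zero rows are visible here, the failed batch really did behave like
    // a single transaction.
    let row = client.query_one("SELECT count(*) FROM batch_probe;", &[]).await?;
    let rows_left: i64 = row.get(0);
    println!("rows visible after failed batch: {rows_left}");
    Ok(())
}
```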

davepacheco commented 12 months ago

I wonder if there's some instrumentation we could add so that when we see this again in CI, we'll be able to see what other transactions were going on.

I wonder if we already have such instrumentation? I believe the tests deliberately preserve the whole CockroachDB data directory on failure and buildomat saves whatever's left around.

smklein commented 12 months ago

@jgallagher thanks for pointing out the Helios-specific behavior -- I left the schema integration tests running in a loop overnight on a Linux machine, and couldn't repro. I'll try again on my Helios box!