sunshowers opened 1 year ago
Specifically: apply_update_as_transaction() at L96 called apply_update_as_transaction_inner() and got an error:

versions_have_idempotent_up: Failed to apply update as transaction
err = db error: ERROR: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE): "sql txn" meta={id=322afc5a key=/Table/15/1/913563555658432513/0 pri=0.00044531 epo=0 ts=1698867872.271914203,1 min=1698867871.782975389,0 seq=21} lock=true stat=PENDING rts=1698867872.253356795,1 wto=false gul=1698867872.282975389,0\nHINT: See: https://www.cockroachlabs.com/docs/v22.1/transaction-retry-error-reference.html#retry_serializable

target version = 1.0.0
Now, apply_update_as_transaction_inner() only executes three SQL statements: BEGIN, some caller-provided SQL, and COMMIT. It panics if either of the first two fail. That means it was the COMMIT that failed -- and further, it seems like that's the only way to reach L93, where we issue the ROLLBACK. But according to https://www.cockroachlabs.com/docs/stable/commit-transaction:

For non-retryable transactions, if statements in the transaction generated any errors, COMMIT is equivalent to ROLLBACK, which aborts the transaction and discards all updates made by its statements.

so if the COMMIT fails, it seems like the session will always not be in a transaction, so the subsequent rollback would always fail in this way.
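For reference, here's a minimal sketch (mine, untested, and assuming a raw tokio_postgres::Client rather than the omicron_test_utils wrapper) of what the retry loop could look like if the ROLLBACK were treated as best-effort instead of being expect()ed:

```rust
use tokio_postgres::error::SqlState;

// Minimal sketch (untested): a client-side retry loop in the style the
// CockroachDB docs describe, but with a best-effort ROLLBACK. If the
// COMMIT itself failed, the session may no longer be in a transaction,
// so an unconditional .expect() on ROLLBACK can panic exactly as seen
// in this issue.
async fn apply_with_retries(
    client: &tokio_postgres::Client,
    sql: &str,
) -> Result<(), tokio_postgres::Error> {
    loop {
        client.batch_execute("BEGIN;").await?;
        let result = async {
            client.batch_execute(sql).await?;
            client.batch_execute("COMMIT;").await
        }
        .await;
        match result {
            Ok(()) => return Ok(()),
            Err(err) => {
                // Best-effort: ignore the error if there is no open
                // transaction left to roll back.
                let _ = client.batch_execute("ROLLBACK;").await;
                if err.code() == Some(&SqlState::T_R_SERIALIZATION_FAILURE) {
                    continue; // retryable per the CRDB error reference
                }
                return Err(err);
            }
        }
    }
}
```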
I don't understand:
- why we ever hit a serialization error here -- aren't we the only one operating on the database?
- why we issue a ROLLBACK at all in this case
- whether we should just skip the ROLLBACK and try again
I hit this here; did a tiny bit of poking around docs and have a couple comments/questions:
so if the COMMIT fails, it seems like the session will always not be in a transaction, so the subsequent rollback would always fail in this way.
Agreed
why we ever hit a serialization error here -- aren't we the only one operating on the database?
Strongly agree - all the bullet points in the docs for RETRY_SERIALIZABLE involve at least two different transactions interacting with each other, which doesn't seem possible in this case, unless some other test is reusing the same CRDB instance? I am moderately worried that if we try to fix / work around this, we'd be masking some underlying CRDB issue we don't understand, that might only show up on illumos. (We only have 4 data points in this issue, but all 4 of those were on the helios test job.)
--
Following one of the links from the RETRY_SERIALIZABLE docs for batching queries, I landed here, which has an example of a single batch string bracketed by BEGIN/COMMIT, followed by this note:
The enable_implicit_transaction_for_batch_statements session variable defaults to true. This means that any batch of statements is treated as an implicit transaction, so the BEGIN/COMMIT commands are not needed to group all the statements in one transaction.
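(As a quick sanity check -- a sketch of mine, assuming a raw tokio_postgres::Client -- one could confirm what that variable is set to on the test's connection:)

```rust
// Sketch (untested): read the session variable the CRDB docs mention.
async fn show_batch_txn_setting(
    client: &tokio_postgres::Client,
) -> Result<String, tokio_postgres::Error> {
    let row = client
        .query_one(
            "SHOW enable_implicit_transaction_for_batch_statements",
            &[],
        )
        .await?;
    Ok(row.get(0))
}
```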
Based on that, do we even need the explicit and separate BEGIN / COMMIT statements in apply_update_as_transaction_inner? If we understood why we were getting serialization errors and just wanted to fix them, I think I'd propose something like this diff:
@@ -64,19 +64,6 @@ async fn test_setup<'a>(
builder
}
-// Attempts to apply an update as a transaction.
-//
-// Only returns an error if the transaction failed to commit.
-async fn apply_update_as_transaction_inner(
- client: &omicron_test_utils::dev::db::Client,
- sql: &str,
-) -> Result<(), tokio_postgres::Error> {
- client.batch_execute("BEGIN;").await.expect("Failed to BEGIN transaction");
- client.batch_execute(&sql).await.expect("Failed to execute update");
- client.batch_execute("COMMIT;").await?;
- Ok(())
-}
-
// Applies an update as a transaction.
//
// Automatically retries transactions that can be retried client-side.
@@ -86,14 +73,12 @@ async fn apply_update_as_transaction(
sql: &str,
) {
loop {
- match apply_update_as_transaction_inner(client, sql).await {
+ // Batch statements are implicitly wrapped in a transaction per
+ // https://www.cockroachlabs.com/docs/v23.1/transactions#batched-statements.
+ match client.batch_execute(sql).await {
Ok(()) => break,
Err(err) => {
warn!(log, "Failed to apply update as transaction"; "err" => err.to_string());
- client
- .batch_execute("ROLLBACK;")
- .await
- .expect("Failed to ROLLBACK failed transaction");
if let Some(code) = err.code() {
if code == &tokio_postgres::error::SqlState::T_R_SERIALIZATION_FAILURE {
warn!(log, "Transaction retrying");
but I'm very hesitant to actually propose that before understanding whether batch_execute does indeed get an implicit transaction and why we're seeing serialization failures at all. When I apply this diff locally, the versions_have_idempotent_up test gets much slower (~90 seconds, but it does pass), so maybe I have fundamentally misunderstood something.
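One way I could imagine testing the implicit-transaction question directly -- a hedged sketch of mine, not something in the repo, assuming a raw tokio_postgres::Client against a throwaway database -- is to send a batch whose second statement fails and check that the first statement's effect was discarded:

```rust
use tokio_postgres::NoTls;

// Sketch (untested): if a multi-statement batch really runs as one
// implicit transaction, the INSERT below must be rolled back when the
// bogus second statement fails. The table name `t` and the connection
// string are placeholders.
async fn batch_is_one_txn() -> Result<(), tokio_postgres::Error> {
    let (client, conn) =
        tokio_postgres::connect("host=localhost port=26257 user=root", NoTls)
            .await?;
    tokio::spawn(conn);

    client
        .batch_execute("CREATE TABLE IF NOT EXISTS t (x INT PRIMARY KEY);")
        .await?;
    let res = client
        .batch_execute("INSERT INTO t VALUES (1); SELECT no_such_function();")
        .await;
    assert!(res.is_err(), "second statement should have failed");
    let rows = client.query("SELECT x FROM t WHERE x = 1", &[]).await?;
    assert!(rows.is_empty(), "INSERT leaked out of the failed batch");
    Ok(())
}
```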
I wonder if there's some instrumentation we could add so that when we see this again in CI, we'll be able to see what other transactions were going on.
I wonder if we already have such instrumentation? I believe the tests deliberately preserve the whole CockroachDB data directory on failure and buildomat saves whatever's left around.
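If we wanted to add something, one hedged option (my sketch; crdb_internal.cluster_transactions is a CockroachDB virtual table, though its exact columns should be double-checked against the CRDB version we run) would be to dump in-flight transactions whenever we hit the serialization failure:

```rust
// Sketch (untested): on a serialization failure, log whatever other
// transactions CockroachDB thinks are in flight.
async fn log_inflight_txns(
    log: &slog::Logger,
    client: &tokio_postgres::Client,
) -> Result<(), tokio_postgres::Error> {
    let rows = client
        .query(
            "SELECT application_name, txn_string \
             FROM crdb_internal.cluster_transactions",
            &[],
        )
        .await?;
    for row in rows {
        let app: Option<String> = row.get(0);
        let txn: Option<String> = row.get(1);
        slog::warn!(log, "in-flight txn"; "app" => ?app, "txn" => ?txn);
    }
    Ok(())
}
```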
@jgallagher thanks for pointing out the Helios-specific behavior -- I left the schema integration tests running in a loop overnight on a Linux machine, and couldn't repro. I'll try again on my Helios box!
Test omicron-nexus::test_all integration_tests::schema::versions_have_idempotent_up failed with: Buildomat log
From @davepacheco: the test issued BEGIN; and the update SQL, then COMMIT;. The COMMIT; failed and CockroachDB reports restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE).