askfongjojo opened 1 year ago
I have not examined this in detail. But it looks similar to oxidecomputer/remote-access-preview#23, which was resolved by #1688 (which amounted to simplifying the database transaction to avoid so many round-trips between Nexus and CockroachDB during which other transactions could come in and invalidate state).
It looks like in this case the query is this one: https://github.com/oxidecomputer/omicron/blob/f8c5a21054ff4f50ce8344699e6af7885a3d2554/nexus/db-queries/src/db/datastore/sled.rs#L90-L189
which is not nearly so complicated and only appears to involve two round-trips. It essentially seems to be saying: select a sled on which to put a reservation, then insert the reservation. Although this is one transaction, that doesn't mean that two concurrent executions are queued in the database. Instead, I think the database is free to just fail one of them with exactly this sort of error: https://www.cockroachlabs.com/docs/stable/transactions.html#transaction-retries
Given that we're in the context of a saga, we could probably retry. But that seems unlikely to be a good answer in the medium-to-long term. In even a remotely busy system, I think we can expect retries to fail with some frequency. In that case, some fraction of provision attempts will still fail (because they run out of retries), and for the successful ones, provision latency could increase significantly (if they're doing a bunch of retries on average).
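For concreteness, the stopgap would be to wrap the transaction in a bounded retry loop with backoff inside the saga node. This is a minimal sketch of that shape only, not omicron code: `DbError` and the closure are hypothetical stand-ins for the real database error type and the reservation transaction.

```rust
use std::{thread, time::Duration};

/// Stand-in for the real database error type; `RetrySerializable` models
/// CockroachDB's RETRY_SERIALIZABLE / TransactionRetryError family.
#[derive(Debug, PartialEq)]
enum DbError {
    RetrySerializable,
    #[allow(dead_code)]
    Fatal(&'static str),
}

/// Run `txn` up to `max_attempts` times, sleeping with exponential backoff
/// after each retryable failure. Non-retryable errors (and a retryable error
/// on the final attempt) are returned to the caller.
fn with_retries<T>(
    max_attempts: u32,
    mut txn: impl FnMut() -> Result<T, DbError>,
) -> Result<T, DbError> {
    let mut delay = Duration::from_millis(10);
    let mut attempt = 0;
    loop {
        attempt += 1;
        match txn() {
            Ok(v) => return Ok(v),
            Err(DbError::RetrySerializable) if attempt < max_attempts => {
                thread::sleep(delay);
                delay *= 2; // exponential backoff
            }
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    // Simulate a transaction that hits RETRY_SERIALIZABLE twice, then commits.
    let mut conflicts_left = 2;
    let result = with_retries(5, || {
        if conflicts_left > 0 {
            conflicts_left -= 1;
            Err(DbError::RetrySerializable)
        } else {
            Ok("reservation inserted")
        }
    });
    assert_eq!(result, Ok("reservation inserted"));
}
```

Note that this sketch has exactly the failure modes described above: under sustained contention every attempt can keep losing, so the loop still surfaces an error after `max_attempts`, and each retry adds latency to the provision.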
This general issue is discussed in RFD 192 and applies to most uses of interactive transactions. Eventually I think we want to rephrase this query so that we can issue the whole transaction to the database at once (e.g., what we did in #1688). That allows the database to issue a retry itself if needed, and it also makes the window for contention much, much smaller.
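For illustration only, the "whole transaction at once" shape could be a single `INSERT ... SELECT` statement, so the database sees the entire transaction up front and can retry it internally instead of surfacing RETRY_SERIALIZABLE to Nexus. The table and column names below are hypothetical placeholders, not omicron's actual schema:

```rust
/// Hypothetical single-statement reservation: pick a sled with enough free
/// capacity and insert the reservation in one round-trip, with no interactive
/// BEGIN/COMMIT. All identifiers here are made up for illustration.
const RESERVE_SLED_SQL: &str = "\
INSERT INTO sled_resource (id, sled_id, hardware_threads, rss_ram)
SELECT gen_random_uuid(), sled.id, $1, $2
FROM sled
WHERE sled.usable_hardware_threads - sled.reserved_threads >= $1
LIMIT 1
RETURNING sled_id";

fn main() {
    // Because this is one implicit transaction, CockroachDB's own retry
    // machinery can handle serialization conflicts server-side.
    println!("{RESERVE_SLED_SQL}");
}
```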
Again, caveat: I have not examined this particular case closely and it's possible there's something else going on here.
See also #973.
With the recent VM lifecycle changes, the error now happens during instance start (when they all hit the same project for resource accounting):
saga_id = eee51974-c3c1-4883-a5b2-7c2672f87fdb
saga_name = instance-start
sec_id = 65a11c18-7f59-41ac-b9e7-680627f996e7
04:28:21.304Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): failed to start newly-created instance
error = InternalError { internal_message: "saga ACTION error at node \\"sled_id\\": unexpected database error: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict: committed value on key /Table/217/1/\\"_g \\\\xb8\\\\x8a1E\\\\xf8\\\\x8c\\\\x94\\\\x8ei\\\\x92\\\\x18\\\\xf2\\\\x8b\\"/0): \\"sql txn\\" meta={id=2e29c041 key=/Table/220/1/\\"\\\\xe5kQ\\\\xdd8WC\\\\xa3\\\\xa0\\\\xc1j\\\\xa8r\\\\x92\\\\f\\\\xe9\\"/0 pri=0.05421299 epo=0 ts=1698294501.245738365,1 min=1698294501.228055165,0 seq=2} lock=true stat=PENDING rts=1698294501.228055165,0 wto=false gul=1698294501.728055165,0" }
file = nexus/src/app/instance.rs:264
instance_id = a63eed60-a544-4ee2-af5d-dd6cc287574f
The frequency of the instance-start error is much lower (around 10-15%) than what we saw with concurrent instance creation. When provisioning 3 new instances with Terraform, it used to hit at least one 500 error about 90% of the time.
Under very high concurrency (provisioning 120 instances in a single TF plan with the --parallelism parameter set to 20), I ran into a different TransactionRetry error: TransactionAbortedError(ABORT_REASON_PUSHER_ABORTED). This was on rack2, which has all the transaction retry fixes.
root@[fd00:1122:3344:105::3]:32221/omicron> select * from saga_node_event where saga_id = '8670e14f-46bc-49fd-a7e0-b79a1aa21ec5';
saga_id | node_id | event_type | data | event_time | creator
---------------------------------------+---------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------+---------------------------------------
8670e14f-46bc-49fd-a7e0-b79a1aa21ec5 | 0 | started | NULL | 2023-12-14 07:00:23.083066+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
8670e14f-46bc-49fd-a7e0-b79a1aa21ec5 | 0 | succeeded | "e8b211d1-66c3-4dfc-a24d-b5e99ecd824e" | 2023-12-14 07:00:23.090158+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
8670e14f-46bc-49fd-a7e0-b79a1aa21ec5 | 0 | undo_finished | NULL | 2023-12-14 07:00:25.310563+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
8670e14f-46bc-49fd-a7e0-b79a1aa21ec5 | 0 | undo_started | NULL | 2023-12-14 07:00:25.29775+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
8670e14f-46bc-49fd-a7e0-b79a1aa21ec5 | 1 | failed | {"ActionFailed": {"source_error": {"InternalError": {"internal_message": "unexpected database error: restart transaction: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_PUSHER_ABORTED): \\"sql txn\\" meta={id=9744414f key=/Table/220/1/\\"\\\\xe8\\\\xb2\\\\x11\\\\xd1f\\\\xc3M\\\\xfc\\\\xa2M\\\\xb5\\\\xe9\\\\x9e\xcd\x82N\\"/0 pri=0.03354754 epo=1 ts=1702537224.926508273,0 min=1702537224.822377260,0 seq=0} lock=true stat=ABORTED rts=1702537224.926508273,0 wto=false gul=1702537225.322377260,0"}}}} | 2023-12-14 07:00:25.168454+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
8670e14f-46bc-49fd-a7e0-b79a1aa21ec5 | 1 | started | NULL | 2023-12-14 07:00:23.100223+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
8670e14f-46bc-49fd-a7e0-b79a1aa21ec5 | 10 | started | NULL | 2023-12-14 07:00:23.069572+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
8670e14f-46bc-49fd-a7e0-b79a1aa21ec5 | 10 | succeeded | null | 2023-12-14 07:00:23.07599+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
8670e14f-46bc-49fd-a7e0-b79a1aa21ec5 | 10 | undo_finished | NULL | 2023-12-14 07:00:25.33197+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
8670e14f-46bc-49fd-a7e0-b79a1aa21ec5 | 10 | undo_started | NULL | 2023-12-14 07:00:25.323884+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
(10 rows)
I haven't been able to reproduce #3814 and #3786 in v5. We can consider closing some of these tickets, or all of the TransactionRetry ones, with a new ticket filed on the latest error (and linked to the closed ones to provide context).
I'm filing the issue here as the cockroach repo doesn't allow issues to be raised there.
When using terraform to create 3 instances concurrently, I hit the following error:
The error details as seen in the Nexus log file:
I tried looking for more details in the crdb zone (files under /data/logs) but didn't find anything corresponding to the above error.
This is the TF script I used: