oxidecomputer / omicron

Omicron: Oxide control plane

CRDB returned "RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict" when creating 3 instances concurrently #3304

Open · askfongjojo opened 1 year ago

askfongjojo commented 1 year ago

I'm filing the issue here as the cockroach repo doesn't allow issues to be raised there.

When using terraform to create 3 instances concurrently, I hit the following error:

$ terraform apply
...
Plan: 9 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

oxide_disk.boot[1]: Creating...
oxide_disk.data[1]: Creating...
oxide_disk.data[2]: Creating...
oxide_disk.boot[2]: Creating...
oxide_disk.boot[0]: Creating...
oxide_disk.data[0]: Creating...
oxide_disk.boot[2]: Creation complete after 1s [id=f5bf697b-abb8-4f37-aeb1-d92615a40e39]
oxide_disk.boot[1]: Creation complete after 1s [id=8b673f24-8d78-4a7f-8359-dd02e6cf6523]
oxide_disk.data[0]: Creation complete after 2s [id=05877d50-5704-4ceb-9f00-97188151c2fb]
oxide_disk.boot[0]: Creation complete after 2s [id=214d39de-831f-400a-9a4a-63aff15bf0d7]
oxide_disk.data[1]: Creation complete after 2s [id=fcaa726f-d0a1-4084-8598-f461feec4e8c]
oxide_disk.data[2]: Creation complete after 2s [id=9eacdd8b-1ff4-4cf2-9993-a248a44c0ed6]
oxide_instance.app_instance[1]: Creating...
oxide_instance.app_instance[2]: Creating...
oxide_instance.app_instance[0]: Creating...
oxide_instance.app_instance[2]: Creation complete after 5s [id=c237a828-8808-4ae6-a374-e0c136baa3ef]
oxide_instance.app_instance[1]: Creation complete after 5s [id=13ea6fea-7d99-49b2-a2ea-a1e5bc6fd843]
╷
│ Error: Error creating instance
│ 
│   with oxide_instance.app_instance[0],
│   on main.tf line 64, in resource "oxide_instance" "app_instance":
│   64: resource "oxide_instance" "app_instance" {
│ 
│ API error: HTTP 500 (https://recovery.sys.rack2.eng.oxide.computer/v1/instances?project=1ec35f61-0fb3-4fe7-a460-f930eff468f6) BODY -> {
│   "request_id": "d049a848-43cc-412e-8db0-cd84803ac7e5",
│   "error_code": "Internal",
│   "message": "Internal Server Error"
│ }
╵

The error details as seen in the Nexus log file:

06:28:19.490Z INFO 612405c5-68bb-41a8-905c-e729dfd7514d (dropshot_external): request completed
    error_message_external = Internal Server Error
    error_message_internal = saga ACTION error at node "server_id": unexpected database error: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict: committed value on key /Table/112/2/"\\xf3\\xafw\u{1c2}bK\u{413}\\xbc\\xed\\xcc_\\xbe\\xf8\\xbb"/"\\xd1JT\\xe2\\xb9\\"E\\x02\\xa0U\\xb2\\x85\\xed\\xe6\\x1dl"/0): "sql txn" meta={id=dd446363 key=/Table/112/1/"\u{571}x'c\\xf4E\\x18\\x85/U\\xff\u{28a}l\\xdf"/0 pri=0.01256657 epo=0 ts=1685946499.451418574,1 min=1685946499.422006755,0 seq=2} lock=true stat=PENDING rts=1685946499.422006755,0 wto=false gul=1685946499.922006755,0
    local_addr = 172.30.1.5:443
    method = POST
    remote_addr = 172.20.17.42:56786
    req_id = d049a848-43cc-412e-8db0-cd84803ac7e5
    response_code = 500
    uri = https://recovery.sys.rack2.eng.oxide.computer/v1/instances?project=1ec35f61-0fb3-4fe7-a460-f930eff468f6

I tried looking for more details in the crdb zone (files under /data/logs) but didn't find anything corresponding to the above error.

This is the TF script I used:

terraform {
  required_version = ">= 1.0"

  required_providers {
    oxide = {
      source  = "oxidecomputer/oxide"
      version = "0.1.0-dev"
    }
  }
}

provider "oxide" {}

data "oxide_project" "target" {
  name = "try-tf"
  timeouts = {
    read = "1m"
  }
}

data "oxide_image" "focal" {
  project_name = data.oxide_project.target.name
  name         = "ubuntu20"
  timeouts = {
    read = "1m"
  }
}

data "oxide_vpc" "default" {
  project_name = data.oxide_project.target.name
  name         = "default"
  timeouts = {
    read = "1m"
  }
}

data "oxide_vpc_subnet" "default" {
  project_name = data.oxide_project.target.name
  name         = "default"
  vpc_name     = data.oxide_vpc.default.name
  timeouts = {
    read = "1m"
  }
}

resource "oxide_disk" "boot" {
  project_id      = data.oxide_project.target.id
  description     = "app boot disk"
  name            = "app-boot-${count.index}"
  size            = var.ten_gib
  source_image_id = data.oxide_image.focal.id
  count           = 3
}

resource "oxide_disk" "data" {
  project_id      = data.oxide_project.target.id
  description     = "app data disk"
  name            = "app-data-${count.index}"
  size            = var.twenty_gib
  block_size      = 4096
  count           = 3
}

resource "oxide_instance" "app_instance" {
  project_id        = data.oxide_project.target.id
  description       = "application instance"
  name              = "app-instance-${count.index}"
  host_name         = "app${count.index}"
  memory            = var.one_gib
  ncpus             = 2
  count             = 3
  disk_attachments  = [ oxide_disk.boot[count.index].id, oxide_disk.data[count.index].id ]
  external_ips      = [
    {
      ip_pool_name  = "default"
      type          = "ephemeral"
    }
  ]
  user_data         = filebase64("./app-ci.yaml")
  start_on_create   = false
  network_interfaces = [
    {
      subnet_id   = data.oxide_vpc_subnet.default.id
      vpc_id      = data.oxide_vpc.default.id
      description = "net0"
      name        = "net0"
    }
  ]
  timeouts = {
    read   = "1m"
    create = "3m"
    delete = "2m"
  }
}
davepacheco commented 1 year ago

I have not examined this in detail. But it looks similar to oxidecomputer/remote-access-preview#23, which was resolved by #1688 (which amounted to simplifying the database transaction to avoid so many round-trips between Nexus and CockroachDB during which other transactions could come in and invalidate state).

It looks like in this case the query is this one: https://github.com/oxidecomputer/omicron/blob/f8c5a21054ff4f50ce8344699e6af7885a3d2554/nexus/db-queries/src/db/datastore/sled.rs#L90-L189

which is not nearly so complicated and only appears to involve two round-trips. It essentially seems to be saying: select a sled on which to put a reservation, then insert the reservation. Although this is one transaction, that doesn't mean that two concurrent executions are queued in the database. Instead, I think the database is free to just fail one of them with exactly this sort of error: https://www.cockroachlabs.com/docs/stable/transactions.html#transaction-retries
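
To make the shape of the problem concrete, the contended pattern is roughly the following (a minimal sketch with illustrative table and column names, not the actual query linked above):

BEGIN;

-- Read phase: choose a sled with enough spare capacity.
SELECT id
FROM sled
WHERE usable_hardware_threads - reserved_threads >= 2
LIMIT 1;

-- Write phase: record a reservation against the chosen sled.
INSERT INTO sled_resource (id, sled_id, hardware_threads_reserved)
VALUES (gen_random_uuid(), '<id from the SELECT above>', 2);

COMMIT;

-- If another transaction commits a conflicting write to the rows read above
-- between the SELECT and the COMMIT, CockroachDB aborts this transaction with
-- RETRY_SERIALIZABLE rather than queueing it behind the other one.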

Given that we're in the context of a saga, we could probably retry. But this seems unlikely to be a good answer in the medium-to-long term. In even a remotely busy system, I think we can expect retries to fail with some frequency. In that case, some fraction of provision attempts will still fail (because they run out of retries), and for the successful ones, provision latency could increase significantly (if they're doing a bunch of retries on average).
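
For reference, retrying at the SQL level would look something like CockroachDB's documented client-side retry protocol: the client wraps the transaction body in a special savepoint and replays it whenever the server returns a serialization failure (SQLSTATE 40001):

BEGIN;
SAVEPOINT cockroach_restart;

-- select-then-insert statements as in the sketch above

RELEASE SAVEPOINT cockroach_restart;
COMMIT;

-- On a 40001 error, the client issues ROLLBACK TO SAVEPOINT cockroach_restart
-- and replays the statements, typically with a capped number of attempts.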

This general issue is discussed in RFD 192 and applies to most uses of interactive transactions. Eventually I think we want to phrase this in some way that allows us to issue the whole transaction to the database at once (e.g., what we did in #1688). This still allows the database to issue a retry if needed, but also means the window for contention would be much, much smaller.
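
As an illustration (again using the made-up names from the sketch above), the select-then-insert could be phrased as a single statement, so the whole transaction reaches the database in one batch:

INSERT INTO sled_resource (id, sled_id, hardware_threads_reserved)
SELECT gen_random_uuid(), id, 2
FROM sled
WHERE usable_hardware_threads - reserved_threads >= 2
LIMIT 1;

-- With no client round-trip between the read and the write, the database can
-- retry automatically on conflict, and the window in which another transaction
-- can invalidate the read shrinks to the execution of a single statement.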

Again, caveat: I have not examined this particular case closely and it's possible there's something else going on here.

See also #973.

askfongjojo commented 1 year ago

With the recent VM lifecycle changes, the error now happens during instance start (when the instances all hit the same project for resource accounting):

    saga_id = eee51974-c3c1-4883-a5b2-7c2672f87fdb
    saga_name = instance-start
    sec_id = 65a11c18-7f59-41ac-b9e7-680627f996e7
04:28:21.304Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): failed to start newly-created instance
    error = InternalError { internal_message: "saga ACTION error at node \\"sled_id\\": unexpected database error: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict: committed value on key /Table/217/1/\\"_g \\\\xb8\\\\x8a1E\\\\xf8\\\\x8c\\\\x94\\\\x8ei\\\\x92\\\\x18\\\\xf2\\\\x8b\\"/0): \\"sql txn\\" meta={id=2e29c041 key=/Table/220/1/\\"\\\\xe5kQ\\\\xdd8WC\\\\xa3\\\\xa0\\\\xc1j\\\\xa8r\\\\x92\\\\f\\\\xe9\\"/0 pri=0.05421299 epo=0 ts=1698294501.245738365,1 min=1698294501.228055165,0 seq=2} lock=true stat=PENDING rts=1698294501.228055165,0 wto=false gul=1698294501.728055165,0" }
    file = nexus/src/app/instance.rs:264
    instance_id = a63eed60-a544-4ee2-af5d-dd6cc287574f

The frequency of the instance-start error is much lower (around 10-15%) than with concurrent instance creation. When provisioning 3 new instances with terraform, I would hit at least one 500 error about 90% of the time.

askfongjojo commented 10 months ago

Under very high concurrency (provisioning 120 instances in a single TF plan with the --parallelism parameter set to 20), I ran into a different TransactionRetry error: TransactionAbortedError(ABORT_REASON_PUSHER_ABORTED). This was on rack2, which has all of the transaction retry fixes.

root@[fd00:1122:3344:105::3]:32221/omicron> select * from saga_node_event where saga_id = '8670e14f-46bc-49fd-a7e0-b79a1aa21ec5';
                saga_id                | node_id |  event_type   |                                                                                                                                                                                                                                                                      data                                                                                                                                                                                                                                                                       |          event_time           |               creator
---------------------------------------+---------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------+---------------------------------------
  8670e14f-46bc-49fd-a7e0-b79a1aa21ec5 |       0 | started       | NULL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | 2023-12-14 07:00:23.083066+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
  8670e14f-46bc-49fd-a7e0-b79a1aa21ec5 |       0 | succeeded     | "e8b211d1-66c3-4dfc-a24d-b5e99ecd824e"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | 2023-12-14 07:00:23.090158+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
  8670e14f-46bc-49fd-a7e0-b79a1aa21ec5 |       0 | undo_finished | NULL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | 2023-12-14 07:00:25.310563+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
  8670e14f-46bc-49fd-a7e0-b79a1aa21ec5 |       0 | undo_started  | NULL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | 2023-12-14 07:00:25.29775+00  | 65a11c18-7f59-41ac-b9e7-680627f996e7
  8670e14f-46bc-49fd-a7e0-b79a1aa21ec5 |       1 | failed        | {"ActionFailed": {"source_error": {"InternalError": {"internal_message": "unexpected database error: restart transaction: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_PUSHER_ABORTED): \\"sql txn\\" meta={id=9744414f key=/Table/220/1/\\"\\\\xe8\\\\xb2\\\\x11\\\\xd1f\\\\xc3M\\\\xfc\\\\xa2M\\\\xb5\\\\xe9\\\\x9e\xcd\x82N\\"/0 pri=0.03354754 epo=1 ts=1702537224.926508273,0 min=1702537224.822377260,0 seq=0} lock=true stat=ABORTED rts=1702537224.926508273,0 wto=false gul=1702537225.322377260,0"}}}} | 2023-12-14 07:00:25.168454+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
  8670e14f-46bc-49fd-a7e0-b79a1aa21ec5 |       1 | started       | NULL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | 2023-12-14 07:00:23.100223+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
  8670e14f-46bc-49fd-a7e0-b79a1aa21ec5 |      10 | started       | NULL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | 2023-12-14 07:00:23.069572+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
  8670e14f-46bc-49fd-a7e0-b79a1aa21ec5 |      10 | succeeded     | null                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | 2023-12-14 07:00:23.07599+00  | 65a11c18-7f59-41ac-b9e7-680627f996e7
  8670e14f-46bc-49fd-a7e0-b79a1aa21ec5 |      10 | undo_finished | NULL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | 2023-12-14 07:00:25.33197+00  | 65a11c18-7f59-41ac-b9e7-680627f996e7
  8670e14f-46bc-49fd-a7e0-b79a1aa21ec5 |      10 | undo_started  | NULL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | 2023-12-14 07:00:25.323884+00 | 65a11c18-7f59-41ac-b9e7-680627f996e7
(10 rows)

I haven't been able to reproduce #3814 and #3786 in v5. We can consider closing some of these tickets, or all of the TransactionRetry ones, and filing a new ticket on the latest error (linked to the closed ones to provide context).