
Nexus crashed after database connection EOF under extreme load; sagas wedged #4730

Open iliana opened 8 months ago

iliana commented 8 months ago

@faithanalog and I tried launching 513 instances on madrid. madrid is a 4-gimlet system, presently running v5 bits (4ae84726c2d2382fa643b29e04585727e204148f). Each instance was configured with 2 vCPUs, 1 GiB memory, and a 2 GiB disk, all created from the same image. These were launched with Terraform, which attempts to launch ten instances at a time, using this configuration:

Terraform configuration:

```terraform
terraform {
  required_providers {
    oxide = {
      source  = "oxidecomputer/oxide"
      version = "0.1.0"
    }
  }
}

provider "oxide" {}

variable "worker_count" {
  type    = number
  default = 512
}

variable "project_name" {
  type    = string
  default = "yeet"
}

variable "control_image_name" {
  type    = string
  default = "debian-12-genericcloud-amd64-20230910-1499"
}

variable "worker_image_name" {
  type    = string
  default = "debian-12-genericcloud-amd64-20230910-1499"
}

data "oxide_project" "default" {
  name = var.project_name
}

data "oxide_vpc" "default" {
  project_name = data.oxide_project.default.name
  name         = "default"
}

data "oxide_vpc_subnet" "default" {
  project_name = data.oxide_project.default.name
  vpc_name     = data.oxide_vpc.default.name
  name         = "default"
}

data "oxide_image" "control" {
  project_name = data.oxide_project.default.name
  name         = var.control_image_name
}

data "oxide_image" "worker" {
  project_name = data.oxide_project.default.name
  name         = var.worker_image_name
}

resource "oxide_disk" "control" {
  project_id      = data.oxide_project.default.id
  name            = "control"
  description     = "control"
  size            = data.oxide_image.control.size
  source_image_id = data.oxide_image.control.id
}

resource "oxide_instance" "control" {
  project_id       = data.oxide_project.default.id
  name             = "control"
  host_name        = "control"
  description      = "control"
  ncpus            = 2
  memory           = 1024 * 1024 * 1024
  disk_attachments = [oxide_disk.control.id]
  network_interfaces = [{
    name        = "control"
    description = "control"
    vpc_id      = data.oxide_vpc.default.id
    subnet_id   = data.oxide_vpc_subnet.default.id
  }]
  external_ips = [{
    type = "ephemeral"
  }]
}

resource "oxide_disk" "worker" {
  count           = var.worker_count
  project_id      = data.oxide_project.default.id
  name            = "worker-${count.index}"
  description     = "worker-${count.index}"
  size            = data.oxide_image.worker.size
  source_image_id = data.oxide_image.worker.id
}

resource "oxide_instance" "worker" {
  count            = var.worker_count
  project_id       = data.oxide_project.default.id
  name             = "worker-${count.index}"
  host_name        = "worker-${count.index}"
  description      = "worker-${count.index}"
  ncpus            = 2
  memory           = 1024 * 1024 * 1024
  disk_attachments = [oxide_disk.worker[count.index].id]
  network_interfaces = [{
    name        = "control"
    description = "control"
    vpc_id      = data.oxide_vpc.default.id
    subnet_id   = data.oxide_vpc_subnet.default.id
  }]
  external_ips = []
}
```

After some time this failed spectacularly. A number of instances were stuck in the creating state, and another instance was stuck in the starting state. These sagas were wedged in a non-done state and never completed:

omicron=> select id, name, saga_state from saga where saga_state != 'done';
                  id                  |      name       | saga_state
--------------------------------------+-----------------+------------
 13cb04b4-9ba2-4d5c-9efd-7a6c7595682d | instance-create | unwinding
 158f42a0-af1d-48af-b421-9707b32d814f | instance-create | unwinding
 326a8ace-f1ee-4afa-8fda-492950bf3cfe | instance-create | unwinding
 5bbd297a-3ba1-4a65-bcb9-dea48c9ad74e | instance-start  | unwinding
 8afa9359-2ff6-467a-893f-59de792f8db3 | instance-create | unwinding
 e9f59e1e-1154-47fc-8237-b715b1bb4a8e | instance-start  | unwinding
 fc4a05b1-5003-46ab-a492-44a32ec3be08 | instance-create | running
(7 rows)

All five of these instance-create sagas show their final event in the database at 00:56:19; the two instance-start sagas show their final event one minute prior at 00:55:19.

root@oxz_switch:~# for saga in 13cb04b4-9ba2-4d5c-9efd-7a6c7595682d 158f42a0-af1d-48af-b421-9707b32d814f 326a8ace-f1ee-4afa-8fda-492950bf3cfe 5bbd297a-3ba1-4a65-bcb9-dea48c9ad74e 8afa9359-2ff6-467a-893f-59de792f8db3 e9f59e1e-1154-47fc-8237-b715b1bb4a8e fc4a05b1-5003-46ab-a492-44a32ec3be08; do /tmp/omdb db sagas show "$saga" 2>/dev/null | tail -n 1; done
13cb04b4-9ba2-4d5c-9efd-7a6c7595682d | 2023-12-21 00:56:19.593747 UTC | sub saga   1:   8: instance_create.create_network_interface                       | undo_finished |
158f42a0-af1d-48af-b421-9707b32d814f | 2023-12-21 00:56:19.580751 UTC |  35: instance_create.create_snat_ip                                               | failed     | {"ActionFailed":{"source_error":{"InsufficientCapacity":{"message":{"external_message":"No external IP addresses available","internal_context":"NextExternalIp::new returned NotFound"}}}}}
326a8ace-f1ee-4afa-8fda-492950bf3cfe | 2023-12-21 00:56:19.564268 UTC | sub saga   0:   4: instance_create.create_network_interface                       | undo_started  |
5bbd297a-3ba1-4a65-bcb9-dea48c9ad74e | 2023-12-21 00:55:19.453043 UTC |   7: instance_start.ensure_registered     | undo_started  |
8afa9359-2ff6-467a-893f-59de792f8db3 | 2023-12-21 00:56:19.589866 UTC | sub saga   4:  20: instance_create.create_network_interface                       | undo_started  |
e9f59e1e-1154-47fc-8237-b715b1bb4a8e | 2023-12-21 00:55:19.463633 UTC |   7: instance_start.ensure_registered     | undo_started  |
fc4a05b1-5003-46ab-a492-44a32ec3be08 | 2023-12-21 00:56:19.458817 UTC | sub saga   0:   4: instance_create.create_network_interface                       | started    |

Eight seconds later, Nexus's connection to CockroachDB closes and Nexus receives an unexpected EOF. Nexus panics:

00:56:27.501Z ERRO 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): failed to write Failed instance state to DB
    error = InternalError { internal_message: "database error (kind = Unknown): result is ambiguous: error=rpc error: code = Unavailable desc = error reading from server: EOF [exhausted]\\n" }
    file = nexus/src/app/instance.rs:1256
    instance_id = 5b39e400-4bc6-45e3-b7e1-55bcd6c2e752
00:56:27.502Z DEBG 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): roles
    roles = RoleSet { roles: {} }
00:56:27.502Z DEBG 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): roles
    roles = RoleSet { roles: {} }
00:56:27.502Z DEBG 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): authorize result
    action = Query
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000003, .. })
    resource = Database
    result = Ok(())
00:56:27.502Z DEBG 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): roles
    roles = RoleSet { roles: {} }
00:56:27.502Z DEBG 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): roles
    actor_id = 3fa23e3a-1270-4fc8-bc86-664ad71c0c42
    authenticated = true
    roles = RoleSet { roles: {} }
    saga_id = 326a8ace-f1ee-4afa-8fda-492950bf3cfe
    saga_name = instance-create
    saga_node = CreateNetworkInterface0
00:56:27.502Z DEBG 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): roles
    roles = RoleSet { roles: {} }
00:56:27.502Z DEBG 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): authorize result
    action = Query
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000003, .. })
    resource = Database
    result = Ok(())
00:56:27.503Z DEBG 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): roles
    actor_id = 3fa23e3a-1270-4fc8-bc86-664ad71c0c42
    authenticated = true
    roles = RoleSet { roles: {} }
    saga_id = 8afa9359-2ff6-467a-893f-59de792f8db3
    saga_name = instance-create
    saga_node = CreateNetworkInterface4
00:56:27.503Z DEBG 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): authorize result
    action = Query
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000003, .. })
    resource = Database
    result = Ok(())
00:56:27.503Z DEBG 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): authorize result
    action = Query
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000003, .. })
    resource = Database
    result = Ok(())
00:56:27.503Z DEBG 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): authorize result
    action = Query
    actor = Some(Actor::SiloUser { silo_user_id: 3fa23e3a-1270-4fc8-bc86-664ad71c0c42, silo_id: bbb6576a-ec03-4fb1-befa-a613dac81589, .. })
    actor_id = 3fa23e3a-1270-4fc8-bc86-664ad71c0c42
    authenticated = true
    resource = Database
    result = Ok(())
    saga_id = 326a8ace-f1ee-4afa-8fda-492950bf3cfe
    saga_name = instance-create
    saga_node = CreateNetworkInterface0
00:56:27.503Z DEBG 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): authorize result
    action = Query
    actor = Some(Actor::SiloUser { silo_user_id: 3fa23e3a-1270-4fc8-bc86-664ad71c0c42, silo_id: bbb6576a-ec03-4fb1-befa-a613dac81589, .. })
    actor_id = 3fa23e3a-1270-4fc8-bc86-664ad71c0c42
    authenticated = true
    resource = Database
    result = Ok(())
    saga_id = 8afa9359-2ff6-467a-893f-59de792f8db3
    saga_name = instance-create
    saga_node = CreateNetworkInterface4
00:56:27.504Z DEBG 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): roles
    roles = RoleSet { roles: {(Fleet, 001de000-1334-4000-8000-000000000000, "external-authenticator")} }
00:56:27.504Z INFO 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): marked instance as Failed
    file = nexus/src/app/instance.rs:1252
    instance_id = d9f6d912-3136-4a2b-84af-970b8f44b139
00:56:27.504Z DEBG 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): roles
    roles = RoleSet { roles: {} }
00:56:27.504Z DEBG 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): roles
    roles = RoleSet { roles: {(Fleet, 001de000-1334-4000-8000-000000000000, "external-authenticator")} }
00:56:27.504Z DEBG 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): authorize result
    action = Query
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000003, .. })
    resource = Database
    result = Ok(())
00:56:27.504Z DEBG 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): roles
    roles = RoleSet { roles: {(Fleet, 001de000-1334-4000-8000-000000000000, "external-authenticator")} }
00:56:27.504Z DEBG 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): roles
    roles = RoleSet { roles: {(Fleet, 001de000-1334-4000-8000-000000000000, "external-authenticator")} }
00:56:27.504Z DEBG 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): roles
    roles = RoleSet { roles: {(Fleet, 001de000-1334-4000-8000-000000000000, "external-authenticator")} }
00:56:27.505Z DEBG 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): roles
    roles = RoleSet { roles: {} }
00:56:27.505Z DEBG 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): authorize result
    action = Query
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000003, .. })
    resource = Database
    result = Ok(())
thread 'tokio-runtime-worker' panicked at nexus/db-queries/src/db/sec_store.rs:65:60:
called `Result::unwrap()` on an `Err` value: InternalError { internal_message: "database error (kind = Unknown): result is ambiguous: error=rpc error: code = Unavailable desc = error reading from server: EOF [exhausted]\n" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
00:56:27.506Z DEBG 804f9ff7-0d45-465f-8820-ee0fc7c25286 (ServerContext): roles
    actor_id = 3fa23e3a-1270-4fc8-bc86-664ad71c0c42
    authenticated = true
    roles = RoleSet { roles: {} }
    saga_id = 326a8ace-f1ee-4afa-8fda-492950bf3cfe
    saga_name = instance-create
    saga_node = CreateNetworkInterface0
[ Dec 21 00:56:37 Stopping because all processes in service exited. ]
[ Dec 21 00:56:37 Executing stop method (:kill). ]
[ Dec 21 00:56:37 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/nexus/bin/nexus /var/svc/manifest/site/nexus/config.toml &"). ]
[ Dec 21 00:56:37 Method "start" exited with status 0. ]

When this Nexus crashed, these seven sagas remained stuck. This is not ideal behavior. I am not sure whether we have any rate limiting in place at the moment, or where we might want to add or tune it. It also seems particularly bad for sagas to remain stuck when Nexus crashes; I couldn't find an open issue for that topic.
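The panic itself comes from an `unwrap()` on the saga-event write in `nexus/db-queries/src/db/sec_store.rs` (line 65 in this build), so a single ambiguous/EOF result from CockroachDB unwinds the tokio worker and takes the whole Nexus process down. Below is a minimal sketch of that failure mode and of a retry-with-backoff alternative; `DataStore`, `DbError`, and the `record_event_*` functions are hypothetical stand-ins for illustration, not the actual omicron code.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::time::Duration;

#[derive(Debug)]
enum DbError {
    // e.g. "result is ambiguous: ... error reading from server: EOF"
    Transient(String),
    Permanent(String),
}

// Hypothetical stand-in for the Nexus datastore, not the real omicron type.
struct DataStore {
    attempts: AtomicU32,
}

impl DataStore {
    async fn saga_create_event(&self, _event: &str) -> Result<(), DbError> {
        // Pretend the CockroachDB connection drops on the first two attempts.
        if self.attempts.fetch_add(1, Ordering::SeqCst) < 2 {
            Err(DbError::Transient("result is ambiguous: EOF".to_string()))
        } else {
            Ok(())
        }
    }
}

// Roughly the pattern the panic corresponds to: any error from the saga-event
// write unwinds the tokio worker and the whole process restarts.
#[allow(dead_code)]
async fn record_event_unwrap(store: &DataStore, event: &str) {
    store.saga_create_event(event).await.unwrap();
}

// One possible alternative: retry transient errors with backoff so a dropped
// database connection doesn't take Nexus (and every in-flight saga) down.
async fn record_event_with_retry(store: &DataStore, event: &str) -> Result<(), DbError> {
    let mut delay = Duration::from_millis(100);
    loop {
        match store.saga_create_event(event).await {
            Ok(()) => return Ok(()),
            Err(DbError::Transient(msg)) => {
                eprintln!("transient saga-event write failure, retrying: {msg}");
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(Duration::from_secs(10));
            }
            Err(e) => return Err(e),
        }
    }
}

#[tokio::main]
async fn main() {
    let store = DataStore { attempts: AtomicU32::new(0) };
    record_event_with_retry(&store, "node 4: undo_started")
        .await
        .expect("permanent database error");
    println!("saga event eventually recorded");
}
```

Whether unbounded retries are the right policy here (versus surfacing the error to the saga executor) is a separate design question; the sketch only illustrates that the ambiguous-result error is transient from the connection's point of view.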

We're about to clean-slate madrid and try to reproduce this; all the cores found on madrid, the zone bundle for the Nexus that crashed, and all of the CockroachDB logs can be found in /staff/iliana/madrid-20231222.

iliana commented 8 months ago

An interesting observation: The IP pool we added was 172.20.28.20-172.20.28.99, so 80 addresses. If one instance has an ephemeral IP and the rest do not, the maximum number of instances that can be launched is 316, because there are 79 remaining IPs to use for SNAT, and each IP can currently support 4 instances.
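For reference, a back-of-the-envelope check of that arithmetic (a sketch only; the 16,384-port SNAT block size, i.e. four blocks per external IP, is an assumption about the current allocator corresponding to "each IP can currently support 4 instances"):

```rust
// Sketch of the capacity math above; block size is an illustrative assumption.
fn main() {
    let pool_size: u32 = 80; // 172.20.28.20 - 172.20.28.99, inclusive
    let ephemeral_ips: u32 = 1; // one instance holds a whole IP as ephemeral
    let snat_slots_per_ip: u32 = 65_536 / 16_384; // = 4 instances per IP

    let max_snat_instances = (pool_size - ephemeral_ips) * snat_slots_per_ip;
    println!("max instances backed by SNAT: {max_snat_instances}"); // 79 * 4 = 316
}
```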

It's possible that a significant number of saga rollbacks were happening due to `InsufficientCapacity: No external IP addresses available` errors.