iliana opened 8 months ago
An interesting observation: the IP pool we added was 172.20.28.20-172.20.28.99, so 80 addresses. If one instance has an ephemeral IP and the rest do not, the maximum number of instances that can be launched is 316, because there are 79 remaining IPs to use for SNAT, and each IP can currently support 4 instances.
It's possible that there was a significant number of saga rollbacks happening due to `InsufficientCapacity: No external IP addresses available` errors.
@faithanalog and I tried launching 513 instances on madrid. madrid is a 4-gimlet system, presently running v5 bits (4ae84726c2d2382fa643b29e04585727e204148f). Each instance was configured with 2 vCPUs, 1 GiB memory, and a 2 GiB disk, all created from the same image. These were launched with Terraform, which attempts to launch ten instances at a time, using this configuration:
Terraform configuration
```terraform
terraform {
  required_providers {
    oxide = {
      source  = "oxidecomputer/oxide"
      version = "0.1.0"
    }
  }
}

provider "oxide" {}

variable "worker_count" {
  type    = number
  default = 512
}

variable "project_name" {
  type    = string
  default = "yeet"
}

variable "control_image_name" {
  type    = string
  default = "debian-12-genericcloud-amd64-20230910-1499"
}

variable "worker_image_name" {
  type    = string
  default = "debian-12-genericcloud-amd64-20230910-1499"
}

data "oxide_project" "default" {
  name = var.project_name
}

data "oxide_vpc" "default" {
  project_name = data.oxide_project.default.name
  name         = "default"
}

data "oxide_vpc_subnet" "default" {
  project_name = data.oxide_project.default.name
  vpc_name     = data.oxide_vpc.default.name
  name         = "default"
}

data "oxide_image" "control" {
  project_name = data.oxide_project.default.name
  name         = var.control_image_name
}

data "oxide_image" "worker" {
  project_name = data.oxide_project.default.name
  name         = var.worker_image_name
}

resource "oxide_disk" "control" {
  project_id      = data.oxide_project.default.id
  name            = "control"
  description     = "control"
  size            = data.oxide_image.control.size
  source_image_id = data.oxide_image.control.id
}

resource "oxide_instance" "control" {
  project_id       = data.oxide_project.default.id
  name             = "control"
  host_name        = "control"
  description      = "control"
  ncpus            = 2
  memory           = 1024 * 1024 * 1024
  disk_attachments = [oxide_disk.control.id]
  network_interfaces = [{
    name        = "control"
    description = "control"
    vpc_id      = data.oxide_vpc.default.id
    subnet_id   = data.oxide_vpc_subnet.default.id
  }]
  external_ips = [{ type = "ephemeral" }]
}

resource "oxide_disk" "worker" {
  count           = var.worker_count
  project_id      = data.oxide_project.default.id
  name            = "worker-${count.index}"
  description     = "worker-${count.index}"
  size            = data.oxide_image.worker.size
  source_image_id = data.oxide_image.worker.id
}

resource "oxide_instance" "worker" {
  count            = var.worker_count
  project_id       = data.oxide_project.default.id
  name             = "worker-${count.index}"
  host_name        = "worker-${count.index}"
  description      = "worker-${count.index}"
  ncpus            = 2
  memory           = 1024 * 1024 * 1024
  disk_attachments = [oxide_disk.worker[count.index].id]
  network_interfaces = [{
    name        = "control"
    description = "control"
    vpc_id      = data.oxide_vpc.default.id
    subnet_id   = data.oxide_vpc_subnet.default.id
  }]
  external_ips = []
}
```

After some time, this failed spectacularly.
A number of instances were stuck in the `creating` state, and another instance was stuck in the `starting` state. These sagas were wedged in a non-`done` state and never completed.

All five of these `instance-create` sagas show their final event in the database at 00:56:19; the two `instance-start` sagas show their final event one minute prior, at 00:55:19. Eight seconds later, Nexus's connection to CockroachDB closes and it receives an unexpected EOF. Nexus panics.
When this Nexus crashed, these seven sagas remained stuck. This is not ideal behavior. I am not sure whether we have any rate limiting in place at the moment, or where we might want to add or tune it. But it also seems particularly bad for sagas to become stuck if Nexus crashes. I couldn't find an open issue for that topic.
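For what it's worth, here is a minimal sketch of one shape such a limit could take: a cap on concurrently executing instance-create work using a `tokio::sync::Semaphore`. This is not necessarily how Nexus dispatches sagas; `run_instance_create_saga` and the other names are hypothetical stand-ins, just to make the rate-limiting question concrete:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Purely illustrative: cap how many instance-create operations run at once so
/// a burst of 500+ requests queues up instead of all hitting the database at
/// the same time. `run_instance_create_saga` is a hypothetical stand-in.
async fn create_instances(limit: usize, names: Vec<String>) {
    let semaphore = Arc::new(Semaphore::new(limit));
    let mut tasks = Vec::new();
    for name in names {
        // Wait for a free slot before spawning the next create.
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        tasks.push(tokio::spawn(async move {
            run_instance_create_saga(&name).await;
            drop(permit); // release the slot for the next queued request
        }));
    }
    for task in tasks {
        let _ = task.await;
    }
}

async fn run_instance_create_saga(name: &str) {
    // Placeholder for the real work.
    println!("creating {name}");
}

#[tokio::main]
async fn main() {
    let names = (0..20).map(|i| format!("worker-{i}")).collect();
    create_instances(4, names).await;
}
```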
We're about to clean-slate madrid and try to reproduce this; all the cores found on madrid, the zone bundle for the Nexus that crashed, and all of the CockroachDB logs can be found in `/staff/iliana/madrid-20231222`.