oxidecomputer / omicron

Omicron: Oxide control plane

Cannot start instance because `507 Insufficient Storage` #5104

Open · leftwo opened 8 months ago

leftwo commented 8 months ago

On dogfood, while attempting to run the system out of resources, I encountered an unexpected failure.

I seem to have enough CPU, Memory, and Storage resources available:

alan@atrium:many-instances$ oxide utilization
{
  "capacity": {
    "cpus": 960,
    "memory": 8367025789338,
    "storage": 38654705664000
  },
  "provisioned": {
    "cpus": 723,
    "memory": 2057289334784,
    "storage": 16909286244352
  }
}

However, when I attempt to start an instance:

alan@atrium:many-instances$ oxide instance start --project alan --instance garrett-inst-100
error
Error Response: status: 507 Insufficient Storage; headers: {"content-type": "application/json", "x-request-id": "3665f612-8ecf-4500-8cd0-6304a80c5ec2", "content-length": "177", "date": "Tue, 20 Feb 2024 00:15:46 GMT"}; value: Error { error_code: Some("InsufficientCapacity"), message: "Insufficient capacity: No sleds can fit the requested instance", request_id: "3665f612-8ecf-4500-8cd0-6304a80c5ec2" }

My already created instance has a disk that is also already created:

alan@atrium:many-instances$ oxide instance view --project alan --instance garrett-inst-100
{
  "description": "loop host",
  "hostname": "fff-100",
  "id": "f8be046f-8bab-45f6-8c35-58de24b33099",
  "memory": 4294967296,
  "name": "garrett-inst-100",
  "ncpus": 64,
  "project_id": "759beaf2-517d-4d24-bc17-1eed69bc8801",
  "run_state": "stopped",
  "time_created": "2024-02-20T00:05:02.314662Z",
  "time_modified": "2024-02-20T00:05:02.314662Z",
  "time_run_state_updated": "2024-02-20T00:05:02.314662Z"
}
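
For reference, the aggregate headroom implied by the utilization output above is 960 - 723 = 237 vCPUs and roughly 8.37 TB - 2.06 TB ≈ 6.3 TB of memory, both far more than this instance's 64 vCPUs and 4 GiB (4294967296 bytes) of memory.
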
leftwo commented 8 months ago

Nexus log at: /staff/core/issues/omicron-5104/oxide-nexus:default.log

leftwo commented 8 months ago

The actual error in Nexus is this:

01:44:35.487Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): saga log event
    new_state = N001 failed
    sec_id = 65a11c18-7f59-41ac-b9e7-680627f996e7
01:44:35.487Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): recording saga event
    event_type = Failed(ActionFailed { source_error: Object {"InsufficientCapacity": Object {"message": Object {"external_message": String("No sleds can fit the requested instance"), "internal_context": String("No sled targets found that had enough capacity to fit the requested instance.")}}} })
    node_id = 1
    saga_id = 4a233e70-65e2-4d33-8dc5-8f33edcba746
01:44:35.489Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): update for saga cached state

Then, concluding with:

01:44:35.509Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): saga finished
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/steno-0.4.0/src/sec.rs:996
    saga_id = 4a233e70-65e2-4d33-8dc5-8f33edcba746
    saga_name = instance-start
    sec_id = 65a11c18-7f59-41ac-b9e7-680627f996e7
01:44:35.509Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (dropshot_external): request completed
    error_message_external = Insufficient capacity: No sleds can fit the requested instance
    error_message_internal = No sleds can fit the requested instance (with internal context: saga ACTION error at node "sled_id": No sled targets found that had enough capacity to fit the requested instance.)
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/711a749/dropshot/src/server.rs:837
    latency_us = 136165
    local_addr = 172.30.2.5:443
    method = POST
    remote_addr = 172.20.3.69:45840
    req_id = bdabaf4b-b545-49e4-86d3-7ec8136b5112
    response_code = 507
    uri = //v1/instances/garrett-inst-100/start?project=alan
zephraph commented 8 months ago

This isn't the quota check that's failing. As you said, there's plenty of virtual capacity available. The InsufficientCapacity error here originates from the sled_reservation_create method in nexus/db-queries/src/db/datastore/sled.rs.

There are a few things being checked for each candidate sled, essentially the columns of the sled_resource table: whether the already-reserved hardware threads, RSS RAM, and reservoir RAM, plus the new reservation, still fit within that sled's limits (a simplified sketch follows below).

For this error to happen, one or more of the above checks would need to fail for every sled.
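
Roughly, the logic looks like the sketch below. This is a simplified illustration with hypothetical type and function names, not the actual sled_reservation_create implementation:

struct SledBudget {
    hardware_threads: u32,
    rss_ram: u64,
    reservoir_ram: u64,
}

struct Reservation {
    hardware_threads: u32,
    rss_ram: u64,
    reservoir_ram: u64,
}

// True if `request` fits on a sled with budget `budget` that already
// carries `reserved` (the summed sled_resource rows for that sled).
fn fits(budget: &SledBudget, reserved: &Reservation, request: &Reservation) -> bool {
    reserved.hardware_threads + request.hardware_threads <= budget.hardware_threads
        && reserved.rss_ram + request.rss_ram <= budget.rss_ram
        && reserved.reservoir_ram + request.reservoir_ram <= budget.reservoir_ram
}

// "No sled targets found" corresponds to no sled passing the check.
fn pick_sled<'a>(
    sleds: &'a [(SledBudget, Reservation)],
    request: &Reservation,
) -> Option<&'a SledBudget> {
    sleds
        .iter()
        .find(|(budget, reserved)| fits(budget, reserved, request))
        .map(|(budget, _)| budget)
}

fn main() {
    // Two made-up sleds, both with a hypothetical 128-thread budget.
    let budget = || SledBudget { hardware_threads: 128, rss_ram: 1 << 40, reservoir_ram: 1 << 40 };
    let sleds = vec![
        (budget(), Reservation { hardware_threads: 100, rss_ram: 1 << 39, reservoir_ram: 1 << 39 }),
        (budget(), Reservation { hardware_threads: 70, rss_ram: 1 << 38, reservoir_ram: 1 << 38 }),
    ];
    // A 64-vCPU, 4 GiB request: neither sled has 64 threads of headroom,
    // so no target is found -- the situation behind the 507.
    let request = Reservation { hardware_threads: 64, rss_ram: 4 << 30, reservoir_ram: 4 << 30 };
    println!("fits on some sled: {}", pick_sled(&sleds, &request).is_some());
}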

leftwo commented 8 months ago

I dumped the sled_resource table, pulled it into Google Sheets, sorted by sled_id, and totaled up the columns for each sled:

sled_id kind hardware_threads rss_ram reservoir_ram
0c7011f7-a4bf-4daf-90cc-1c2410103301 total 128 399431958528 90194313216
a2adea92-b56e-44fc-8a0d-7d63b5fd3b94 total 88 98784247808 240518168576
b886b58a-1e3f-4be1-b9f2-0c2e66c6bc89 total 94 47244640256 227633266688
db183874-65b5-4263-a1c1-ddb2737ae0e9 total 127 165356240896 483183820800
dd83e75a-1edf-4aa1-89a0-cd8b2091a7cd total 92 208305913856 90194313216
f15774c1-b8e5-434f-a493-ec43f96cba07 total 69 62277025792 156766306304
5f6720b8-8a31-45f8-8c94-8e699218f28b total 85 115964116992 186831077376
7b862eb6-7f50-4c2f-b9a6-0d12ac913d3c total 71 94489280512 146028888064
71def415-55ad-46b4-ba88-3ca55d7fb288 total 88 77309411328 352187318272
87c2c4fc-b0c7-4fef-a305-78f0ed265bbc total 92 367219703808 150323855360
2707b587-9c7f-4fb0-a7af-37c3b7a9a0fa total 98 83751862272 281320357888

So, yeah, I don't think there is a single sled where a 64-vCPU instance could land.
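
For a sense of scale (the per-sled thread budget isn't shown in this issue, so the 128 here is purely illustrative): even the least-loaded sled above, f15774c1 with 69 threads reserved, would only have 128 - 69 = 59 threads free under a hypothetical 128-thread budget, which is short of the 64 vCPUs this instance needs, even though the rack as a whole has far more than 64 threads unreserved.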

leftwo commented 8 months ago

Other than the slightly misleading (I think...) 507 Insufficient Storage error, I'm not sure there is an actual bug here. The error does continue with Insufficient capacity: No sleds can fit the requested instance, which is an accurate message.

It might be nice to have an easy way to show the user: yes, there are enough resources available in aggregate for your request, but they don't all exist on a single sled.

zephraph commented 8 months ago

Yeah, I agree with that. I also don't like the 507 Insufficient Storage. I think we were getting kind of clever with that error code, but it's more confusing than anything. It was specifically designed for WebDAV, so it doesn't really make sense for our use anyway.

I guess there's something here to think about in terms of top-level utilization. I was just going to show the sums of everything, but I suspect it'll also be good to have a per-sled breakdown for the cases when this bin-packing problem crops up. I'll sleep on it.