oxidecomputer / omicron

Omicron: Oxide control plane

Cannot start instance because `507 Insufficient Storage` #5104

Open · leftwo opened 8 months ago

leftwo commented 8 months ago

On dogfood, while attempting to run the system out of resources, I encountered an unexpected failure.

I seem to have enough CPU, Memory, and Storage resources available:

alan@atrium:many-instances$ oxide utilization
{
  "capacity": {
    "cpus": 960,
    "memory": 8367025789338,
    "storage": 38654705664000
  },
  "provisioned": {
    "cpus": 723,
    "memory": 2057289334784,
    "storage": 16909286244352
  }
}

However, when I attempt to start an instance:

alan@atrium:many-instances$ oxide instance start --project alan --instance garrett-inst-100
error
Error Response: status: 507 Insufficient Storage; headers: {"content-type": "application/json", "x-request-id": "3665f612-8ecf-4500-8cd0-6304a80c5ec2", "content-length": "177", "date": "Tue, 20 Feb 2024 00:15:46 GMT"}; value: Error { error_code: Some("InsufficientCapacity"), message: "Insufficient capacity: No sleds can fit the requested instance", request_id: "3665f612-8ecf-4500-8cd0-6304a80c5ec2" }

My already created instance has a disk that is also already created:

alan@atrium:many-instances$ oxide instance view --project alan --instance garrett-inst-100
{
  "description": "loop host",
  "hostname": "fff-100",
  "id": "f8be046f-8bab-45f6-8c35-58de24b33099",
  "memory": 4294967296,
  "name": "garrett-inst-100",
  "ncpus": 64,
  "project_id": "759beaf2-517d-4d24-bc17-1eed69bc8801",
  "run_state": "stopped",
  "time_created": "2024-02-20T00:05:02.314662Z",
  "time_modified": "2024-02-20T00:05:02.314662Z",
  "time_run_state_updated": "2024-02-20T00:05:02.314662Z"
}
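
For reference, the aggregate headroom implied by the utilization output above is 960 - 723 = 237 vCPUs and roughly 8.37 TB - 2.06 TB ≈ 6.3 TB of memory, both far more than this instance's 64 vCPUs and 4 GiB (4294967296 bytes) of memory.
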
leftwo commented 8 months ago

Nexus log at: /staff/core/issues/omicron-5104/oxide-nexus:default.log

leftwo commented 8 months ago

The actual error in Nexus is this:

01:44:35.487Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): saga log event
    new_state = N001 failed
    sec_id = 65a11c18-7f59-41ac-b9e7-680627f996e7
01:44:35.487Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): recording saga event
    event_type = Failed(ActionFailed { source_error: Object {"InsufficientCapacity": Object {"message": Object {"external_message": String("No sleds can fit the requested instance"), "internal_context": String("No sled targets found that had enough capacity to fit the requested instance.")}}} })
    node_id = 1
    saga_id = 4a233e70-65e2-4d33-8dc5-8f33edcba746
01:44:35.489Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): update for saga cached state

Then, concluding with:

01:44:35.509Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): saga finished
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/steno-0.4.0/src/sec.rs:996
    saga_id = 4a233e70-65e2-4d33-8dc5-8f33edcba746
    saga_name = instance-start
    sec_id = 65a11c18-7f59-41ac-b9e7-680627f996e7
01:44:35.509Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (dropshot_external): request completed
    error_message_external = Insufficient capacity: No sleds can fit the requested instance
    error_message_internal = No sleds can fit the requested instance (with internal context: saga ACTION error at node "sled_id": No sled targets found that had enough capacity to fit the requested instance.)
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/711a749/dropshot/src/server.rs:837
    latency_us = 136165
    local_addr = 172.30.2.5:443
    method = POST
    remote_addr = 172.20.3.69:45840
    req_id = bdabaf4b-b545-49e4-86d3-7ec8136b5112
    response_code = 507
    uri = //v1/instances/garrett-inst-100/start?project=alan
zephraph commented 8 months ago

This isn't the quota check that's failing. As you said, there's plenty of virtual capacity available. The InsufficientCapacity error here originates from the sled_reservation_create method in nexus/db-queries/src/db/datastore/sled.rs.

There are a few things being checked for each candidate sled, essentially the columns of the sled_resource table: whether the already-reserved hardware threads, RSS RAM, and reservoir RAM, plus the new reservation, still fit within that sled's limits (a simplified sketch follows below).

For this error to happen, one or more of the above checks would need to fail for every sled.
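
Roughly, the logic looks like the sketch below. This is a simplified illustration with hypothetical type and function names, not the actual sled_reservation_create implementation:

struct SledBudget {
    hardware_threads: u32,
    rss_ram: u64,
    reservoir_ram: u64,
}

struct Reservation {
    hardware_threads: u32,
    rss_ram: u64,
    reservoir_ram: u64,
}

// True if `request` fits on a sled with budget `budget` that already
// carries `reserved` (the summed sled_resource rows for that sled).
fn fits(budget: &SledBudget, reserved: &Reservation, request: &Reservation) -> bool {
    reserved.hardware_threads + request.hardware_threads <= budget.hardware_threads
        && reserved.rss_ram + request.rss_ram <= budget.rss_ram
        && reserved.reservoir_ram + request.reservoir_ram <= budget.reservoir_ram
}

// "No sled targets found" corresponds to no sled passing the check.
fn pick_sled<'a>(
    sleds: &'a [(SledBudget, Reservation)],
    request: &Reservation,
) -> Option<&'a SledBudget> {
    sleds
        .iter()
        .find(|(budget, reserved)| fits(budget, reserved, request))
        .map(|(budget, _)| budget)
}

fn main() {
    // Two made-up sleds, both with a hypothetical 128-thread budget.
    let budget = || SledBudget { hardware_threads: 128, rss_ram: 1 << 40, reservoir_ram: 1 << 40 };
    let sleds = vec![
        (budget(), Reservation { hardware_threads: 100, rss_ram: 1 << 39, reservoir_ram: 1 << 39 }),
        (budget(), Reservation { hardware_threads: 70, rss_ram: 1 << 38, reservoir_ram: 1 << 38 }),
    ];
    // A 64-vCPU, 4 GiB request: neither sled has 64 threads of headroom,
    // so no target is found -- the situation behind the 507.
    let request = Reservation { hardware_threads: 64, rss_ram: 4 << 30, reservoir_ram: 4 << 30 };
    println!("fits on some sled: {}", pick_sled(&sleds, &request).is_some());
}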

leftwo commented 8 months ago

I dumped the sled_resource table, pulled it into Google Sheets, sorted by sled_id, and totaled up the columns for each sled:

sled_id kind hardware_threads rss_ram reservoir_ram
0c7011f7-a4bf-4daf-90cc-1c2410103301 total 128 399431958528 90194313216
a2adea92-b56e-44fc-8a0d-7d63b5fd3b94 total 88 98784247808 240518168576
b886b58a-1e3f-4be1-b9f2-0c2e66c6bc89 total 94 47244640256 227633266688
db183874-65b5-4263-a1c1-ddb2737ae0e9 total 127 165356240896 483183820800
dd83e75a-1edf-4aa1-89a0-cd8b2091a7cd total 92 208305913856 90194313216
f15774c1-b8e5-434f-a493-ec43f96cba07 total 69 62277025792 156766306304
5f6720b8-8a31-45f8-8c94-8e699218f28b total 85 115964116992 186831077376
7b862eb6-7f50-4c2f-b9a6-0d12ac913d3c total 71 94489280512 146028888064
71def415-55ad-46b4-ba88-3ca55d7fb288 total 88 77309411328 352187318272
87c2c4fc-b0c7-4fef-a305-78f0ed265bbc total 92 367219703808 150323855360
2707b587-9c7f-4fb0-a7af-37c3b7a9a0fa total 98 83751862272 281320357888

So, yeah, I don't think there is a single sled where a 64-vCPU instance could land.
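
For a sense of scale (the per-sled thread budget isn't shown in this issue, so the 128 here is purely illustrative): even the least-loaded sled above, f15774c1 with 69 threads reserved, would only have 128 - 69 = 59 threads free under a hypothetical 128-thread budget, which is short of the 64 vCPUs this instance needs, even though the rack as a whole has far more than 64 threads unreserved.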

leftwo commented 8 months ago

Other than the slightly misleading (I think...) 507 Insufficient Storage error, I'm not sure there is an actual bug here. The error does continue with Insufficient capacity: No sleds can fit the requested instance, which is an accurate message.

It might be nice to have an easy way to show the user: yes, there are enough resources available in aggregate for your request, but they don't all exist on a single sled.

zephraph commented 8 months ago

Yeah, I agree with that. I also don't like the 507 Insufficient Storage. I think we were getting kind of clever with that error code, but it's more confusing than anything. It was specifically designed for WebDAV, so it doesn't really make sense for our use anyway.

I guess there's something here to think about in terms of top-level utilization. I was just going to show the sums of everything, but I suspect it'll also be good to have a per-sled breakdown for the cases when this bin-packing problem crops up. I'll sleep on it.