Open leftwo opened 8 months ago
Nexus log at: /staff/core/issues/omicron-5104/oxide-nexus:default.log
The actual error in Nexus is this:
```
01:44:35.487Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): saga log event
    new_state = N001 failed
    sec_id = 65a11c18-7f59-41ac-b9e7-680627f996e7
01:44:35.487Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): recording saga event
    event_type = Failed(ActionFailed { source_error: Object {"InsufficientCapacity": Object {"message": Object {"external_message": String("No sleds can fit the requested instance"), "internal_context": String("No sled targets found that had enough capacity to fit the requested instance.")}}} })
    node_id = 1
    saga_id = 4a233e70-65e2-4d33-8dc5-8f33edcba746
01:44:35.489Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): update for saga cached state
```
Then, concluding with:
```
01:44:35.509Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): saga finished
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/steno-0.4.0/src/sec.rs:996
    saga_id = 4a233e70-65e2-4d33-8dc5-8f33edcba746
    saga_name = instance-start
    sec_id = 65a11c18-7f59-41ac-b9e7-680627f996e7
01:44:35.509Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (dropshot_external): request completed
    error_message_external = Insufficient capacity: No sleds can fit the requested instance
    error_message_internal = No sleds can fit the requested instance (with internal context: saga ACTION error at node "sled_id": No sled targets found that had enough capacity to fit the requested instance.)
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/711a749/dropshot/src/server.rs:837
    latency_us = 136165
    local_addr = 172.30.2.5:443
    method = POST
    remote_addr = 172.20.3.69:45840
    req_id = bdabaf4b-b545-49e4-86d3-7ec8136b5112
    response_code = 507
    uri = //v1/instances/garrett-inst-100/start?project=alan
```
This isn't the quota check that's failing. As you said, there's plenty of virtual capacity available. The `InsufficientCapacity` error here is originating from the `sled_reservation_create` method in `nexus/db-queries/src/db/datastore/sled.rs`.
There are a few things being checked: for this error to happen, one or more of those checks would need to fail for every sled.
I dumped the `sled_resource` table, dragged it into Google Sheets, sorted by `sled_id`, then totaled up the columns:
| sled_id | kind | hardware_threads | rss_ram | reservoir_ram |
|---|---|---|---|---|
0c7011f7-a4bf-4daf-90cc-1c2410103301 | total | 128 | 399431958528 | 90194313216 |
a2adea92-b56e-44fc-8a0d-7d63b5fd3b94 | total | 88 | 98784247808 | 240518168576 |
b886b58a-1e3f-4be1-b9f2-0c2e66c6bc89 | total | 94 | 47244640256 | 227633266688 |
db183874-65b5-4263-a1c1-ddb2737ae0e9 | total | 127 | 165356240896 | 483183820800 |
dd83e75a-1edf-4aa1-89a0-cd8b2091a7cd | total | 92 | 208305913856 | 90194313216 |
f15774c1-b8e5-434f-a493-ec43f96cba07 | total | 69 | 62277025792 | 156766306304 |
5f6720b8-8a31-45f8-8c94-8e699218f28b | total | 85 | 115964116992 | 186831077376 |
7b862eb6-7f50-4c2f-b9a6-0d12ac913d3c | total | 71 | 94489280512 | 146028888064 |
71def415-55ad-46b4-ba88-3ca55d7fb288 | total | 88 | 77309411328 | 352187318272 |
87c2c4fc-b0c7-4fef-a305-78f0ed265bbc | total | 92 | 367219703808 | 150323855360 |
2707b587-9c7f-4fb0-a7af-37c3b7a9a0fa | total | 98 | 83751862272 | 281320357888 |
So, yeah, I don't think there is a single sled where a 64 CPU instance could land.
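As a sanity check, the headroom math can be sketched in Rust. Everything here is illustrative: the uniform 128-thread per-sled capacity is an assumption (real capacity comes from each sled's inventory, not from `sled_resource`), and the reserved counts are the `hardware_threads` column from the table above, with sled ids shortened.

```rust
/// Returns the sleds (by short id) that could still fit `requested` threads,
/// given each sled's reserved thread count and an assumed uniform capacity.
fn sleds_that_fit(reserved: &[(&str, u32)], capacity: u32, requested: u32) -> Vec<String> {
    reserved
        .iter()
        .filter(|&&(_, used)| capacity.saturating_sub(used) >= requested)
        .map(|&(id, _)| id.to_string())
        .collect()
}

fn main() {
    // Reserved hardware_threads per sled, from the table above (ids shortened).
    let reserved = [
        ("0c7011f7", 128), ("a2adea92", 88), ("b886b58a", 94),
        ("db183874", 127), ("dd83e75a", 92), ("f15774c1", 69),
        ("5f6720b8", 85), ("7b862eb6", 71), ("71def415", 88),
        ("87c2c4fc", 92), ("2707b587", 98),
    ];
    let capacity = 128; // hypothetical uniform per-sled thread capacity

    // Plenty of threads free in aggregate...
    let aggregate: u32 = reserved.iter().map(|&(_, u)| capacity - u).sum();
    println!("aggregate headroom: {} threads", aggregate);

    // ...but under this assumed capacity, no single sled has 64 free.
    let fits = sleds_that_fit(&reserved, capacity, 64);
    assert!(fits.is_empty(), "expected no sled to fit the request");
}
```

Even granting every sled the largest observed capacity, the biggest single-sled headroom in this dump is well under 64 threads, which matches the saga's conclusion.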
Other than the slightly misleading (I think..) `507 Insufficient Storage` error, I'm not sure there is an actual bug here. The error does continue with `Insufficient capacity: No sleds can fit the requested instance`, which is a correct message.
It might be nice to have an easy way to reflect to the user, yes, there is the correct amount of resources available for your request, but they don't exist on a single sled.
Yeah, I agree with that. I also don't like the `507 Insufficient Storage`. I think we were getting kind of clever with that error code, but it's more confusing than anything. That code was specifically designed to be used with WebDAV, so it doesn't really make sense for our use anyway.
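One possible shape for that, purely as a hypothetical sketch (none of these names are the actual Nexus types): have the capacity error distinguish "the rack is out of resources" from "resources exist in aggregate but are fragmented across sleds", and map both to a non-WebDAV status code.

```rust
// Hypothetical sketch: separate true exhaustion from the bin-packing case,
// and avoid WebDAV's 507 for both. These types do not exist in Nexus today.
#[derive(Debug, PartialEq)]
enum InsufficientCapacity {
    /// Not enough of the resource anywhere in the rack.
    Exhausted,
    /// Enough in aggregate, but no single sled can fit the request.
    Fragmented { aggregate_free_threads: u32, requested_threads: u32 },
}

fn external_response(err: &InsufficientCapacity) -> (u16, String) {
    match err {
        InsufficientCapacity::Exhausted => (
            503,
            "Insufficient capacity: the requested resources are not available".to_string(),
        ),
        InsufficientCapacity::Fragmented { aggregate_free_threads, requested_threads } => (
            503,
            format!(
                "Insufficient capacity: {} threads are free in aggregate, but no \
                 single sled can fit the requested {} threads",
                aggregate_free_threads, requested_threads
            ),
        ),
    }
}

fn main() {
    let err = InsufficientCapacity::Fragmented {
        aggregate_free_threads: 376, // example values only
        requested_threads: 64,
    };
    let (code, msg) = external_response(&err);
    println!("{} {}", code, msg);
}
```

507 is defined by RFC 4918 (WebDAV), so 503 or a documented 4xx is a more conventional fit here; the exact code is a design call this issue hasn't settled.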
I guess there's something here to think about in terms of top-level utilization. I was just going to show the summations of everything, but I suspect it'll also be good to get a per-sled breakdown for the cases when this bin-packing problem crops up. I'll sleep on it.
On dogfood, while attempting to run the system out of resources, I encountered an unexpected failure.
I seem to have enough CPU, Memory, and Storage resources available:
However, when I attempt to start an instance:
My already created instance has a disk that is also already created: