askfongjojo opened this issue 9 months ago
In /data/crucible.json, I saw this region with a tombstoned status:
"f49d1dfe-cd14-406a-bf1d-93919cdee59d": {
"id": "f49d1dfe-cd14-406a-bf1d-93919cdee59d",
"state": "tombstoned",
"block_size": 512,
"extent_size": 131072,
"extent_count": 1024,
"encrypted": true,
"port_number": 19005,
"cert_pem": null,
"key_pem": null,
"root_pem": null
}
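For reference, assuming "extent_size" counts blocks per extent (matching the region table's blocks_per_extent column), the logical size of this tombstoned region works out to 64 GiB. A quick sketch of the arithmetic:

```python
# Sketch: compute the logical size of a Crucible region from its
# /data/crucible.json fields, assuming extent_size is in blocks:
#   bytes = block_size * extent_size (blocks per extent) * extent_count
region = {
    "block_size": 512,
    "extent_size": 131072,
    "extent_count": 1024,
}

size_bytes = region["block_size"] * region["extent_size"] * region["extent_count"]
print(size_bytes)            # 68719476736
print(size_bytes / 2**30)    # 64.0 (GiB)
```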
This dataset is still taking up space according to zfs list. I'm not sure if this could be the issue.
This is the space that the control plane thinks is in use:
root@[fd00:1122:3344:105::3]:32221/omicron> select * from dataset where id = '23dca27d-c79b-4930-a817-392e8aeaa4c1';
id | time_created | time_modified | time_deleted | rcgen | pool_id | ip | port | kind | size_used
---------------------------------------+-------------------------------+-------------------------------+--------------+-------+--------------------------------------+-----------------------+-------+----------+---------------
23dca27d-c79b-4930-a817-392e8aeaa4c1 | 2023-08-30 18:59:10.758512+00 | 2023-08-30 18:59:10.758512+00 | NULL | 1 | 57650e05-36ff-4de8-865f-b9562bdb67f5 | fd00:1122:3344:105::e | 32345 | crucible | 642097610752
(1 row)
We know the control plane accounting won't match the physical space consumed, but maybe it is off by too much in this case. The concern is that customers are going to run into this same problem soon.
This duplicates oxidecomputer/crucible#861 but is more about how we can prevent the out-of-space situation. I've updated the bug description to indicate that.
How does the control plane populate that "size_used" column on the dataset table? Crucible reserves more space for the region than the user requests, and I bet we never told the control plane about that.
Yeah, that's where the update is. I (or someone) will have to walk that back to where nexus figures out what size it is using for the dataset, then find some way to have it use the same multiplier we use in crucible, keeping the two in sync.
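The gap described above can be illustrated with a small sketch. The reservation factor below is hypothetical (the real multiplier lives in the crucible agent and would need to be shared, not duplicated), but it shows why the dataset's size_used of 858993459200 (exactly the 800 GiB logical disk size) undercounts what the agent actually reserves:

```python
# Sketch only: keep nexus's size_used in sync with the space the crucible
# agent actually reserves for a region. RESERVATION_FACTOR is a hypothetical
# stand-in for the agent's real multiplier.
RESERVATION_FACTOR = 1.25

def reserved_bytes(block_size: int, blocks_per_extent: int, extent_count: int) -> int:
    """Logical region size, scaled by the reservation overhead."""
    logical = block_size * blocks_per_extent * extent_count
    return int(logical * RESERVATION_FACTOR)

# An 800 GiB disk (the 12800-extent regions in the saga output below)
# would then account for 1000 GiB against the dataset, not 800 GiB:
print(reserved_bytes(512, 131072, 12800))   # 1073741824000
```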
While attempting to get the crucible agent back online, I realized that I had missed digging deeper into why there were tombstoned regions. The "Dataset does not exist" error actually came from the same region/downstairs with the id d9968d58-5e4f-4349-aaf9-f88024ebf8b4 - the one that could not get created when the dataset was out of space. The disk state was turned into faulted after reaching some timeout, and the region was eventually tombstoned:
root@[fd00:1122:3344:105::3]:32221/omicron> select id, name, disk_state, volume_id, time_created, time_deleted from disk where volume_id in (select volume_id from region
where id = 'd9968d58-5e4f-4349-aaf9-f88024ebf8b4');
id | name | disk_state | volume_id | time_created | time_deleted
---------------------------------------+-------------------+------------+--------------------------------------+-------------------------------+---------------
993a2f7a-8bcf-4be5-ae2b-5ca1bfbf250c | sb-mysql-4c16g-12 | faulted | 9aeaf7e8-c7dc-4a3e-8db5-47099c36acef | 2023-10-05 23:51:20.837965+00 | NULL
23:51:42.956Z INFO crucible-agent (datafile): region d9968d58-5e4f-4349-aaf9-f88024ebf8b4 state: Requested -> Tombstoned
Normally, on start of the crucible agent SMF service, it destroys the tombstoned regions and cleans up all their artifacts. But the above downstairs was stuck in a situation that could never be cleaned up: the agent kept trying in vain to delete a region that had never been created in the first place. We may need to teach crucible how to handle this situation, though the root cause is still that the control plane doesn't understand the real capacity.
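One way to break the retry loop described above is to make the cleanup idempotent. This is only a sketch (not the agent's actual code): if the region's dataset was never created, treat the delete as already complete instead of erroring forever.

```python
# Sketch (hypothetical, not the crucible agent's actual logic): make
# tombstoned-region cleanup idempotent, so a region whose dataset was
# never created (e.g. the create failed with "out of space") can still
# reach Destroyed instead of wedging the agent in a retry loop.
def destroy_region(region_id: str, dataset_exists) -> str:
    if not dataset_exists(region_id):
        # Nothing to destroy: the create never completed. Report the
        # delete as done rather than failing on "Dataset does not exist".
        return "destroyed"
    # ... otherwise destroy the child dataset and remove artifacts ...
    return "destroyed"

# The stuck region from this ticket: its dataset never existed.
print(destroy_region("d9968d58-5e4f-4349-aaf9-f88024ebf8b4", lambda _id: False))
```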
The situation was made worse when the crucible agent panicked and running sagas ended up creating regions in crucible while CRDB considered them gone. There are 3 such orphan regions in that crucible zone that have no volume or region records because of the failed disk_create saga. Here is an example (region id: 65b01b8d-ccbf-4dbb-adf9-14eb5d41eeb0):
root@[fd00:1122:3344:105::3]:32221/omicron> select event_type,data from saga_node_event where saga_id = '59466dd9-aa07-4948-a95c-aab6de6e705f';
event_type | data
----------------+----------------------------------------
started | NULL
succeeded | "b97593f4-616c-442c-999b-5210da2eab57"
undo_finished | NULL
undo_started | NULL
started | NULL
succeeded | "cb68ad8a-64ab-4ce6-bede-b60ddef9aed5"
undo_finished | NULL
undo_started | NULL
started | NULL
succeeded | {"block_size": "Traditional", "create_image_id": "cb6eb1e9-69fd-40ad-9373-83926f8b32d9", "create_snapshot_id": null, "identity": {"description": "cb6eb1e9-69fd-40ad-9373-83926f8b32d9 test instance ", "id": "b97593f4-616c-442c-999b-5210da2eab57", "name": "one-each-32c-128m", "time_created": "2023-09-03T05:29:10.293407Z", "time_deleted": null, "time_modified": "2023-09-03T05:29:10.293407Z"}, "pantry_address": null, "project_id": "5e49b6de-cb2d-438d-83af-95c415bbb901", "rcgen": 1, "runtime_state": {"attach_instance_id": null, "disk_state": "creating", "gen": 1, "time_updated": "2023-09-03T05:29:10.293406Z"}, "size": 858993459200, "slot": null, "volume_id": "cb68ad8a-64ab-4ce6-bede-b60ddef9aed5"}
undo_finished | NULL
undo_started | NULL
started | NULL
succeeded | [[{"identity": {"id": "23dca27d-c79b-4930-a817-392e8aeaa4c1", "time_created": "2023-08-30T18:59:10.758512Z", "time_modified": "2023-08-30T18:59:10.758512Z"}, "ip": "fd00:1122:3344:105::e", "kind": "Crucible", "pool_id": "57650e05-36ff-4de8-865f-b9562bdb67f5", "port": 32345, "rcgen": 1, "size_used": 858993459200, "time_deleted": null}, {"block_size": 512, "blocks_per_extent": 131072, "dataset_id": "23dca27d-c79b-4930-a817-392e8aeaa4c1", "extent_count": 12800, "identity": {"id": "65b01b8d-ccbf-4dbb-adf9-14eb5d41eeb0", "time_created": "2023-09-03T05:29:10.362952Z", "time_modified": "2023-09-03T05:29:10.362952Z"}, "volume_id": "cb68ad8a-64ab-4ce6-bede-b60ddef9aed5"}], [{"identity": {"id": "1876cdcf-b2e7-4b79-ad2e-67df716e1860", "time_created": "2023-08-30T18:59:10.758513Z", "time_modified": "2023-08-30T18:59:10.758513Z"}, "ip": "fd00:1122:3344:10a::8", "kind": "Crucible", "pool_id": "d4c6bdc6-5e99-4f6c-b57a-9bfcb9a76be4", "port": 32345, "rcgen": 1, "size_used": 966367641600, "time_deleted": null}, {"block_size": 512, "blocks_per_extent": 131072, "dataset_id": "1876cdcf-b2e7-4b79-ad2e-67df716e1860", "extent_count": 12800, "identity": {"id": "0d0bfa0f-a2b9-46b4-b8fc-a211c439611f", "time_created": "2023-09-03T05:29:10.362952Z", "time_modified": "2023-09-03T05:29:10.362952Z"}, "volume_id": "cb68ad8a-64ab-4ce6-bede-b60ddef9aed5"}], [{"identity": {"id": "4d20175a-588b-44b8-8b9c-b16c6c3a97a0", "time_created": "2023-08-30T18:59:10.758511Z", "time_modified": "2023-08-30T18:59:10.758511Z"}, "ip": "fd00:1122:3344:108::b", "kind": "Crucible", "pool_id": "a726cacd-fa35-4ed2-ade6-31ad928b24cb", "port": 32345, "rcgen": 1, "size_used": 2695091978240, "time_deleted": null}, {"block_size": 512, "blocks_per_extent": 131072, "dataset_id": "4d20175a-588b-44b8-8b9c-b16c6c3a97a0", "extent_count": 12800, "identity": {"id": "aad5cb12-f873-4d4c-b888-7db99bf0a186", "time_created": "2023-09-03T05:29:10.362952Z", "time_modified": "2023-09-03T05:29:10.362952Z"}, "volume_id": 
"cb68ad8a-64ab-4ce6-bede-b60ddef9aed5"}]]
undo_finished | NULL
undo_started | NULL
started | NULL
succeeded | null
undo_finished | NULL
undo_started | NULL
failed | {"ActionFailed": {"source_error": {"InternalError": {"internal_message": "Communication Error: error sending request for url (http://[fd00:1122:3344:108::b]:32345/crucible/0/regions): error trying to connect: tcp connect error: Connection refused (os error 146)"}}}}
started | NULL
started | NULL
succeeded | null
undo_finished | NULL
undo_started | NULL
(26 rows)
root@[fd00:1122:3344:105::3]:32221/omicron> select * from region where volume_id = 'cb68ad8a-64ab-4ce6-bede-b60ddef9aed5';
id | time_created | time_modified | dataset_id | volume_id | block_size | blocks_per_extent | extent_count
-----+--------------+---------------+------------+-----------+------------+-------------------+---------------
(0 rows)
Time: 2ms total (execution 2ms / network 0ms)
root@[fd00:1122:3344:105::3]:32221/omicron> select * from disk where volume_id = 'cb68ad8a-64ab-4ce6-bede-b60ddef9aed5';
id | name | description | time_created | time_modified | time_deleted | rcgen | project_id | volume_id | disk_state | attach_instance_id | state_generation | slot | time_state_updated | size_bytes | block_size | origin_snapshot | origin_image | pantry_address
---------------------------------------+-------------------+-----------------------------------------------------+-------------------------------+-------------------------------+-------------------------------+-------+--------------------------------------+--------------------------------------+------------+--------------------+------------------+------+-------------------------------+--------------+------------+-----------------+--------------------------------------+-----------------
b97593f4-616c-442c-999b-5210da2eab57 | one-each-32c-128m | cb6eb1e9-69fd-40ad-9373-83926f8b32d9 test instance | 2023-09-03 05:29:10.293407+00 | 2023-09-03 05:29:10.293407+00 | 2023-09-03 05:29:31.641912+00 | 1 | 5e49b6de-cb2d-438d-83af-95c415bbb901 | cb68ad8a-64ab-4ce6-bede-b60ddef9aed5 | destroyed | NULL | 1 | NULL | 2023-09-03 05:29:10.293406+00 | 858993459200 | 512 | NULL | cb6eb1e9-69fd-40ad-9373-83926f8b32d9 | NULL
(1 row)
As part of the fix for this ticket, we probably need a support tool to recalculate the space used, fix the values in the region table, purge orphan regions, and so on. The utility can then be used against our internal lab environments and customer racks.
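The accounting pass of such a tool might look like the sketch below: recompute each dataset's size_used from its region rows and compare against the stored value. The region rows here are made-up examples; a real tool would read them from CRDB.

```python
# Sketch of a support-tool accounting pass: recompute size_used per
# dataset by summing the logical size of its regions. Example rows are
# illustrative only (dataset_id, block_size, blocks_per_extent, extent_count).
from collections import defaultdict

regions = [
    ("23dca27d-c79b-4930-a817-392e8aeaa4c1", 512, 131072, 12800),  # 800 GiB region
    ("23dca27d-c79b-4930-a817-392e8aeaa4c1", 512, 131072, 1024),   # 64 GiB region
]

size_used = defaultdict(int)
for dataset_id, bs, bpe, ec in regions:
    size_used[dataset_id] += bs * bpe * ec

# Compare the recomputed totals against the dataset table's size_used
# column to find datasets whose accounting has drifted.
for dataset_id, used in size_used.items():
    print(dataset_id, used)
```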
For posterity: I manually deleted the entry for d9968d58-5e4f-4349-aaf9-f88024ebf8b4 from /data/crucible.json and ran svcadm clear agent for now to bring up the agent. This or other crucible zones that have orphan region issues may soon run into the "out of space" error again.
The problem crucible dataset is on rack2, in cubby 9. First, it hit an "out of space" error when it attempted to recreate one of the datasets:
In /pool/ext/d1cb6b7d-2b92-4b7d-8a4d-551987f0277e/crypt/debug/oxz_crucible_23dca27d-c79b-4930-a817-392e8aeaa4c1/oxide-crucible-agent:default.log.1696555801, I saw that a new region child dataset failed to be created. Subsequent retries failed with an out-of-space error until 2023-10-05T23:51:43.014916396Z. After that, the error turned into "Dataset does not exist!":
But the dataset and mountpoint seemed to already exist, and I was able to list the files under it: