oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

sled-agent does not create enough crucible zones #2407

Closed leftwo closed 1 month ago

leftwo commented 1 year ago

On the dogfood system (BRM42220070, gimlet in cubby 26) I've installed Omicron.

I have 10 U.2 SSDs and 2 M.2s.

BRM42220070 # diskinfo
TYPE    DISK                    VID      PID              SIZE          RMV SSD
NVME    c1t00A0750132753688d0   NVMe     Micron_7300_MTFDHBG1T9TDF 1788.50 GiB   no  yes
NVME    c2t0014EE81000D2E9Dd0   NVMe     WUS4C6432DSP3X3  2980.82 GiB   no  yes
NVME    c3t0014EE81000D2EEAd0   NVMe     WUS4C6432DSP3X3  2980.82 GiB   no  yes
NVME    c4t0014EE81000D3035d0   NVMe     WUS4C6432DSP3X3  2980.82 GiB   no  yes
NVME    c5t0014EE81000D2EE9d0   NVMe     WUS4C6432DSP3X3  2980.82 GiB   no  yes
NVME    c6t0014EE81000D2FFAd0   NVMe     WUS4C6432DSP3X3  2980.82 GiB   no  yes
NVME    c7t0014EE81000D2F5Bd0   NVMe     WUS4C6432DSP3X3  2980.82 GiB   no  yes
NVME    c8t00A0750132753657d0   NVMe     Micron_7300_MTFDHBG1T9TDF 1788.50 GiB   no  yes
NVME    c9t0014EE81000D307Fd0   NVMe     WUS4C6432DSP3X3  2980.82 GiB   no  yes
NVME    c10t0014EE81000D2E50d0  NVMe     WUS4C6432DSP3X3  2980.82 GiB   no  yes
NVME    c11t0014EE81000D2E4Dd0  NVMe     WUS4C6432DSP3X3  2980.82 GiB   no  yes
NVME    c12t0014EE81000D2E4Ad0  NVMe     WUS4C6432DSP3X3  2980.82 GiB   no  yes

I can see sled-agent has created 13 pools for me: one root pool, two on the M.2s, and 10 on the U.2s:

BRM42220070 # zpool list
NAME                                       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
oxp_18100d5d-c050-455f-af98-ab2943df8909  2.91T   216K  2.91T        -         -     0%     0%  1.00x    ONLINE  -
oxp_4484682d-6289-4b6a-b74e-f8950e06108f  2.91T   102K  2.91T        -         -     0%     0%  1.00x    ONLINE  -
oxp_6fa48e35-99ef-48fa-81db-f69cc03f03b4  2.91T   210K  2.91T        -         -     0%     0%  1.00x    ONLINE  -
oxp_857a4e84-1147-49ac-9c01-55202904f3c6  2.91T   104K  2.91T        -         -     0%     0%  1.00x    ONLINE  -
oxp_90643ca8-e019-4e08-a8e0-0109ecb705ec  2.91T   102K  2.91T        -         -     0%     0%  1.00x    ONLINE  -
oxp_a527b2b2-e35c-44e1-9eac-c1aeef3a1c9e  2.91T   102K  2.91T        -         -     0%     0%  1.00x    ONLINE  -
oxp_ac2cb3cf-6c91-4f47-8c7e-141784487f5a   748G  1.23G   747G        -         -     0%     0%  1.00x    ONLINE  -
oxp_b97c5f27-f29f-40a6-a4e6-0edbca57269e   748G   792K   748G        -         -     0%     0%  1.00x    ONLINE  -
oxp_bc8ddd59-2b88-43de-82e9-a9768182d9fc  2.91T   219K  2.91T        -         -     0%     0%  1.00x    ONLINE  -
oxp_ce466faa-3bb7-4c95-9663-36549a98c521  2.91T   644K  2.91T        -         -     0%     0%  1.00x    ONLINE  -
oxp_edfd169c-5846-4e24-a947-1b2367d1d859  2.91T   219K  2.91T        -         -     0%     0%  1.00x    ONLINE  -
oxp_f1d6fc67-a647-4995-ae1c-5dc0ec381d80  2.91T   218K  2.91T        -         -     0%     0%  1.00x    ONLINE  -
rpool                                     15.9G  3.90G  12.0G        -         -     4%    24%  1.00x    ONLINE  -

However, sled-agent itself has created only 8 crucible zones:

BRM42220070 # zoneadm list
global
oxz_internal_dns
oxz_cockroachdb_oxp_ac2cb3cf-6c91-4f47-8c7e-141784487f5a
oxz_clickhouse_oxp_ac2cb3cf-6c91-4f47-8c7e-141784487f5a
oxz_crucible_oxp_ac2cb3cf-6c91-4f47-8c7e-141784487f5a
oxz_crucible_oxp_edfd169c-5846-4e24-a947-1b2367d1d859
oxz_crucible_oxp_18100d5d-c050-455f-af98-ab2943df8909
oxz_crucible_oxp_6fa48e35-99ef-48fa-81db-f69cc03f03b4
oxz_crucible_oxp_b97c5f27-f29f-40a6-a4e6-0edbca57269e
oxz_crucible_oxp_ce466faa-3bb7-4c95-9663-36549a98c521
oxz_crucible_oxp_f1d6fc67-a647-4995-ae1c-5dc0ec381d80
oxz_crucible_oxp_bc8ddd59-2b88-43de-82e9-a9768182d9fc
oxz_oximeter
oxz_nexus
oxz_crucible_pantry

Two of these are on the M.2 devices, so we really have six crucible zones on the U.2s out of the 10 expected. For some reason the additional zones were never created.
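For reference, a quick way to see which pools are missing a crucible zone is to diff the pool names against the zone names. This is a rough sketch with the lists copied from the output above; on a live sled you'd feed it `zpool list -Ho name | grep '^oxp_'` and `zoneadm list` instead:

```shell
# Pool and zone names copied verbatim from the listings above.
pools='oxp_18100d5d-c050-455f-af98-ab2943df8909
oxp_4484682d-6289-4b6a-b74e-f8950e06108f
oxp_6fa48e35-99ef-48fa-81db-f69cc03f03b4
oxp_857a4e84-1147-49ac-9c01-55202904f3c6
oxp_90643ca8-e019-4e08-a8e0-0109ecb705ec
oxp_a527b2b2-e35c-44e1-9eac-c1aeef3a1c9e
oxp_ac2cb3cf-6c91-4f47-8c7e-141784487f5a
oxp_b97c5f27-f29f-40a6-a4e6-0edbca57269e
oxp_bc8ddd59-2b88-43de-82e9-a9768182d9fc
oxp_ce466faa-3bb7-4c95-9663-36549a98c521
oxp_edfd169c-5846-4e24-a947-1b2367d1d859
oxp_f1d6fc67-a647-4995-ae1c-5dc0ec381d80'
zones='oxz_crucible_oxp_ac2cb3cf-6c91-4f47-8c7e-141784487f5a
oxz_crucible_oxp_edfd169c-5846-4e24-a947-1b2367d1d859
oxz_crucible_oxp_18100d5d-c050-455f-af98-ab2943df8909
oxz_crucible_oxp_6fa48e35-99ef-48fa-81db-f69cc03f03b4
oxz_crucible_oxp_b97c5f27-f29f-40a6-a4e6-0edbca57269e
oxz_crucible_oxp_ce466faa-3bb7-4c95-9663-36549a98c521
oxz_crucible_oxp_f1d6fc67-a647-4995-ae1c-5dc0ec381d80
oxz_crucible_oxp_bc8ddd59-2b88-43de-82e9-a9768182d9fc'
# Flag every pool that has no matching oxz_crucible_ zone.
missing=""
for p in $pools; do
  case "$zones" in
    *"oxz_crucible_$p"*) ;;              # a crucible zone exists for this pool
    *) missing="$missing $p" ;;          # no zone: pool arrived after the plan
  esac
done
echo "pools with no crucible zone:$missing"
```

Against the listings above this flags exactly four of the 2.91T pools (4484682d, 857a4e84, 90643ca8, a527b2b2), matching the six-of-ten count.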

I've attached the sled agent log (sled-agent.log).

smklein commented 1 year ago

I think this is actually working as intended, even though it's weird. Lemme justify why, and tell me if you think I'm still wrong.

Evidence

Hypothesis

So, here's the order of events I think is happening:

  1. Sled Agent sees some, but not all disks on the system.
  2. RSS starts executing, queries for parsed disks, and creates a plan, because enough disks exist for a deployment.
  3. Sled Agent sees the rest of the attached disks on the system. NOTE: This is actually indistinguishable from a disk being hot-plugged later!

As a result, a zpool is initialized for each late-arriving disk, but no crucible service is ever provisioned on it.
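The timeline above can be modeled as a one-shot snapshot (toy shell, not Omicron code; all the names are made up):

```shell
# Toy model of the race: RSS snapshots the disk set once and plans against it.
disks="d1 d2"                                           # 1. partial enumeration
plan=""
for d in $disks; do plan="$plan crucible_on_$d"; done   # 2. one-shot RSS plan
disks="$disks d3"                                       # 3. d3 appears after the plan is fixed
echo "disks:$disks"
echo "plan: $plan"                                      # d3 gets a zpool but no crucible entry
```

Nothing ever revisits `plan` after step 2, which is why a late disk (or a hot-plugged one) ends up with a pool and no service.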

Who provisions Crucible?

Right now, only RSS is responsible for the provisioning of Crucible datasets. However, in the future, Nexus should maintain this responsibility, as documented by https://github.com/oxidecomputer/omicron/issues/732 .

So, right now, if it's not part of the original RSS plan, it never gets provisioned. Later, we'd like Nexus to notice these late-addition zpools, and provision Crucible datasets to them (among other services).
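The reconciliation Nexus would eventually do might look like the sketch below (hypothetical; `provision_crucible`, the pool names, and the zone names are all stand-ins, not real Omicron interfaces). The desired state is one crucible zone per `oxp_` pool, and the loop provisions whatever is missing, regardless of when the pool appeared:

```shell
# Hypothetical reconciliation loop: compare desired state (one crucible zone
# per oxp_ pool) against actual zones, and act on the difference.
provisioned=""
provision_crucible() {                  # stand-in for a Nexus-driven action
  provisioned="$provisioned $1"
  echo "would provision crucible on $1"
}
pools="oxp_aaaa oxp_bbbb oxp_cccc"      # every pool, including late arrivals
zones="oxz_crucible_oxp_aaaa"           # crucible zones that already exist
for p in $pools; do
  case "$zones" in
    *"oxz_crucible_$p"*) ;;             # already has a zone; nothing to do
    *) provision_crucible "$p" ;;       # late-addition pool: fill the gap
  esac
done
```

Because the loop compares against current state rather than a plan computed once at rack setup, rerunning it after a hot-plug converges on the right answer.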

leftwo commented 1 year ago

So, in summary, this is not a bug, but a not-yet-finished part of Nexus. As long as that work is tracked elsewhere, we can probably close this.

Is there anything we could do on the dogfood system to delay RSS from starting, to give the pools time to show up?

smklein commented 1 month ago

I believe this issue is fixed by the reconfigurator, which provisions crucible zones on all active physical disks in the control plane.