oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

system using only 8 of 10 U.2s on madrid BRM42220004 #5128

Open davepacheco opened 6 months ago

davepacheco commented 6 months ago

This was originally observed under #5111 (where this sled went through the "add sled" flow). But we also found that after a fresh install of Omicron that included BRM42220004 (without the "add sled" flow), the same issue happened and there were only 8 Crucible zones on it.

The errors are:

00:26:53.592Z ERRO SledAgent (StorageManager): Persistent error:not queueing disk
    disk_id = DiskIdentity { vendor: "1b96", serial: "A079E3F8", model: "WUS4C6432DSP3X3" }
    err = PooledDisk(BadPartitionLayout { path: "/devices/pci@ab,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC440,0:wd,raw", why: "Expected 0 partitions, only saw 1" })
    file = sled-storage/src/manager.rs:472
00:26:54.179Z ERRO SledAgent (StorageManager): Persistent error:not queueing disk
    disk_id = DiskIdentity { vendor: "1b96", serial: "A079DE8D", model: "WUS4C6432DSP3X3" }
    err = PooledDisk(ZpoolCreate(CreateError { err: Execution(CommandFailure(CommandFailureInfo { command: "/usr/sbin/zpool create oxp_d6149e62-ae84-4209-b943-4053fb9a8713 /devices/pci@38,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC129,0:a", status: ExitStatus(unix_wait_status(256)), stdout: "", stderr: "cannot open '/devices/pci@38,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC129,0:a': No such device or address\\n" })) }))
    file = sled-storage/src/manager.rs:472
00:28:50.331Z ERRO SledAgent (StorageManager): Persistent error:not queueing disk
    disk_id = DiskIdentity { vendor: "1b96", serial: "A079DE8D", model: "WUS4C6432DSP3X3" }
    err = PooledDisk(ZpoolCreate(CreateError { err: Execution(CommandFailure(CommandFailureInfo { command: "/usr/sbin/zpool create oxp_6ad8e920-ad0c-4629-8208-19dbf938a354 /devices/pci@38,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC129,0:a", status: ExitStatus(unix_wait_status(256)), stdout: "", stderr: "cannot open '/devices/pci@38,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC129,0:a': No such device or address\\n" })) }))
    file = sled-storage/src/manager.rs:472
00:28:50.605Z ERRO SledAgent (StorageManager): Persistent error:not queueing disk
    disk_id = DiskIdentity { vendor: "1b96", serial: "A079E3F8", model: "WUS4C6432DSP3X3" }
    err = PooledDisk(BadPartitionLayout { path: "/devices/pci@ab,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC440,0:wd,raw", why: "Expected 0 partitions, only saw 1" })
    file = sled-storage/src/manager.rs:472
smklein commented 6 months ago

The call-stack is coming from here:

parse_partition_types: https://github.com/oxidecomputer/omicron/blob/65ebf72288b16467dc6457bb4c3bfa90830000db/sled-hardware/src/illumos/partitions.rs#L46-L52

Which is called for U.2s here: https://github.com/oxidecomputer/omicron/blob/65ebf72288b16467dc6457bb4c3bfa90830000db/sled-hardware/src/illumos/partitions.rs#L147-L150

For U.2s, we only expect a single partition: the one holding the ZFS Zpool: https://github.com/oxidecomputer/omicron/blob/65ebf72288b16467dc6457bb4c3bfa90830000db/sled-hardware/src/illumos/partitions.rs#L37

Given the surrounding context in internal_ensure_partition_layout, here's what I suspect is happening:

  1. Sled Agent sees a raw disk from libdevinfo. It's a U.2.
  2. Sled Agent tries to ensure this disk has a GPT with the right partitions. First, it checks if the GPT exists.
  3. AFAICT, the GPT does exist. I think we're taking this pathway: https://github.com/oxidecomputer/omicron/blob/65ebf72288b16467dc6457bb4c3bfa90830000db/sled-hardware/src/illumos/partitions.rs#L101-L105
  4. This means we aren't trying to write anything. It just checks that "oh, someone already made the GPT, let's just see if it's formatted okay".
  5. It's not. We bail.
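
For illustration, the check in steps 4 and 5 can be sketched roughly as below. This is a simplified model, not the code in partitions.rs: `Partition`, `BadPartitionLayout`, `U2_EXPECTED`, and `check_u2_layout` are hypothetical names, and the input is assumed to be the list of in-use partitions already parsed from the existing GPT.

```rust
/// Hypothetical partition kinds for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Partition {
    ZfsPool,
    Other,
}

/// Hypothetical error, loosely shaped like the `BadPartitionLayout` log line.
#[derive(Debug, PartialEq)]
struct BadPartitionLayout {
    why: String,
}

/// A U.2 is expected to carry exactly one partition: the one holding the zpool.
const U2_EXPECTED: [Partition; 1] = [Partition::ZfsPool];

/// Validate the partitions of an existing GPT. Nothing is written to the disk;
/// a mismatch is reported and the caller bails.
fn check_u2_layout(found: &[Partition]) -> Result<(), BadPartitionLayout> {
    if found.len() != U2_EXPECTED.len() {
        return Err(BadPartitionLayout {
            why: format!(
                "expected {} partitions, saw {}",
                U2_EXPECTED.len(),
                found.len()
            ),
        });
    }
    for (index, (got, want)) in found.iter().zip(U2_EXPECTED.iter()).enumerate() {
        if got != want {
            return Err(BadPartitionLayout {
                why: format!("partition {} is {:?}, expected {:?}", index, got, want),
            });
        }
    }
    Ok(())
}
```

Under this model, a disk whose GPT exists but lists no partitions fails the count check and is never formatted, which matches the "already has a GPT" path suspected above.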

So, to summarize:

jgallagher commented 6 months ago

> 3. AFAICT, the GPT does exist. I think we're taking this pathway: https://github.com/oxidecomputer/omicron/blob/65ebf72288b16467dc6457bb4c3bfa90830000db/sled-hardware/src/illumos/partitions.rs#L101-L105

We are indeed in this path. From the same log file, just before the errors above:

00:26:53.592Z INFO SledAgent (StorageManager): Disk at /devices/pci@ab,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC440,0 already has a GPT
    file = sled-hardware/src/illumos/partitions.rs:103
00:28:50.605Z INFO SledAgent (StorageManager): Disk at /devices/pci@ab,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC440,0 already has a GPT
    file = sled-hardware/src/illumos/partitions.rs:103
smklein commented 6 months ago

Here's what I got from MDB, poking at one of these devices:

BRM42220004 # mdb /dev/rdsk/c10t0014EE81000BC440d0p0
> ::load disk_label
> 
> ::help gpt

NAME
  gpt - dump an EFI GPT

SYNOPSIS
  [ addr ] ::gpt [-ag]

DESCRIPTION
  Display an EFI GUID Partition Table.

  -a Display the alternate GPT
  -g Show unique GUID for each table entry

ATTRIBUTES

  Target: raw
  Module: disk_label
  Interface Stability: Unstable
>
> ::gpt
Signature: EFI PART (valid)
Revision: 1.0
HeaderSize: 92 bytes
HeaderCRC32: 0x2bd8ebf7 (should be 0x2bd8ebf7)
Reserved1: 0 (should be 0x0)
MyLBA: 1 (should be 1)
AlternateLBA: 6251233967
FirstUsableLBA: 34
LastUsableLBA: 6251233934
DiskGUID: 6601dc84-5899-e48f-877b-a768969d4f59
PartitionEntryLBA: 2
NumberOfPartitionEntries: 0
SizeOfPartitionEntry: 0x80 bytes
PartitionEntryArrayCRC32: 0 (should be 0)

PART TYPE                STARTLBA      ENDLBA        ATTR     NAME
> ::gpt -a
Signature: EFI PART (valid)
Revision: 1.0
HeaderSize: 92 bytes
HeaderCRC32: 0x603f95a (should be 0x603f95a)
Reserved1: 0 (should be 0x0)
MyLBA: 6251233967 (should be 6251233967)
AlternateLBA: 1
FirstUsableLBA: 34
LastUsableLBA: 6251233934
DiskGUID: 6601dc84-5899-e48f-877b-a768969d4f59
PartitionEntryLBA: 6251233935
NumberOfPartitionEntries: 0
SizeOfPartitionEntry: 0x80 bytes
PartitionEntryArrayCRC32: 0 (should be 0)

PART TYPE                STARTLBA      ENDLBA        ATTR     NAME
smklein commented 6 months ago

Okay, I'm able to access some raw blocks from this device with MDB. Here we go!

Note: Our block size is 512, which in hex is 0x200.
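
Since mdb addresses the device by byte offset, converting between LBAs and offsets comes up repeatedly. A tiny helper, assuming the 512-byte block size noted above (the function and constant names are made up for this sketch):

```rust
/// Logical block size assumed throughout this thread: 512 bytes (0x200).
const BLOCK_SIZE: u64 = 512;

/// Byte offset of the first byte of block `lba`.
fn lba_to_offset(lba: u64) -> u64 {
    lba * BLOCK_SIZE
}

/// LBA of the block that contains byte `offset`.
fn offset_to_lba(offset: u64) -> u64 {
    offset / BLOCK_SIZE
}
```

For example, LBA 1 maps to offset 0x200 and LBA 34 maps to offset 0x4400.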

BRM42220004 # mdb /dev/rdsk/c10t0014EE81000BC440d0p0
> ::load disk_label

LBA 0:
> 0::dump -f -l 0x200
      \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
000:  00000000 00000000 00000000 00000000  ................
010:  00000000 00000000 00000000 00000000  ................
020:  00000000 00000000 00000000 00000000  ................
030:  00000000 00000000 00000000 00000000  ................
040:  00000000 00000000 00000000 00000000  ................
050:  00000000 00000000 00000000 00000000  ................
060:  00000000 00000000 00000000 00000000  ................
070:  00000000 00000000 00000000 00000000  ................
080:  00000000 00000000 00000000 00000000  ................
090:  00000000 00000000 00000000 00000000  ................
0a0:  00000000 00000000 00000000 00000000  ................
0b0:  00000000 00000000 00000000 00000000  ................
0c0:  00000000 00000000 00000000 00000000  ................
0d0:  00000000 00000000 00000000 00000000  ................
0e0:  00000000 00000000 00000000 00000000  ................
0f0:  00000000 00000000 00000000 00000000  ................
100:  00000000 00000000 00000000 00000000  ................
110:  00000000 00000000 00000000 00000000  ................
120:  00000000 00000000 00000000 00000000  ................
130:  00000000 00000000 00000000 00000000  ................
140:  00000000 00000000 00000000 00000000  ................
150:  00000000 00000000 00000000 00000000  ................
160:  00000000 00000000 00000000 00000000  ................
170:  00000000 00000000 00000000 00000000  ................
180:  00000000 00000000 00000000 00000000  ................
190:  00000000 00000000 00000000 00000000  ................
1a0:  00000000 00000000 00000000 00000000  ................
1b0:  00000000 00000000 00000000 00000000  ................
1c0:  0200eeff ffff0100 0000ffff ffff0000  ................
1d0:  00000000 00000000 00000000 00000000  ................
1e0:  00000000 00000000 00000000 00000000  ................
1f0:  00000000 00000000 00000000 000055aa  ..............U.

LBA 1: The GPT table itself
> 0x200::dump -f -l 0x200
      \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
200:  45464920 50415254 00000100 5c000000  EFI PART....\...
210:  f7ebd82b 00000000 01000000 00000000  ...+............
220:  af429a74 01000000 22000000 00000000  .B.t....".......
230:  8e429a74 01000000 6601dc84 5899e48f  .B.t....f...X...
240:  877ba768 969d4f59 02000000 00000000  .{.h..OY........
250:  00000000 80000000 00000000 00000000  ................
260:  00000000 00000000 00000000 00000000  ................
270:  00000000 00000000 00000000 00000000  ................
280:  00000000 00000000 00000000 00000000  ................
290:  00000000 00000000 00000000 00000000  ................
2a0:  00000000 00000000 00000000 00000000  ................
2b0:  00000000 00000000 00000000 00000000  ................
2c0:  00000000 00000000 00000000 00000000  ................
2d0:  00000000 00000000 00000000 00000000  ................
2e0:  00000000 00000000 00000000 00000000  ................
2f0:  00000000 00000000 00000000 00000000  ................
300:  00000000 00000000 00000000 00000000  ................
310:  00000000 00000000 00000000 00000000  ................
320:  00000000 00000000 00000000 00000000  ................
330:  00000000 00000000 00000000 00000000  ................
340:  00000000 00000000 00000000 00000000  ................
350:  00000000 00000000 00000000 00000000  ................
360:  00000000 00000000 00000000 00000000  ................
370:  00000000 00000000 00000000 00000000  ................
380:  00000000 00000000 00000000 00000000  ................
390:  00000000 00000000 00000000 00000000  ................
3a0:  00000000 00000000 00000000 00000000  ................
3b0:  00000000 00000000 00000000 00000000  ................
3c0:  00000000 00000000 00000000 00000000  ................
3d0:  00000000 00000000 00000000 00000000  ................
3e0:  00000000 00000000 00000000 00000000  ................
3f0:  00000000 00000000 00000000 00000000  ................

LBA 34, which should be the first usable LBA:
> 0x4400::dump -f -l 0x200
       \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
4400:  00000000 00000000 00000000 00000000  ................
4410:  00000000 00000000 00000000 00000000  ................
4420:  00000000 00000000 00000000 00000000  ................
4430:  00000000 00000000 00000000 00000000  ................
4440:  00000000 00000000 00000000 00000000  ................
4450:  00000000 00000000 00000000 00000000  ................
4460:  00000000 00000000 00000000 00000000  ................
4470:  00000000 00000000 00000000 00000000  ................
4480:  00000000 00000000 00000000 00000000  ................
4490:  00000000 00000000 00000000 00000000  ................
44a0:  00000000 00000000 00000000 00000000  ................
44b0:  00000000 00000000 00000000 00000000  ................
44c0:  00000000 00000000 00000000 00000000  ................
44d0:  00000000 00000000 00000000 00000000  ................
44e0:  00000000 00000000 00000000 00000000  ................
44f0:  00000000 00000000 00000000 00000000  ................
4500:  00000000 00000000 00000000 00000000  ................
4510:  00000000 00000000 00000000 00000000  ................
4520:  00000000 00000000 00000000 00000000  ................
4530:  00000000 00000000 00000000 00000000  ................
4540:  00000000 00000000 00000000 00000000  ................
4550:  00000000 00000000 00000000 00000000  ................
4560:  00000000 00000000 00000000 00000000  ................
4570:  00000000 00000000 00000000 00000000  ................
4580:  00000000 00000000 00000000 00000000  ................
4590:  00000000 00000000 00000000 00000000  ................
45a0:  00000000 00000000 00000000 00000000  ................
45b0:  00000000 00000000 00000000 00000000  ................
45c0:  00000000 00000000 00000000 00000000  ................
45d0:  00000000 00000000 00000000 00000000  ................
45e0:  00000000 00000000 00000000 00000000  ................
45f0:  00000000 00000000 00000000 00000000  ................
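
The LBA 1 dump can also be decoded by hand to cross-check what `::gpt` printed. A rough sketch, assuming the standard little-endian GPT header layout from the UEFI specification; `GptHeader`, `parse_gpt_header`, and the helper names are illustrative, the bytes are transcribed from the 0x200 dump above, and the CRC fields are read but not verified:

```rust
/// First 92 bytes of LBA 1, transcribed from the dump at offset 0x200.
const LBA1_HEADER: [u8; 92] = [
    0x45, 0x46, 0x49, 0x20, 0x50, 0x41, 0x52, 0x54, 0x00, 0x00, 0x01, 0x00, 0x5c, 0x00, 0x00, 0x00,
    0xf7, 0xeb, 0xd8, 0x2b, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0xaf, 0x42, 0x9a, 0x74, 0x01, 0x00, 0x00, 0x00, 0x22, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x8e, 0x42, 0x9a, 0x74, 0x01, 0x00, 0x00, 0x00, 0x66, 0x01, 0xdc, 0x84, 0x58, 0x99, 0xe4, 0x8f,
    0x87, 0x7b, 0xa7, 0x68, 0x96, 0x9d, 0x4f, 0x59, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
];

/// Subset of the GPT header fields shown by `::gpt`.
#[derive(Debug, PartialEq)]
struct GptHeader {
    header_size: u32,
    header_crc32: u32,
    my_lba: u64,
    alternate_lba: u64,
    first_usable_lba: u64,
    last_usable_lba: u64,
    partition_entry_lba: u64,
    num_partition_entries: u32,
    size_of_partition_entry: u32,
}

/// Read a little-endian u32 at byte offset `off`.
fn le_u32(b: &[u8], off: usize) -> u32 {
    let mut a = [0u8; 4];
    a.copy_from_slice(&b[off..off + 4]);
    u32::from_le_bytes(a)
}

/// Read a little-endian u64 at byte offset `off`.
fn le_u64(b: &[u8], off: usize) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[off..off + 8]);
    u64::from_le_bytes(a)
}

/// Decode a GPT header; returns None if the buffer is too short or the
/// "EFI PART" signature is missing.
fn parse_gpt_header(b: &[u8]) -> Option<GptHeader> {
    if b.len() < 92 || !b.starts_with(b"EFI PART") {
        return None;
    }
    Some(GptHeader {
        header_size: le_u32(b, 12),
        header_crc32: le_u32(b, 16),
        my_lba: le_u64(b, 24),
        alternate_lba: le_u64(b, 32),
        first_usable_lba: le_u64(b, 40),
        last_usable_lba: le_u64(b, 48),
        partition_entry_lba: le_u64(b, 72),
        num_partition_entries: le_u32(b, 80),
        size_of_partition_entry: le_u32(b, 84),
    })
}
```

Decoding `LBA1_HEADER` this way gives zero partition entries and a first usable LBA of 34, in agreement with the `::gpt` output for this disk.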
smklein commented 6 months ago

Unless we have reason to believe otherwise from the control plane, the contents of the disk don't indicate that a partition was ever in use here. It's always possible we had something here and zeroed it out, but I'm not seeing ZFS headers or anything.

I'm not sure who formatted this GPT, but it could have been that way for a while?

Regardless, the case of "GPT exists, has zero partitions" is a case that we should handle.

smklein commented 6 months ago

> Regardless, the case of "GPT exists, has zero partitions" is a case that we should handle.

Specifically, by adding a single partition for the zpool via zpool create.

jclulow commented 6 months ago

I'm not sure I would expect anything at the first usable LBA FWIW. I'd do the same check on a disk that does have partitions and pools that work, and make sure you're not just lucking onto a region that is ordinarily zeroes.

smklein commented 6 months ago

> I'm not sure I would expect anything at the first usable LBA FWIW. I'd do the same check on a disk that does have partitions and pools that work, and make sure you're not just lucking onto a region that is ordinarily zeroes.

Good point. Here's what I'm seeing on another disk. Maybe I need to dig deeper: this also shows a zeroed first block, even though it claims to have a ZFS partition.

BRM42220004 # mdb /dev/rdsk/c3t0014EE81000BC4F1d0p0
> ::load disk_label
> ::gpt
Signature: EFI PART (valid)
Revision: 1.0
HeaderSize: 92 bytes
HeaderCRC32: 0x17b6bd4b (should be 0x17b6bd4b)
Reserved1: 0 (should be 0x0)
MyLBA: 1 (should be 1)
AlternateLBA: 6251233967
FirstUsableLBA: 34
LastUsableLBA: 6251233934
DiskGUID: 3c61bf3f-81e5-ec55-ec1c-ff1aa537e314
PartitionEntryLBA: 2
NumberOfPartitionEntries: 9
SizeOfPartitionEntry: 0x80 bytes
PartitionEntryArrayCRC32: 0xf1791746 (should be 0xf1791746)

PART TYPE                STARTLBA      ENDLBA        ATTR     NAME
0    EFI_USR             256           6251217550    0        zfs
1    EFI_UNUSED         
2    EFI_UNUSED         
3    EFI_UNUSED         
4    EFI_UNUSED         
5    EFI_UNUSED         
6    EFI_UNUSED         
7    EFI_UNUSED         
8    EFI_RESERVED        6251217551    6251233934    0 

This should be LBA 256?
> 0x20000::dump -f -l 0x200
        \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
20000:  00000000 00000000 00000000 00000000  ................
20010:  00000000 00000000 00000000 00000000  ................
20020:  00000000 00000000 00000000 00000000  ................
20030:  00000000 00000000 00000000 00000000  ................
20040:  00000000 00000000 00000000 00000000  ................
20050:  00000000 00000000 00000000 00000000  ................
20060:  00000000 00000000 00000000 00000000  ................
20070:  00000000 00000000 00000000 00000000  ................
20080:  00000000 00000000 00000000 00000000  ................
20090:  00000000 00000000 00000000 00000000  ................
200a0:  00000000 00000000 00000000 00000000  ................
200b0:  00000000 00000000 00000000 00000000  ................
200c0:  00000000 00000000 00000000 00000000  ................
200d0:  00000000 00000000 00000000 00000000  ................
200e0:  00000000 00000000 00000000 00000000  ................
200f0:  00000000 00000000 00000000 00000000  ................
20100:  00000000 00000000 00000000 00000000  ................
20110:  00000000 00000000 00000000 00000000  ................
20120:  00000000 00000000 00000000 00000000  ................
20130:  00000000 00000000 00000000 00000000  ................
20140:  00000000 00000000 00000000 00000000  ................
20150:  00000000 00000000 00000000 00000000  ................
20160:  00000000 00000000 00000000 00000000  ................
20170:  00000000 00000000 00000000 00000000  ................
20180:  00000000 00000000 00000000 00000000  ................
20190:  00000000 00000000 00000000 00000000  ................
201a0:  00000000 00000000 00000000 00000000  ................
201b0:  00000000 00000000 00000000 00000000  ................
201c0:  00000000 00000000 00000000 00000000  ................
201d0:  00000000 00000000 00000000 00000000  ................
201e0:  00000000 00000000 00000000 00000000  ................
201f0:  00000000 00000000 00000000 00000000  ................
smklein commented 6 months ago

Huh, on a "known good" disk, I am starting to see data around offset 0x23fd0, which is like LBA ~288? (or LBA 32 within the ZFS partition). This includes the zpool name, "meta slab" stuff, etc, and looks like a legit pool. Lemme check for that info on the misbehaving U.2s...

smklein commented 6 months ago

On /dev/rdsk/c10t0014EE81000BC440d0p0, aka /devices/pci@ab,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC440, this is still zeroed around LBA 288, which is where I started seeing zpool metadata.

I'm seeing something very odd on the other disk:

BRM42220004 # mdb /devices/pci@38,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC129,0:q,raw
> 
> 
> ::load disk_label
> 
> 
> ::gpt
Signature: EFI PART (valid)
Revision: 1.0
HeaderSize: 92 bytes
HeaderCRC32: 0xcbd22368 (should be 0xcbd22368)
Reserved1: 0 (should be 0x0)
MyLBA: 1 (should be 1)
AlternateLBA: 6251233967
FirstUsableLBA: 34
LastUsableLBA: 6251233934
DiskGUID: c9172905-e06f-4af9-b61a-ceb23c9add2a
PartitionEntryLBA: 2
NumberOfPartitionEntries: 1
SizeOfPartitionEntry: 0x80 bytes
PartitionEntryArrayCRC32: 0x8242bb87 (should be 0x8242bb87)

PART TYPE                STARTLBA      ENDLBA        ATTR     NAME
0    EFI_USR             256           255           0        

ENDLBA < STARTLBA? That seems weird.
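
As a sketch of what "weird" means here, a partition entry's range can be checked against the header's usable range. This assumes GPT's convention that both LBAs are inclusive; the function names are hypothetical:

```rust
/// True when a GPT entry's inclusive LBA range is internally consistent:
/// it does not run backwards and it lies inside the usable area.
fn entry_range_is_sane(start: u64, end: u64, first_usable: u64, last_usable: u64) -> bool {
    start <= end && start >= first_usable && end <= last_usable
}

/// Length of the entry in blocks, or None if the range runs backwards.
fn entry_len_blocks(start: u64, end: u64) -> Option<u64> {
    if start <= end {
        Some(end - start + 1)
    } else {
        None
    }
}
```

With the usable range from this header (34 to 6251233934), the entry 256..255 fails the check, while the zfs entry 256..6251217550 from the working disk passes.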

smklein commented 6 months ago

> On /dev/rdsk/c10t0014EE81000BC440d0p0, aka /devices/pci@ab,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC440, this is still zeroed around LBA 288, which is where I started seeing zpool metadata.
>
> I'm seeing something very odd on the other disk:
>
> ...

Also seeing nothing at ~LBA 288 and onwards (where we've seen zpool metadata on other valid disks). Just zeroes.

smklein commented 6 months ago

For anyone trying to manually check the zpool metadata:

$ mdb /dev/rdsk/<pick whatever disk you want>p0
> ::load disk_label
# With `::gpt`: You should see partitions 0 - 8, and partition 0 starts at LBA 256
> ::gpt
# This is LBA 288, which is 32 LBAs into the first partition. You should see zpool metadata.
> 0x24000::dump -l 0x200

On any normal U.2: I see this metadata. On these two: I don't.
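
The same manual check could be scripted. A minimal sketch, assuming a reader over the raw device (or an image of it) and 512-byte blocks; `block_is_zeroed` is a made-up helper, and a zeroed block is only a hint that no pool metadata is present, not proof:

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Read one block at `lba` and report whether every byte in it is zero.
fn block_is_zeroed<R: Read + Seek>(dev: &mut R, lba: u64, block_size: u64) -> io::Result<bool> {
    let mut buf = vec![0u8; block_size as usize];
    // Seek to the start of the requested block, then read exactly one block.
    dev.seek(SeekFrom::Start(lba * block_size))?;
    dev.read_exact(&mut buf)?;
    Ok(buf.iter().all(|&b| b == 0))
}
```

Calling it with LBA 288 corresponds to the `0x24000::dump -l 0x200` step above.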

smklein commented 6 months ago

To add one more recap here:

Fixing the "GPT but no partitions" case doesn't seem terrible -- we should be able to still zpool create there anyway.

Fixing the "GPT exists, AND has a partition, BUT it sucks" case seems more difficult. It seems much trickier to determine "is this partition actually, truly, genuinely unusable?" in a way that's completely safe and wouldn't accidentally destroy valid data. In this situation, the "end LBA" is smaller than the "start LBA", which looks "obviously bad" to me as a human, but that's a weird heuristic for a program to use, and it's certainly not the only way in which we could have a "partition zero" that looks invalid.
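
To make the two cases concrete, here is one possible (deliberately conservative) policy, sketched under the assumption that the input is the list of in-use, non-reserved GPT entries. `Entry`, `Action`, and `classify` are hypothetical names, and this is not a statement of what the rework will do:

```rust
/// One GPT partition entry, reduced to its inclusive LBA range.
#[derive(Debug, Clone, Copy)]
struct Entry {
    start_lba: u64,
    end_lba: u64,
}

#[derive(Debug, PartialEq)]
enum Action {
    /// GPT present but empty: there is no partition to lose.
    CreatePool,
    /// Exactly one entry whose range does not run backwards.
    UseExisting,
    /// Anything else: leave the disk alone and surface an error.
    NeedsOperator,
}

/// Decide what to do with a U.2 based only on its GPT entries.
fn classify(entries: &[Entry]) -> Action {
    match entries {
        [] => Action::CreatePool,
        [e] if e.start_lba <= e.end_lba => Action::UseExisting,
        _ => Action::NeedsOperator,
    }
}
```

The point of the third arm is that the program never has to decide whether a suspicious partition is "truly unusable"; it only refuses to guess.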

andrewjstone commented 6 months ago

Sean is doing a rework of the disk adoption process, and he's going to take this issue. He's already debugged it all anyway, so he may as well get full credit 🥇