
mystery failure on london BRM42220036 #1553

Open faithanalog opened 2 weeks ago

faithanalog commented 2 weeks ago

This issue happened twice, both times with the 4krandw workload. The first time we didn't get the logs we wanted because we accidentally kernel panic'd the system trying to recover them. Second time around, we got them. I haven't looked into them yet, just writing this up before logging off for sleep.

Notably, the VMs seem to keep writing (to where?), and thus have abnormally high IOPS. In actuality the disk has failed out from under them.

It is extremely suspicious that both times, all the failing VMs were on BRM42220036 in particular, and not on any of the other sleds.

logs: /staff/artemis/2024-11-08-londonBRM42220036-explosion/logs

Suspect VM UUIDs; look at these in that dir:

0f5c9139-4d66-4ae9-885f-08967d2c13ce
4e453ffc-d871-49b3-88fa-2a816aff9bde
0881cb23-b061-487c-900e-179e92386fe2
b9242f22-f8c5-4804-8c3d-fbbe3a73e5cb
07a66988-1402-41e1-959b-1d114129d568
9d50966a-f98c-481e-ab3f-55c83a12e702
e1ab4a1d-535c-44e1-b8f6-f9813ddfc7be
2f28e04c-a25b-47ae-9312-faede629f8ee
c52b3d20-1eb7-427f-bcc0-30979cf6605c
48b70269-9f20-4841-8599-81c0a2184bf3

extra data: /staff/artemis/2024-11-08-londonBRM42220036-explosion/measure-4krandw

A sampling of dmesg from one of the guest VMs:

[   12.396400] EDAC MC: Ver: 3.0.0
[   13.257310] IPv6: ADDRCONF(NETDEV_CHANGE): enp0s8: link becomes ready
[   15.005756] NET: Registered PF_PACKET protocol family
[ 6536.202312] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 6536.202323] clocksource:                       'acpi_pm' wd_nsec: 396359593 wd_now: 2a0773 wd_last: 146150 mask: ffffff
[ 6536.202326] clocksource:                       'tsc' cs_nsec: 5083341649 cs_now: be15eaea178 cs_last: bdf01dbbfdc mask: ffffffffffffffff
[ 6536.202328] clocksource:                       'tsc' is current clocksource.
[ 6536.202338] tsc: Marking TSC unstable due to clocksource watchdog
[ 6536.203745] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[ 6536.203753] sched_clock: Marking unstable (6536636056199, -432146598)<-(6536205179789, -1437751)
[ 6536.205910] clocksource: Not enough CPUs to check clocksource 'tsc'.
[ 6536.205930] clocksource: Switched to clocksource acpi_pm
[ 7026.601955] hrtimer: interrupt took 22633310 ns
[ 7893.772021] nvme nvme1: I/O 832 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772081] nvme nvme1: Abort status: 0x0
[ 7893.772109] nvme nvme1: I/O 838 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772155] nvme nvme1: Abort status: 0x0
[ 7893.772168] nvme nvme1: I/O 839 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772208] nvme nvme1: Abort status: 0x0
[ 7893.772221] nvme nvme1: I/O 840 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772253] nvme nvme1: Abort status: 0x0
[ 7893.772271] nvme nvme1: I/O 841 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772303] nvme nvme1: Abort status: 0x0
[ 7893.772316] nvme nvme1: I/O 842 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772348] nvme nvme1: Abort status: 0x0
[ 7893.772363] nvme nvme1: I/O 843 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772396] nvme nvme1: Abort status: 0x0
[ 7893.772416] nvme nvme1: I/O 844 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772453] nvme nvme1: Abort status: 0x0
[ 7893.772467] nvme nvme1: I/O 845 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772505] nvme nvme1: Abort status: 0x0
[ 7893.772518] nvme nvme1: I/O 846 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772550] nvme nvme1: Abort status: 0x0
[ 7893.772563] nvme nvme1: I/O 847 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772599] nvme nvme1: Abort status: 0x0
[ 7893.772618] nvme nvme1: I/O 848 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772650] nvme nvme1: Abort status: 0x0
[ 7893.772662] nvme nvme1: I/O 849 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772694] nvme nvme1: Abort status: 0x0
[ 7893.772707] nvme nvme1: I/O 850 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772739] nvme nvme1: Abort status: 0x0
[ 7893.772752] nvme nvme1: I/O 851 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772788] nvme nvme1: Abort status: 0x0
[ 7893.772801] nvme nvme1: I/O 853 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772834] nvme nvme1: Abort status: 0x0
[ 7893.772847] nvme nvme1: I/O 854 (I/O Cmd) QID 1 timeout, aborting
[ 7893.772930] nvme nvme1: Abort status: 0x0
[ 7893.774908] nvme nvme1: I/O 855 (I/O Cmd) QID 1 timeout, aborting
[ 7893.774954] nvme nvme1: Abort status: 0x0
[ 7893.774969] nvme nvme1: I/O 856 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775001] nvme nvme1: Abort status: 0x0
[ 7893.775013] nvme nvme1: I/O 857 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775044] nvme nvme1: Abort status: 0x0
[ 7893.775057] nvme nvme1: I/O 858 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775088] nvme nvme1: Abort status: 0x0
[ 7893.775105] nvme nvme1: I/O 859 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775141] nvme nvme1: Abort status: 0x0
[ 7893.775155] nvme nvme1: I/O 874 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775186] nvme nvme1: Abort status: 0x0
[ 7893.775198] nvme nvme1: I/O 875 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775229] nvme nvme1: Abort status: 0x0
[ 7893.775242] nvme nvme1: I/O 876 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775273] nvme nvme1: Abort status: 0x0
[ 7893.775286] nvme nvme1: I/O 877 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775319] nvme nvme1: Abort status: 0x0
[ 7893.775332] nvme nvme1: I/O 878 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775364] nvme nvme1: Abort status: 0x0
[ 7893.775376] nvme nvme1: I/O 879 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775407] nvme nvme1: Abort status: 0x0
[ 7893.775423] nvme nvme1: I/O 880 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775455] nvme nvme1: Abort status: 0x0
[ 7893.775467] nvme nvme1: I/O 881 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775498] nvme nvme1: Abort status: 0x0
[ 7893.775510] nvme nvme1: I/O 882 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775541] nvme nvme1: Abort status: 0x0
[ 7893.775553] nvme nvme1: I/O 883 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775583] nvme nvme1: Abort status: 0x0
[ 7893.775597] nvme nvme1: I/O 891 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775627] nvme nvme1: Abort status: 0x0
[ 7893.775640] nvme nvme1: I/O 892 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775671] nvme nvme1: Abort status: 0x0
[ 7893.775689] nvme nvme1: I/O 895 (I/O Cmd) QID 1 timeout, aborting
[ 7893.775719] nvme nvme1: Abort status: 0x0
[ 8068.581663] nvme1n1: I/O Cmd(0x2) @ LBA 0, 1 blocks, I/O Error (sct 0x0 / sc 0x4) 
[ 8068.582652] I/O error, dev nvme1n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 8068.582652] Buffer I/O error on dev nvme1n1, logical block 0, async page read
[ 8068.620609] nvme1n1: I/O Cmd(0x2) @ LBA 0, 1 blocks, I/O Error (sct 0x0 / sc 0x4) 
[ 8068.620889] I/O error, dev nvme1n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 8068.620889] Buffer I/O error on dev nvme1n1, logical block 0, async page read
[ 8068.624543] ldm_validate_partition_table(): Disk read failed.
[ 8068.625245] nvme1n1: I/O Cmd(0x2) @ LBA 0, 1 blocks, I/O Error (sct 0x0 / sc 0x4) 
[ 8068.625882] I/O error, dev nvme1n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 8068.625882] Buffer I/O error on dev nvme1n1, logical block 0, async page read
[ 8068.628915]  nvme1n1: unable to read partition table
[ 8068.673398] nvme1n1: I/O Cmd(0x2) @ LBA 0, 1 blocks, I/O Error (sct 0x0 / sc 0x4) 
[ 8068.674387] I/O error, dev nvme1n1, sector 0 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
[ 8068.677450] nvme1n1: I/O Cmd(0x2) @ LBA 0, 1 blocks, I/O Error (sct 0x0 / sc 0x4) 
[ 8068.678443] I/O error, dev nvme1n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 8068.678443] Buffer I/O error on dev nvme1n1, logical block 0, async page read
faithanalog commented 2 weeks ago

commits:

mkeeter commented 1 week ago

We investigated this on Friday and made some progress, but the root cause remains elusive.

There appears to be network weather of some kind; the first warning in the logs is of a (partial) timeout. I'm using interleave-bunyan to combine all logs, then paging through:

matt@atrium /staff/core/crucible-1553/logs $ ./interleave-bunyan */*propolis* | looker | less
04:09:22.394Z WARN propolis-server (vm_state_driver): timeout 1/3
     = downstairs
    client = 2
    session_id = 822e8ddc-9945-46db-bb38-421e833e1052
    source = 4e453ffc-d871-49b3-88fa-2a816aff9bde/system-illumos-propolis-server:default.log.1731039289
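
To get a rough sense of how many of these each instance logged, a per-file count is enough (a sketch; matching on 'timeout 1/3' only counts the first-stage warnings):

matt@atrium /staff/core/crucible-1553/logs $ grep -c 'timeout 1/3' */*propolis*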

Scattered partial timeouts continue, then we see our first disconnection at 4:13:04:

04:13:04.695Z WARN propolis-server (vm_state_driver): client task is sending Done(WriteFailed(IO Error: Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }))
     = downstairs
    client = 1
    session_id = 4eba309e-a613-4c90-9742-62afd5671c95
    source = d3f3e6fd-61de-444c-832e-45af8aa04741/system-illumos-propolis-server:default.log.1731039301
04:13:04.702Z WARN propolis-server (vm_state_driver): downstairs client error Connection reset by peer (os error 131)
     = downstairs
    client = 1
    session_id = 4eba309e-a613-4c90-9742-62afd5671c95
    source = d3f3e6fd-61de-444c-832e-45af8aa04741/system-illumos-propolis-server:default.log.1731039301

The write side fails with EPIPE (os error 32) and the read side with ECONNRESET (os error 131).

In this case, client 1 is at [fd00:1122:3344:101::e]:19001. I'm pretty sure this is a Downstairs running on the exact same sled, because I see 1122:3344:101 elsewhere in the logs:

04:01:46.351Z INFO propolis-server: listening
    local_addr = [fd00:1122:3344:101::1:1]:12400
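
To see which propolis logs mention addresses in that /64 (whether as their own local_addr or as a downstairs target), a quick grep over the same directory works; this is a sketch rather than a command from the original investigation:

matt@atrium /staff/core/crucible-1553/logs $ grep -l 'fd00:1122:3344:101' */*propolis*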

Things continue to fail at various rates, getting worse over time. Eventually, we manage to get into a state with all 3x Downstairs faulted, from which there is no recovery (#1556). There's some low-hanging fruit w.r.t. reducing logging in Crucible, but that's not the root cause of our problems.


Other things to be suspicious about:

faithanalog commented 1 week ago

We were running prstat sorted by RSS at the time.
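
That would have been an invocation along these lines (the exact interval is an assumption):

prstat -s rss 10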

On the problem scrimlet:

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 16769 root     4029M 3279M sleep   59    0   7:38:30 0.2% oximeter/131
  1484 root     2830M 1822M sleep   59    0   1:47:23 0.1% mgs/132

On the scrimlet which did not keel over:

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  1492 root     2869M 2736M sleep   59    0   1:45:04 0.1% mgs/132
(no oximeter)

I didn't catch a pmap on these at the time, but I suspect both mgs processes are actually using the same amount of RAM and the problem scrimlet has mgs swapping (which is why RSS and SIZE don't line up there), and you've got oximeter adding an extra 4 gigs of memory pressure.
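
A pmap comparison would confirm that next time; a minimal sketch, using the mgs PIDs from the prstat samples above:

# 1484 is mgs on the problem scrimlet, 1492 on the one that stayed healthy
pmap -x 1484 | tail -1
pmap -x 1492 | tail -1

The total line reports Kbytes vs RSS per process, so a large gap on the problem scrimlet would be consistent with mgs being partially paged out.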


Also on the topic of memory pressure, our propolis servers were climbing quite high in their userspace usage, getting into the hundreds of megs, or peaking toward the 1 gig limit in the version we were running. Unclear whether that's a root cause or a knock-on effect of something else. #1515 will bring our buffer length limit down to 50 MiB.

faithanalog commented 1 week ago

I added the memstat/vmstat/prstat/swapsh loop outputs to /staff/core/crucible-1553/retry-first-crucible-failure/BRMxxxxxxxx
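
For anyone reproducing that collection, a loop along these lines gathers the same stats (a sketch, not the literal script we ran; the interval is an assumption, and 'swapsh' is read here as swap -sh):

while true; do
    date -u                     # timestamp each sample
    echo ::memstat | mdb -k     # kernel memory breakdown (needs root)
    vmstat 1 2 | tail -1        # one current vmstat sample, skipping the since-boot line
    prstat -c -s rss 1 1        # single prstat report, sorted by RSS
    swap -sh                    # swap usage summary
    sleep 60
done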