We investigated this on Friday and made some progress, but the root cause remains elusive.
There appears to be network weather of some kind; the first warning in the logs is of a (partial) timeout. I'm using interleave-bunyan to combine all logs, then paging through:
matt@atrium /staff/core/crucible-1553/logs $ ./interleave-bunyan */*propolis* | looker | less
04:09:22.394Z WARN propolis-server (vm_state_driver): timeout 1/3
    = downstairs
    client = 2
    session_id = 822e8ddc-9945-46db-bb38-421e833e1052
    source = 4e453ffc-d871-49b3-88fa-2a816aff9bde/system-illumos-propolis-server:default.log.1731039289
Scattered partial timeouts continue, then we see our first disconnection at 4:13:04:
04:13:04.695Z WARN propolis-server (vm_state_driver): client task is sending Done(WriteFailed(IO Error: Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }))
    = downstairs
    client = 1
    session_id = 4eba309e-a613-4c90-9742-62afd5671c95
    source = d3f3e6fd-61de-444c-832e-45af8aa04741/system-illumos-propolis-server:default.log.1731039301
04:13:04.702Z WARN propolis-server (vm_state_driver): downstairs client error Connection reset by peer (os error 131)
    = downstairs
    client = 1
    session_id = 4eba309e-a613-4c90-9742-62afd5671c95
    source = d3f3e6fd-61de-444c-832e-45af8aa04741/system-illumos-propolis-server:default.log.1731039301
Both the write and read sides fail, with EPIPE and ECONNRESET respectively.
In this case, client 1 is at [fd00:1122:3344:101::e]:19001. I'm pretty sure this is a Downstairs running on the exact same sled, because I see 1122:3344:101 elsewhere in the logs:
04:01:46.351Z INFO propolis-server: listening
    local_addr = [fd00:1122:3344:101::1:1]:12400
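To double-check which zone actually owns [fd00:1122:3344:101::e]:19001 (a hedged suggestion: this assumes the downstairs logs live in the same tree and record their listen address at startup), grepping the combined log directory for the address should pin it to a specific sled/zone:

$ grep -rl 'fd00:1122:3344:101::e' /staff/core/crucible-1553/logs

Any crucible-downstairs log that reports that address as its local listen address would confirm it's colocated with this propolis.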
Things continue to fail at various rates, getting worse over time. Eventually we get into a state with all 3x Downstairs faulted, from which there is no recovery (#1556). There's some low-hanging fruit w.r.t. reducing log volume in Crucible, but that's not the root cause of our problems.
Other things to be suspicious about:
We were running prstat sorted by RSS at the time. On the problem scrimlet:
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 16769 root     4029M 3279M sleep   59    0   7:38:30 0.2% oximeter/131
  1484 root     2830M 1822M sleep   59    0   1:47:23 0.1% mgs/132
On the scrimlet which did not keel over:
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  1492 root     2869M 2736M sleep   59    0   1:45:04 0.1% mgs/132
(no oximeter)
I didn't catch a pmap on these at the time, but I suspect both mgs processes are actually using about the same amount of RAM, and that on the problem scrimlet mgs is swapping (which is why RSS and SIZE don't line up there). On top of that, oximeter adds an extra ~4 GiB of memory pressure.
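If this recurs, a sketch of what to capture (the PIDs here are just the ones from the prstat output above and would need re-checking) is extended pmap output plus kernel memory stats, which should show whether the mgs mappings are resident or paged out:

$ pmap -x 1484      # mgs on the problem scrimlet
$ pmap -x 16769     # oximeter
$ echo ::memstat | mdb -k

A big gap between the Kbytes and RSS columns for the mgs mappings, alongside a shrinking Free line in ::memstat, would line up with the swapping theory.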
Also on the topic of memory pressure, our propolis-servers were climbing quite high in their userspace usage, getting into the hundreds of megabytes, or peaking towards the 1 GiB limit built into the version we were running. Unclear whether that's a root cause or a knock-on effect from something else. #1515 will bring our buffer length limit down to 50MiB.
I added the memstat/vmstat/prstat/swap -sh loop outputs to /staff/core/crucible-1553/retry-first-crucible-failure/BRMxxxxxxxx
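For anyone reproducing this, a loop of roughly this shape (a hedged sketch, not the exact script we ran; the interval and output file name are illustrative) captures the same set of stats:

$ while :; do date; echo ::memstat | mdb -k; vmstat 1 5; prstat -s rss -c 1 1; swap -sh; sleep 60; done >> memory-pressure.log 2>&1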
This issue happened twice, both times with the 4krandw workload. The first time we didn't get the logs we wanted because we accidentally kernel-panicked the system trying to recover them. Second time around, we got them. I haven't looked into them yet; just writing this up before logging off for sleep.
Notably, the VMs seem to keep writing (to where?) and thus show abnormally high IOPS, when in actuality the disk has failed out from under them.
It is extremely suspicious that both times, all the failing VMs were on BRM42220036 in particular, and not any of the other sleds.
logs:
/staff/artemis/2024-11-08-londonBRM42220036-explosion/logs
suspect VM UUIDs, look at these in that dir:
extra data:
/staff/artemis/2024-11-08-londonBRM42220036-explosion/measure-4krandw
lrstate seems to indicate that 4e453ffc-d871-49b3-88fa-2a816aff9bde, but none of the others, has 266 ds_live_repair_aborted and none completed (lrstate has propolis UUIDs, gotta map them to the VM UUID). The rest have none aborted, none completed.

A sampling of dmesg from one of the guest VMs: