What caused so many failures in the 4k record size test on the London cluster

leftwo commented 5 months ago

@faithanalog ran a test on the Madrid cluster, the details of which are recorded here:

https://github.com/oxidecomputer/artemiscellaneous/tree/main/cru-perf/2024-05-24-4krecordsize-omicron_c2f51e25-crucible_92fe269-propolis_6d7ed9a

There was some trouble getting all 64 VMs to start, and there could be noise here in the logs from those errors vs. errors that happened under IO load.

The logs from propolis and crucible downstairs are here: /staff/artemis/cru-perf/2024-05-24-4krecordsize-omicron_c2f51e25-crucible_92fe269-propolis_6d7ed9a/logs.tar.gz

leftwo commented 5 months ago

A first pass search of all the logs I find: No panics.

Searching upstairs logs for why clients disconnected:

cat logs/*/*propolis* | grep "client task is sending Done" | looker |  grep "client task is sending" > client-done.txt

Then vim and chop off the front of the file, then:

cat client-done.txt | sort -n | uniq -c

Gives us this summary:

1325 Done(ConnectionTimeout)
  95 Done(Timeout)
  67 Done(RequestedStop(TooManyOutstandingJobs))
  58 Done(RequestedStop(IOError))
  40 Done(Finished)
  35 Done(ConnectionFailed(Os { code: 146, kind: ConnectionRefused, message: "Connection refused" }))
  12 Done(RequestedStop(FailedLiveRepair))
   7 Done(WriteFailed(IO Error: Os { code: 12, kind: OutOfMemory, message: "Not enough space" }))
   1 Done(WriteFailed(IO Error: Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }))

1325 Done(ConnectionTimeout) This is when the upstairs is trying to connect to a downstairs, but is not able to make the connection, so those are either never being able to connect, or after a connection has failed, trying to reconnect when the downstairs is not answering.

95 Done(Timeout) Happens when the downstairs fails to answer any message from the upstairs for 45 seconds.

67 Done(RequestedStop(TooManyOutstandingJobs)) Is when the upstairs has built up enough IO that it has to give up holding it for the downstairs as the downstairs is either not answering at all, or is not answering at a fast enough rate.

leftwo commented 5 months ago

Logs are extracted at /staff/core/crucible-1334/4k-recordsize-debug

mkeeter commented 3 months ago

After doing some research (for RFD 490), here's what I found: when doing writes, illumos ZFS splits into one transaction per recordsize. This means that going from 128 KiB → 4 KiB recordsize, we end up with 32× more transactions, which (presumably) adds a bunch of overhead.

OpenZFS has a patch for this: https://github.com/openzfs/zfs/commit/b0cbc1aa9

We should consider backporting it and then testing small recordsizes again.

oxidecomputer / crucible

What caused so many failures in the 4k record size test on the London cluster #1334