Open leftwo opened 5 months ago
A first-pass search of all the logs finds no panics.
Searching upstairs logs for why clients disconnected:
cat logs/*/*propolis* | grep "client task is sending Done" | looker | grep "client task is sending" > client-done.txt
Then edit the file in vim to chop off the front, and finally:
cat client-done.txt | sort -n | uniq -c
Gives us this summary:
1325 Done(ConnectionTimeout)
95 Done(Timeout)
67 Done(RequestedStop(TooManyOutstandingJobs))
58 Done(RequestedStop(IOError))
40 Done(Finished)
35 Done(ConnectionFailed(Os { code: 146, kind: ConnectionRefused, message: "Connection refused" }))
12 Done(RequestedStop(FailedLiveRepair))
7 Done(WriteFailed(IO Error: Os { code: 12, kind: OutOfMemory, message: "Not enough space" }))
1 Done(WriteFailed(IO Error: Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }))
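The extract-and-count steps above can be sketched end to end on a tiny made-up sample (the log lines below are invented, and the prefix format is an assumption; `grep -o` stands in for the manual vim edit that trims each line down to just the reason):

```shell
# Build a tiny fake log to demonstrate the counting (lines are invented).
cat > sample.log <<'EOF'
00:01 [client] client task is sending Done(ConnectionTimeout)
00:02 [client] client task is sending Done(Timeout)
00:03 [client] client task is sending Done(ConnectionTimeout)
EOF

# Pull out just the Done(...) reason, then count occurrences of each.
grep -o 'Done(.*)' sample.log | sort | uniq -c | sort -rn
```

On a sample like this it prints a count next to each distinct reason, most frequent first, matching the shape of the summary above.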
1325 Done(ConnectionTimeout)
This happens when the upstairs tries to connect to a downstairs but cannot make the
connection: either it was never able to connect in the first place, or a
connection failed and the reconnect attempts go unanswered by the downstairs.
95 Done(Timeout)
This happens when the downstairs fails to answer any message from the upstairs for 45 seconds.
67 Done(RequestedStop(TooManyOutstandingJobs))
This happens when the upstairs has queued up so much IO that it gives up holding it for the downstairs,
because the downstairs is either not answering at all, or not answering at a fast enough rate.
Logs are extracted at /staff/core/crucible-1334/4k-recordsize-debug
After doing some research (for RFD 490), here's what I found: when doing writes, illumos ZFS splits each write into one transaction per recordsize block. This means that going from a 128 KiB to a 4 KiB recordsize, we end up with 32× as many transactions, which (presumably) adds a lot of overhead.
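A quick back-of-the-envelope check of that 32× figure, under the assumed model of one transaction per recordsize-sized block of a write (the 4 MiB write size is just an illustrative example):

```shell
# Assumed model: one ZFS transaction per recordsize block of a write.
write_size=$((4 * 1024 * 1024))   # example: a single 4 MiB write
for rs in $((128 * 1024)) $((4 * 1024)); do
  echo "recordsize=${rs}B -> $((write_size / rs)) transactions"
done
# The ratio is independent of the write size:
echo "ratio: $(( (128 * 1024) / (4 * 1024) ))x"
```

For the 4 MiB example this gives 32 transactions at 128 KiB versus 1024 at 4 KiB, i.e. the 32× blow-up regardless of write size.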
OpenZFS has a patch for this: https://github.com/openzfs/zfs/commit/b0cbc1aa9
We should consider backporting it and then testing small recordsizes again.
@faithanalog ran a test on the Madrid cluster, the details of which are recorded here:
https://github.com/oxidecomputer/artemiscellaneous/tree/main/cru-perf/2024-05-24-4krecordsize-omicron_c2f51e25-crucible_92fe269-propolis_6d7ed9a
There was some trouble getting all 64 VMs to start, so the logs may contain noise from those startup errors in addition to errors that happened under IO load.
The logs from propolis and crucible downstairs are here: /staff/artemis/cru-perf/2024-05-24-4krecordsize-omicron_c2f51e25-crucible_92fe269-propolis_6d7ed9a/logs.tar.gz