mkeeter opened 3 months ago
(could be related to #1334)
EVT22200004 # sed -i '/^\*/d; /^$/d' /etc/system
EVT22200004 # echo set zfs:zfs_sync_taskq_batch_pct = 25 >> /etc/system
EVT22200004 # /usr/platform/oxide/bin/ipcc keyset -c system /etc/system
Success
... reboot ...
> ::walk taskq_cache | ::printf "%x %s\n" taskq_t . tq_name ! grep dp_sync_taskq
fffffcfa1932fa68 dp_sync_taskq
> fffffcfa1932fa68::print taskq_t tq_nthreads
tq_nthreads = 0x20
Nice, this method works.
Reducing from 75% to 25% didn't really seem to do anything; compare to this run, which is identical except for this change.
This talk covers optimizing ZFS for NVMe drives, and has a few relevant points. At roughly 27:03, the speaker talks about disabling aggregation, because we're far from saturating the drives.
Looking at one of the flamegraphs, this seems relevant: we spent a bunch of time in vdev_io_start and vdev_io_done waiting for the vdev lock, and the actual work done while holding that lock is primarily vdev_queue_aggregate (in vdev_io_to_issue):
Aggregation is controlled by the zfs_vdev_aggregation_limit tunable, which we could change at runtime (or using a config fragment like before). By default, it's 1 MiB (1 << 20).
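For a quick runtime experiment, an mdb session along these lines should do it; a sketch, assuming the tunable is still a 32-bit int on our build (/D prints the current value as decimal, /W writes a 4-byte value, 0t131072 is decimal for 128 KiB, and $q quits):

EVT22200004 # mdb -kw
> zfs_vdev_aggregation_limit/D
> zfs_vdev_aggregation_limit/W 0t131072
> $q

This doesn't persist across a reboot; for that, we'd use the same /etc/system fragment mechanism as above.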
OpenZFS introduced zfs_vdev_aggregation_limit_non_rotating (defaulting to 128 KiB) back in 2019 to address this issue: https://github.com/openzfs/zfs/commit/1af240f3b51c080376bb6ae1efc13d62b087b65d
Back on the mutex front: looking at this flamegraph (from the run with reduced dp_sync_taskq), a bunch of lock time is in spa_taskq_dispatch_ent, which dispatches to the SPA task queue (e.g. zio_write_issue, not dp_sync_taskq).
The number of threads in the SPA task queue is controlled separately by zio_taskq_batch_pct (not zfs_sync_taskq_batch_pct), so we should run a test with both of those tuneables turned down.
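Concretely, that test's config fragment could look like the following, reusing the ipcc mechanism from the top of this thread (the 25 values are illustrative, and I'm assuming zio_taskq_batch_pct lives in the zfs module like the sync tunable does):

EVT22200004 # echo 'set zfs:zfs_sync_taskq_batch_pct = 25' >> /etc/system
EVT22200004 # echo 'set zfs:zio_taskq_batch_pct = 25' >> /etc/system
EVT22200004 # /usr/platform/oxide/bin/ipcc keyset -c system /etc/system
... reboot ...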
Watching the talk, we get this interesting slide, accompanied by this table (both elided).
In illumos we have a slightly different table:
const zio_taskq_info_t zio_taskqs[ZIO_TYPES][ZIO_TASKQ_TYPES] = {
/* ISSUE ISSUE_HIGH INTR INTR_HIGH */
{ ZTI_ONE, ZTI_NULL, ZTI_ONE, ZTI_NULL }, /* NULL */
{ ZTI_N(8), ZTI_NULL, ZTI_P(12, 8), ZTI_NULL }, /* READ */
{ ZTI_BATCH, ZTI_N(5), ZTI_N(8), ZTI_N(5) }, /* WRITE */
{ ZTI_P(12, 8), ZTI_NULL, ZTI_ONE, ZTI_NULL }, /* FREE */
{ ZTI_ONE, ZTI_NULL, ZTI_ONE, ZTI_NULL }, /* CLAIM */
{ ZTI_ONE, ZTI_NULL, ZTI_ONE, ZTI_NULL }, /* IOCTL */
{ ZTI_N(4), ZTI_NULL, ZTI_ONE, ZTI_NULL }, /* TRIM */
};
They were discussing the ZTI_SCALE mode they added, which scales the number of processes up with CPU cores instead of scaling up threads within a single process. We are using ZTI_BATCH for async writes, with a single process and a thread count that scales up. We don't have ZTI_SCALE in illumos, but we could use ZTI_P to experiment with the performance impact of this approach, since it can achieve the same thing with fixed values.
Especially of note to me: synchronous writes use the ISSUE_HIGH queue, with precisely one process and 5 threads, regardless of tuneables. They switched this to ZTI_SCALE and saw a big benefit from that.
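If we want to sanity-check the current sizing on a live system, the same taskq_cache walk from earlier in this thread should show it; a sketch (the zio_write_issue_high taskq name and the <addr> placeholder are assumptions to verify):

> ::walk taskq_cache | ::printf "%x %s\n" taskq_t . tq_name ! grep zio_write_issue
> <addr>::print taskq_t tq_nthreads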
See also https://github.com/oxidecomputer/crucible/issues/1358 - what does having fewer syncs do for us?
Ry has a PR out to make it easier to build TUF repos with a custom helios - https://github.com/oxidecomputer/omicron/pull/6126 - I'm going to try to get this in and working on my own machine, and then I can experiment with tuning these.
Flamegraphs from iodriver 4K random writes show significant amounts of time being spent in kernel mutex spinlocks, e.g. (interactive, originally from this run):
Lockstat shows two stacks (on the same lock) being responsible for 63% of total lock time:
(histograms elided)
We can use mdb to find which taskq is at fault here. (The system has been rebooted since the above capture, so I ran lockstat -W -Ch sleep 5 to get the current address.) This taskq was created with zfs_sync_taskq_batch_pct (75 by default; creation code elided), i.e. creating 0.75 * NUM_CPUS threads. We can confirm in mdb (output elided). This is probably Too Many Threads, falling into the same category as illumos#16202 (i.e. defaults that seem reasonable for normal computers, but not for a Big Computer).
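For reference, the confirmation is the same taskq_cache walk that appears earlier in this thread; a sketch (the printed address is machine-specific, shown here as a placeholder):

> ::walk taskq_cache | ::printf "%x %s\n" taskq_t . tq_name ! grep dp_sync_taskq
<addr> dp_sync_taskq
> <addr>::print taskq_t tq_nthreads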
OpenZFS has changed this behavior: https://github.com/openzfs/zfs/commit/3bd4df3841529316e5145590cc67076467b6abb7
We can locally tweak this behavior using the SP to hold a kernel configuration fragment, i.e. the ipcc keyset procedure shown at the top of this thread (then rebooting the Gimlet).
We can change it persistently by editing gimlet-system-zfs:dbuf.
@faithanalog I think it would be interesting to do an iodriver run with a reduced zfs_sync_taskq_batch_pct and see what happens!