oxidecomputer / crucible

A storage service.

Lock contention during `iodriver` runs #1337

Open mkeeter opened 3 months ago

mkeeter commented 3 months ago

Flamegraphs from iodriver 4K random writes show significant amounts of time being spent in kernel mutex spinlocks, e.g.

flamegraph-4krandw-londonBRM42220030-500ms-stack-ustack_16KiB (interactive, originally from this run)

Lockstat shows two stacks (on the same lock) being responsible for 63% of total lock time:

Count indv cuml rcnt     nsec Hottest Lock           Caller                  
902020  35%  35% 0.00   191379 0xfffffcfa2064cca0     taskq_thread+0x2cd      
-------------------------------------------------------------------------------
Count indv cuml rcnt     nsec Hottest Lock           Caller                  
789689  30%  65% 0.00   197303 0xfffffcfa2064cca0     cv_wait+0x78            

(histograms elided)

We can use mdb to find which taskq is at fault here. (The system has been rebooted since the above capture, so I ran `lockstat -W -Ch sleep 5` to get the current lock address.)

BRM42220030 # head -n10 lockstat.txt

Adaptive mutex spin: 21566 events in 5.019 seconds (4297 events/sec)

-------------------------------------------------------------------------------
Count indv cuml rcnt     nsec Hottest Lock           Caller
 6838  32%  32% 0.00   197362 0xfffffd02ce540be0     cv_wait+0x78

      nsec ------ Time Distribution ------ count     Stack
       256 |                               9         taskq_thread_wait+0xbe
       512 |                               112       taskq_thread+0x334
BRM42220030 # mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic apix cpc crypto mm random smbios zfs sata ip hook neti sockfs lofs vmm scsi_vhci arp ufs ipcc logindmux nsmb ptm ]
> 0xfffffd02ce540be0::whatis
fffffd02ce540be0 is fffffd02ce540bc0+20, allocated from taskq_cache
> fffffd02ce540bc0::print taskq_t tq_name
tq_name = [ "dp_sync_taskq" ]

This taskq was created with

/*
 * This determines the number of threads used by the dp_sync_taskq.
 */
int zfs_sync_taskq_batch_pct = 75;

// ...

    dp->dp_sync_taskq = taskq_create("dp_sync_taskq",
        zfs_sync_taskq_batch_pct, minclsyspri, 1, INT_MAX,
        TASKQ_THREADS_CPU_PCT);

i.e. creating 0.75 * NUM_CPUS threads. We can confirm in mdb:

> fffffd02ce540bc0::print taskq_t tq_nthreads
tq_nthreads = 0x60

That works out to 96 threads (0x60), i.e. 75% of the Gimlet's 128 logical CPUs. This is probably Too Many Threads, falling into the same category as illumos#16202 (i.e. defaults that seem reasonable for normal computers, but not for a Big Computer).

OpenZFS has changed this behavior: https://github.com/openzfs/zfs/commit/3bd4df3841529316e5145590cc67076467b6abb7

We can locally tweak this behavior using the SP to hold a kernel configuration fragment, i.e.

sed -i '/^\*/d; /^$/d' /etc/system
echo set zfs:zfs_sync_taskq_batch_pct = 50 >> /etc/system
/usr/platform/oxide/bin/ipcc keyset -c system /etc/system

(then rebooting the Gimlet)

We can change it persistently by editing gimlet-system-zfs:dbuf

@faithanalog I think it would be interesting to do an iodriver run with a reduced zfs_sync_taskq_batch_pct and see what happens!

mkeeter commented 3 months ago

(could be related to #1334)

faithanalog commented 2 months ago

EVT22200004 # sed -i '/^\*/d; /^$/d' /etc/system
EVT22200004 # echo set zfs:zfs_sync_taskq_batch_pct = 25 >> /etc/system
EVT22200004 # /usr/platform/oxide/bin/ipcc keyset -c system /etc/system
Success

... reboot ...

> ::walk taskq_cache | ::printf "%x %s\n" taskq_t . tq_name ! grep dp_sync_taskq
fffffcfa1932fa68 dp_sync_taskq
> fffffcfa1932fa68::print taskq_t tq_nthreads
tq_nthreads = 0x20

nice this method works.

faithanalog commented 2 months ago

did a test a couple weeks ago

reducing from 75% to 25% didn't really seem to do anything. compare to this run, which is identical except for this change

mkeeter commented 1 month ago

This talk covers optimizing ZFS for NVMe drives, and has a few relevant points. At roughly 27:03, the speaker talks about disabling aggregation, because we're far from saturating the drives.

Looking at one of the flamegraphs, this seems relevant: we spent a bunch of time in vdev_io_start and vdev_io_done waiting for the vdev lock, and the actual work done while holding that lock is primarily vdev_queue_aggregate (in vdev_io_to_issue):

(screenshot of the flamegraph, 2024-07-10)

Aggregation is controlled by the zfs_vdev_aggregation_limit tunable, which we could change at runtime (or using a config fragment like before). By default, it's 1 MiB (1 << 20).
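
Something like this sketch would cover either approach (untested; the 128 KiB value is just borrowed from OpenZFS's non-rotating default, mentioned below):

# poke the tunable at runtime (0x20000 bytes = 128 KiB)
echo 'zfs_vdev_aggregation_limit/W 0x20000' | mdb -kw

# or persist it with the same /etc/system fragment trick as above
sed -i '/^\*/d; /^$/d' /etc/system
echo set zfs:zfs_vdev_aggregation_limit = 0x20000 >> /etc/system
/usr/platform/oxide/bin/ipcc keyset -c system /etc/system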

OpenZFS introduced zfs_vdev_aggregation_limit_non_rotating (defaulted to 128 KiB) back in 2019 to address this issue:

https://github.com/openzfs/zfs/commit/1af240f3b51c080376bb6ae1efc13d62b087b65d

mkeeter commented 1 month ago

Back on the mutex front: looking at this flamegraph (from the run with reduced dp_sync_taskq), a bunch of lock time is in spa_taskq_dispatch_ent, which dispatches to the SPA task queue (e.g. zio_write_issue, not dp_sync_taskq).

The number of threads in the SPA task queue is controlled separately by zio_taskq_batch_pct (not zfs_sync_taskq_batch_pct), so we should run a test with both of those tuneables turned down.
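
As a sketch, the combined /etc/system fragment could look like this (the 25% values are placeholders for whatever we decide to test), applied via ipcc as before:

sed -i '/^\*/d; /^$/d' /etc/system
echo set zfs:zfs_sync_taskq_batch_pct = 25 >> /etc/system
echo set zfs:zio_taskq_batch_pct = 25 >> /etc/system
/usr/platform/oxide/bin/ipcc keyset -c system /etc/system

(then reboot and confirm the new thread counts with the ::walk taskq_cache trick above)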

faithanalog commented 1 month ago

watching the talk, we get this interesting slide

(screenshot: taskq slide from the talk)

Accompanied with this table

(screenshot: taskq tuning table from the talk)

in illumos we have a slightly different table:

const zio_taskq_info_t zio_taskqs[ZIO_TYPES][ZIO_TASKQ_TYPES] = {
    /* ISSUE          ISSUE_HIGH    INTR          INTR_HIGH */
    { ZTI_ONE,        ZTI_NULL,     ZTI_ONE,      ZTI_NULL }, /* NULL */
    { ZTI_N(8),       ZTI_NULL,     ZTI_P(12, 8), ZTI_NULL }, /* READ */
    { ZTI_BATCH,      ZTI_N(5),     ZTI_N(8),     ZTI_N(5) }, /* WRITE */
    { ZTI_P(12, 8),   ZTI_NULL,     ZTI_ONE,      ZTI_NULL }, /* FREE */
    { ZTI_ONE,        ZTI_NULL,     ZTI_ONE,      ZTI_NULL }, /* CLAIM */
    { ZTI_ONE,        ZTI_NULL,     ZTI_ONE,      ZTI_NULL }, /* IOCTL */
    { ZTI_N(4),       ZTI_NULL,     ZTI_ONE,      ZTI_NULL }, /* TRIM */
};

They were discussing the ZTI_SCALE mode they added, which we do not have in illumos: it scales the number of taskqs up with CPU cores, rather than scaling up the thread count of a single taskq. We are using ZTI_BATCH for async writes, i.e. a single taskq whose thread count scales with CPUs. We could use ZTI_P to experiment with the performance impact of doing something similar, since it can achieve much the same thing with fixed values (sketched below).

Especially of note to me: synchronous writes use the ISSUE_HIGH queue, which is precisely one taskq with 5 threads (ZTI_N(5)), regardless of tuneables. They switched this to ZTI_SCALE and saw a big benefit from that.
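
As a rough sketch of that experiment (the numbers are illustrative, not anything we've measured), the WRITE row above could swap ZTI_BATCH and ZTI_N(5) for ZTI_P, spreading both write issue queues across several fixed-size taskqs on a 128-CPU Gimlet:

    /*
     * hypothetical: 8 issue taskqs of 16 threads and 8 high-priority issue
     * taskqs of 5 threads, instead of one batch taskq plus ZTI_N(5)
     */
    { ZTI_P(16, 8),   ZTI_P(5, 8),  ZTI_N(8),     ZTI_N(5) }, /* WRITE */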

see also, https://github.com/oxidecomputer/crucible/issues/1358 - what does having fewer syncs do for us?

Ry has a PR out to make it easier to build TUF repos with a custom helios - https://github.com/oxidecomputer/omicron/pull/6126 - I'm going to try to get this in and working on my own machine, and then I can experiment with tuning these.