openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Add SPL module parameter to taskset ZFS threads to a predetermined cpumask or cgroup #15142

Open pebcakit opened 1 year ago

pebcakit commented 1 year ago

Describe the feature you would like to see added to OpenZFS

OpenZFS should support spawning its kthreads within a given cpumask or cgroup.

This would help reduce core contention in certain workloads, especially latency-sensitive ones.

Currently, `spl_taskq_thread_bind` and `spl_taskq_thread_dynamic` can be used to limit the number and spawning frequency of ZFS threads, and to prevent them from migrating between CPUs by binding them to a fixed set of CPUs.

However, there is no parameter to specify a predetermined cpumask on which to run these threads; instead, the cores are chosen from all CPUs available on the system.

OpenZFS should consider adding a parameter that allows binding to a cpumask.
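For illustration only, such a parameter could be set through modprobe configuration, like the existing SPL tunables. Note that `spl_taskq_thread_cpumask` below is a hypothetical name for the requested parameter, not something that exists today:

```
# Hypothetical /etc/modprobe.d/spl.conf -- spl_taskq_thread_bind exists
# today, but spl_taskq_thread_cpumask is only an illustrative name for
# the feature requested in this issue (here: confine ZFS to CPUs 8-15).
options spl spl_taskq_thread_bind=1 spl_taskq_thread_cpumask=0xFF00
```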

How will this feature improve OpenZFS?

This feature would extend the module's existing CPU tuning capabilities, allowing much finer control so that OpenZFS could be confined to a subset of cores in low-latency workloads with high CPU contention.

In my specific case, running the exact same workload on XFS reduced the application latency by ~50% and removed a large part of the jitter caused by CPU contention.

Addressing this feature would allow OpenZFS to be used more extensively in low-latency environments, just by moving/isolating its load elsewhere.

Additional context

By manually running `taskset` on `$(pgrep -f 'z_|dp_')`, I've seen great latency benefits for isolated threads that must wait on other threads that cannot be isolated and that context-switch very often because of ZFS.
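The manual workaround above can be scripted. A minimal sketch (run as root), assuming CPUs 8-15 are the cores set aside for ZFS and that the `z_|dp_` pattern matches the relevant kthreads on your system:

```shell
#!/bin/sh
# Pin every ZFS worker thread (names matching z_* or dp_*) onto CPUs 8-15,
# keeping the remaining cores free for latency-sensitive application threads.
# The CPU range and name pattern are examples; adjust both for your system.
for pid in $(pgrep -f 'z_|dp_'); do
    taskset -pc 8-15 "$pid" || echo "failed to pin PID $pid" >&2
done
```

Note this only affects already-running threads; dynamically spawned taskq threads (see `spl_taskq_thread_dynamic`) would need re-pinning, which is part of why a module parameter is preferable.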

By measuring with `perf record -e sched:sched_switch`, it is very easy to spot, on a busy server, the number of TASK_RUNNING tasks that get de-scheduled in favor of, for example, `z_wr_int`. All this accounts for a high nvcswch/s, visible with `pidstat -w`, leading to increased latency.

rincebrain commented 1 year ago

You'd probably want more granularity, like specifying the IO threads specifically, but yes, this is something I've wanted for a while.

snajpa commented 1 year ago

oh this would be brilliant. Our intended use case would be to assign the zio taskqs to NUMA nodes and let the threads live their separate lives. We have solved it for now with this patch, which does the NUMA binding on its own, but it requires GPL-only symbols (which we have patched to not be GPL-only in our kernel build).

EDIT: though to make it usable that way, the SPL module would have to take a parameter with the number of taskqs, ideally the number of threads in every taskq, and then, most importantly, a bind mask for every such taskq (I vaguely remember considering that, but I bailed out and patched the GPL-only symbols instead to make my life easier; no parsing needed that way :D)
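The per-taskq parameterization described here might look something like the following modprobe fragment. Every parameter name and the list syntax are purely hypothetical; this only sketches the shape of the interface, with two taskqs of 8 threads each, one pinned to each NUMA node's CPU mask:

```
# Hypothetical /etc/modprobe.d/spl.conf -- none of these parameters exist;
# this is only an illustration of the interface described above.
options spl spl_taskq_count=2 spl_taskq_nthreads=8,8 spl_taskq_cpumask=0x00FF,0xFF00
```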

lowjoel commented 10 months ago

I tried to figure out how to add this, and digging through the source it seems that https://github.com/openzfs/zfs/blob/master/module/os/linux/spl/spl-taskq.c#L1059 uses `kthread_bind`, but the associated `kthread_bind_mask` is not exported. So it seems the only way around this is to convert the CPU mask into a set of CPUs and then assign one kthread per permitted CPU.
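The per-CPU fan-out described above could look roughly like this kernel-side sketch (untested and illustrative only; `spawn_bound_threads` is a made-up helper name, not SPL code): iterate the permitted mask and bind one thread per CPU using the exported `kthread_bind()`:

```c
/*
 * Sketch (untested): since kthread_bind_mask() is not an exported symbol,
 * approximate a cpumask bind by spawning one thread per permitted CPU and
 * binding each one with the exported kthread_bind().
 */
#include <linux/kthread.h>
#include <linux/cpumask.h>
#include <linux/err.h>

static int spawn_bound_threads(const struct cpumask *mask,
			       int (*fn)(void *), void *arg)
{
	unsigned int cpu;

	for_each_cpu(cpu, mask) {
		struct task_struct *t;

		t = kthread_create(fn, arg, "z_bound/%u", cpu);
		if (IS_ERR(t))
			return PTR_ERR(t);
		kthread_bind(t, cpu);	/* exported, unlike kthread_bind_mask() */
		wake_up_process(t);
	}
	return 0;
}
```

One consequence of this approach is that the thread count scales with the mask size (one thread per permitted CPU) rather than being independent of it, which may interact with `spl_taskq_thread_dynamic`.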