Open pebcakit opened 1 year ago
You'd probably want more granularity, like specifying the IO threads specifically, but yes, this is something I've wanted for a while.
Oh, this would be brilliant. Our intended use-case would be to assign zio taskqs to NUMA nodes and let the threads live their separate lives. We have solved it for now with this patch, which does the NUMA binding on its own, but it requires GPL-only symbols (which we have patched to not be GPL-only in our kernel build).
EDIT: though to make it usable that way, the SPL module would have to take a parameter with the number of taskqs, ideally the number of threads in every taskq, and, most importantly, a bind mask for every such taskq. (I vaguely remember considering that, but I bailed and patched the GPL-only symbols instead to make my life easier; no parsing needed that way :D)
I tried to figure out how to add this. Digging through the source, it seems that https://github.com/openzfs/zfs/blob/master/module/os/linux/spl/spl-taskq.c#L1059 uses `kthread_bind`, but the associated `kthread_bind_mask` is not exported. So the only way around it seems to be converting the CPU mask into a set of CPUs and then assigning one kthread per permitted CPU.
Describe the feature you would like to see added to OpenZFS
OpenZFS should support spawning its kthreads within a given cpumask or cgroup.
This would help reduce core contention in certain workloads, especially latency-sensitive ones.
Currently, `spl_taskq_thread_bind` and `spl_taskq_thread_dynamic` can be used to limit the number and frequency of ZFS threads, and to prevent them from migrating between CPUs by binding them to a fixed set of CPUs. However, there is no parameter to specify a predetermined cpumask on which to run these threads; instead, cores are chosen from all the CPUs available on the system.
OpenZFS should consider adding a parameter to allow a cpumask bind.
How will this feature improve OpenZFS?
This feature would extend the module's current CPU-tuning capabilities, allowing much finer control so that OpenZFS threads could be confined away from low-latency workloads with high CPU contention.
In my specific case, running the exact same workload on XFS reduced the application latency by ~50% and removed a large part of the jitter caused by CPU contention.
Adding this feature would allow OpenZFS to be used more extensively in low-latency environments, simply by moving/isolating its load elsewhere.
Additional context
By manually running `taskset` on `$(pgrep -f 'z_|dp_')`, I've seen great benefit in terms of latency for isolated threads that must wait on other threads that cannot be isolated and that context-switch very often because of ZFS. By measuring with `perf record -e sched:sched_switch`, it is very easy to spot, on a busy server, the number of `TASK_RUNNING` tasks that get de-scheduled in favor of `z_wr_int`, for example. All this accounts for a high `nvcswch/s` (non-voluntary context switches per second), visible with `pidstat -w`, leading to increased latency.