openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

HDD volumes can significantly affect the write performance of SSD volumes on the same server node #16518

Open TimLand opened 1 week ago

TimLand commented 1 week ago

System information

Type                  Version/Name
Distribution Name     Oracle Linux 7
Distribution Version  7.9
Kernel Version        5.4.17
Architecture          x86_64
OpenZFS Version       2.1.15

Describe the problem you're observing

The write performance of SSD volumes can be severely impacted by HDD volumes on the same server node. We created two RAID-Z storage pools on one node, one built entirely from mechanical drives (HDD) and the other entirely from SSDs, and then wrote sequentially to volumes from both pools at the same time. The sequential write speed of the SSD volume only reached 300 MB/s. When the writes to the HDD volume were stopped, the write speed of the SSD volume returned to its normal rate of 1600 MB/s.

Describe how to reproduce the problem

On the same node, create a RAID-Z (RAID 5-style) storage pool named hdd_pool from HDD drives and another RAID-Z pool named ssd_pool from SSD drives. Then create a volume called hdd_volume on hdd_pool and a volume called ssd_volume on ssd_pool. When fio writes to hdd_volume and ssd_volume simultaneously, the write speed of ssd_volume is only 300 MB/s and the write speed of hdd_volume is 287 MB/s. When the fio write to hdd_volume is stopped, the write speed of ssd_volume increases to 1600 MB/s.
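For reference, a minimal command sketch of this setup (device paths, volume sizes, and names here are illustrative placeholders, not the exact ones used):

# create a raidz pool on HDDs and one on SSDs (disk names are placeholders)
zpool create hdd_pool raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde
zpool create ssd_pool raidz /dev/sdf /dev/sdg /dev/sdh /dev/sdi

# create one zvol on each pool
zfs create -V 500G hdd_pool/hdd_volume
zfs create -V 500G ssd_pool/ssd_volume

# write to both zvols at the same time
fio --name=hdd --rw=write --direct=1 --numjobs=1 --ioengine=libaio --iodepth=64 --bs=1M --runtime=60 --size=100G --filename=/dev/zvol/hdd_pool/hdd_volume &
fio --name=ssd --rw=write --direct=1 --numjobs=1 --ioengine=libaio --iodepth=64 --bs=1M --runtime=60 --size=100G --filename=/dev/zvol/ssd_pool/ssd_volume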

The fio log while hdd_volume is also being written:

fio --name=test --rw=write --direct=1 --numjobs=1 --ioengine=libaio --iodepth=64 --bs=1M --group_reporting --runtime=60 --size=100G --filename=/dev/zd0
WRITE: io=891904KB, aggrb=200113KB/s, minb=200113KB/s, maxb=200113KB/s, mint=4457msec, maxt=4457msec

The same command after the writes to hdd_volume were stopped:

fio --name=test --rw=write --direct=1 --numjobs=1 --ioengine=libaio --iodepth=64 --bs=1M --group_reporting --runtime=60 --size=100G --filename=/dev/zd0
WRITE: io=100751MB, aggrb=1678.5MB/s, minb=1678.5MB/s, maxb=1678.5MB/s, mint=60026msec, maxt=60026msec

snajpa commented 1 week ago

What does the CPU usage look like while the HDD pool is in use?
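For example, one way to watch this while both fio jobs are running (any equivalent tool works):

mpstat -P ALL 1    # per-CPU utilization, updated every second
top -H             # per-thread view; watch the fio and ZFS I/O threads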

What's your preempt model now?

cat /sys/kernel/debug/sched/preempt

Can you try switching to full, if you're not already on it?
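A sketch of how that can be checked and switched at runtime, assuming the kernel was built with PREEMPT_DYNAMIC (mainline 5.12+; otherwise the model is fixed at build time by CONFIG_PREEMPT*):

cat /sys/kernel/debug/sched/preempt          # the active model is shown in parentheses
echo full > /sys/kernel/debug/sched/preempt  # switch the model at runtime
# to make it persistent, boot with preempt=full on the kernel command line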

It'd be really helpful if you could record fio and the ZFS threads with perf and then produce flame graphs.
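For instance, a rough sketch using perf and the FlameGraph scripts from https://github.com/brendangregg/FlameGraph (sample frequency and duration are just placeholders):

perf record -F 99 -a -g -- sleep 60     # system-wide, with call graphs, while both fio jobs run
perf script > out.perf
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl out.folded > flamegraph.svg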

But let's start with the preempt model. Why I'm saying it could play a role: I suspect there's a possibility that processes are spinning on-CPU while they wait for data, and that this spinning eats up CPU resources needed for other activity...

(Otherwise I have no idea what could be happening, so it's probably going to require a bit of back and forth.)

TimLand commented 1 week ago

Thanks, these are the flame graphs we've collected.

The following graph shows the results of running fio with the libaio IO engine:
[flame graph: hdd_ssd_vol_fio_with_libaio_write]

The following graph shows the results of running fio with the psync IO engine:
[flame graph: hdd_ssd_vol_fio_with_psync_write]

I have observed that when fio uses the psync ioengine, the performance of the SSD volume seems normal and is not impacted by the HDD volume. Does libaio maintain a global completion queue? When there are slow block devices, would that affect the speed at which elements are reaped from the completion queue?
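One way to probe whether the completion path is shared is to compare running the two targets as completely separate fio processes against a single job that spans both devices (hypothetical command lines; /dev/zd0 and /dev/zd16 are placeholders for the SSD and HDD zvol device nodes):

# separate processes: each fio job sets up its own libaio context
fio --name=ssd --rw=write --direct=1 --ioengine=libaio --iodepth=64 --bs=1M --runtime=60 --filename=/dev/zd0 &
fio --name=hdd --rw=write --direct=1 --ioengine=libaio --iodepth=64 --bs=1M --runtime=60 --filename=/dev/zd16 &

# single job spanning both devices: one context and one shared iodepth,
# so slow completions can occupy queue slots the fast device would otherwise use
fio --name=both --rw=write --direct=1 --ioengine=libaio --iodepth=64 --bs=1M --runtime=60 --filename=/dev/zd0:/dev/zd16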

amotin commented 1 week ago

Since ZFS does not really support async I/O, it can only execute as many simultaneous I/Os as there are threads to issue them. Those can be either kernel threads or ZFS zvol threads, but neither is infinite, and either can become a bottleneck. The threading model for zvols, if that is what's happening here, was reworked recently in ZFS master in https://github.com/openzfs/zfs/pull/15992. Meanwhile you may try experimenting with the zvol_threads module parameter.
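A sketch of how that parameter could be inspected and raised, assuming a 2.1.x build where zvol_threads is a module-load-time option (the default and whether a change applies without reloading the module may vary):

cat /sys/module/zfs/parameters/zvol_threads
# e.g. in /etc/modprobe.d/zfs.conf:
#   options zfs zvol_threads=64
# then export the pools and reload the zfs module (or reboot) for it to take effect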