openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZFS Send & RaidZ - Poor performance on HDD (WD 18TB) #14917

Closed akarzazi closed 6 months ago

akarzazi commented 1 year ago

Edit: It seems the issue affects only specific HDD drives, such as the WD 18 TB WUH721818AL5201. See the comments below for details.


The Issue

The performance of zfs send drops significantly with raidz layouts compared to the pool's read/write performance.

To highlight the problem I use zfs send -L tank@1 | pv > /dev/null

The setup

10 HDD drives: 18 TB each, ~260 MB/s

Here are some benchmarks to understand the problem.

All numbers are throughput in MB/s.

| Layout | zfs send | Fio read | Fio write |
| ------ | -------- | -------- | --------- |
| stripe 10*1 | 830 | 1800 | 2800 |
| mirror 5*2 | 900 | 1700 | 1500 |
| raidz 1*10 | 180 !! | 1300 | 1500 |
| raidz 2*5 | 210 !! | 1100 | 1700 |
| raidz2 1*10 | 220 !! | 1400 | 1600 |

Remarks:

Reproduce the issue

Pool setup

# stripe 10*1
zpool create -O recordsize=1m  tank  /dev/disk/by-vdev/hdd{01..10}

# mirror  5*2 
zpool create -O recordsize=1m  tank mirror hdd01 hdd02  mirror hdd03 hdd04  mirror hdd05 hdd06  mirror hdd07 hdd08  mirror hdd09 hdd10

# raidz 1*10
zpool create -O recordsize=1m  tank raidz /dev/disk/by-vdev/hdd{01..10}

# raidz    2*5
zpool create -O recordsize=1m  tank raidz /dev/disk/by-vdev/hdd{01..05} raidz /dev/disk/by-vdev/hdd{06..10}

# raidz2
zpool create -O recordsize=1m  tank raidz2 /dev/disk/by-vdev/hdd{01..10}

The test

# Creates a 16 GB file & Fio write

fio --ioengine=libaio --name=a --group_reporting=1 --eta-newline=1 --iodepth=16 --direct=1 --bs=1M --filename=/tank/a.dat --numjobs=4 --size=4G --offset_increment=4G --rw=write;

zfs snapshot tank@1;

 # Clear cache
zpool export tank; zpool import tank; 

# fio read
fio --ioengine=libaio --name=a --group_reporting=1 --eta-newline=1 --iodepth=16 --direct=1 --bs=1M --filename=/tank/a.dat --numjobs=4 --size=4G --offset_increment=4G --rw=read;

# Clear cache
zpool export tank; zpool import tank; 

# Test
zfs send -L tank@1  | pv > /dev/null
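
For convenience, here is the same sequence as a single script (a sketch; it assumes a tank pool from one of the layouts above already exists and that fio and pv are installed):

```
#!/bin/bash
set -euo pipefail

# Shared fio options: 4 jobs x 4 GB = one 16 GB test file at /tank/a.dat
FIO_OPTS="--ioengine=libaio --group_reporting=1 --eta-newline=1 --iodepth=16 --direct=1 --bs=1M --filename=/tank/a.dat --numjobs=4 --size=4G --offset_increment=4G"

fio --name=a $FIO_OPTS --rw=write        # write the test file
zfs snapshot tank@1

zpool export tank && zpool import tank   # clear the cache (ARC)

fio --name=a $FIO_OPTS --rw=read         # baseline pool read throughput

zpool export tank && zpool import tank   # clear the cache again

zfs send -L tank@1 | pv > /dev/null      # the workload under test
```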

Context

I use zfs send to replicate datasets. A few terabytes are written and erased every day; it is a simple n-day backup rotation. Until now I have used mirror setups, but when I tried a raidz layout I noticed very poor zfs send performance on HDDs.

Unfortunately, with such low speeds, there is no way we can replicate the data within our time window.

ZFS version

Ubuntu 22.04.2 LTS
zfs-2.1.5-1ubuntu6~22.04.1
zfs-kmod-2.1.5-1ubuntu6~22.04.1

Is the issue specific to my setup? No.

I've reproduced the issue on a fresh install on another machine with a newer version of Ubuntu and ZFS.

Ubuntu 23.04
zfs-2.1.9-2ubuntu1
zfs-kmod-2.1.9-2ubuntu1

(edit) Also reproduced the same performance numbers on a fresh install of an older Debian with an older ZFS:

Debian GNU/Linux 10 (buster)
zfs-2.0.3-9~bpo10+1
zfs-kmod-2.0.3-9~bpo10+1

(edit) Also reproduced the same performance numbers on a fresh install of FreeBSD with its ZFS:

FreeBSD 13.1-RELEASE
zfs-2.1.4-FreeBSD_g52bad4f23
zfs-kmod-2.1.4-FreeBSD_g52bad4f23

(edit) Also reproduced with ZFS compiled from source at tag 2.1.11 (following https://uptrace.dev/blog/ubuntu-install-zfs.html):

Ubuntu 22.04.2 LTS
zfs-2.1.11-1
zfs-kmod-2.1.11-1

Summary

I think this is a performance bug.

Do you experience the same issue? Is there any flag to mitigate this?

Hardware details - disks and controllers

```
# cpu
Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz

# lshw -class disk -class storage
*-sas
     description: Serial Attached SCSI controller
     product: SAS3008 PCI-Express Fusion-MPT SAS-3
     vendor: Broadcom / LSI
     physical id: 0
     bus info: pci@0000:18:00.0
     logical name: scsi0
     version: 02
     width: 64 bits
     clock: 33MHz
     capabilities: sas pm pciexpress msi msix bus_master cap_list rom
     configuration: driver=mpt3sas latency=0
     resources: irq:40 ioport:5000(size=256) memory:dae40000-dae4ffff memory:dae00000-dae3ffff memory:dad00000-dadfffff
   *-disk:0
        description: SCSI Disk
        product: WUH721818AL5201
        vendor: WDC
        physical id: 0.8.0
        bus info: scsi@0:0.8.0
        logical name: /dev/sdk
        version: B680
        serial: 4BHUUJ3V
        size: 16TiB (18TB)
        capacity: 19TiB (21TB)
        capabilities: 7200rpm gpt-1.00 partitioned partitioned:gpt
        configuration: ansiversion=7 guid=61117a9f-275b-f747-9eba-c82658a13ad5 logicalsectorsize=512 sectorsize=4096
...

# fdisk -l
Disk /dev/sdu: 16.37 TiB, 18000207937536 bytes, 35156656128 sectors
Disk model: WUH721818AL5201
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
```

rincebrain commented 1 year ago

830 already seems slow for 10 spinning disks, to be honest. What's the hardware involved, in more detail?

You may also find this germane, potentially.

akarzazi commented 1 year ago

I've tried init_on_alloc=0, but the results were the same.

Here is the procedure I followed:

- Edit grub: set GRUB_CMDLINE_LINUX_DEFAULT="init_on_alloc=0"
- Run `update-grub`
- Reboot
- Check the setting was applied:

```
cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.15.0-72-generic root=UUID=4a01f6be-3e53-4134-9339-dee90a38790d ro nomodeset iommu=pt console=tty0 console=ttyS1,115200n8 init_on_alloc=0
```

I added some hardware details (see edit), but I don't think the issue is hardware related.

edit: I also tried with both init_on_alloc=0 and init_on_free=0, but saw no improvement.

rincebrain commented 1 year ago

I suppose that's on me for not asking the right question.

"What are all of the pieces of hardware physically between the server CPU and the actual disks, including all SAS/SATA controllers and enclosures"?

Also, "what does zfs get all on the dataset you're sending say?"

I'm not saying it is or isn't a performance issue with ZFS specifically, just trying to understand the thing that's running.

You could also look at perf top while it's running to see whether it's spending most of its CPU time in system and whether anything obviously stands out as the big time sink, though that will only catch things burning CPU cycles, not things that are, say, waiting on locks.
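
For reference, one way to capture this while the send is running in another shell (a sketch using standard perf usage):

```
# Live view of the hottest kernel/user symbols
perf top

# Or record 30 seconds of system-wide samples with call graphs for later analysis
perf record -a -g -- sleep 30
perf report
```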

akarzazi commented 1 year ago

The physical machine is hosted in the cloud.

I ran a few commands to extract the SAS/SATA info. Let me know if there are better commands to extract the desired information.

SAS/SATA info

```
# lspci | grep SATA
00:11.5 SATA controller: Intel Corporation C620 Series Chipset Family SSATA Controller [AHCI mode] (rev 09)
00:17.0 SATA controller: Intel Corporation C620 Series Chipset Family SATA Controller [AHCI mode] (rev 09)

# lspci | grep SAS
18:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
d8:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
```

Here is the output of perf top

```
 16.67%  [kernel]   [k] fletcher_4_avx512f_native
  4.33%  [kernel]   [k] memcpy_erms
  2.53%  [kernel]   [k] clear_page_erms
  1.92%  [kernel]   [k] pipe_write
  1.79%  [kernel]   [k] psi_group_change
  1.72%  [kernel]   [k] try_charge_memcg
  1.72%  [kernel]   [k] anon_pipe_buf_release
  1.71%  [kernel]   [k] _raw_spin_lock_irq
  1.57%  [unknown]  [.] 0000000000000000
  1.44%  [kernel]   [k] obj_cgroup_charge_pages
  1.37%  [kernel]   [k] __memcg_kmem_charge_page
  1.36%  [kernel]   [k] menu_select
  1.34%  [kernel]   [k] _raw_spin_lock_irqsave
  1.13%  [kernel]   [k] page_counter_cancel
  1.04%  [kernel]   [k] _raw_spin_lock
  0.82%  perf       [.] dso__find_symbol
  0.78%  [kernel]   [k] rmqueue_bulk
```

The ZFS options are the defaults.

ZFS get all

```
NAME     PROPERTY              VALUE                  SOURCE
tank     type                  filesystem             -
tank     creation              Wed May 31 8:52 2023   -
tank     used                  15.3G                  -
tank     available             125T                   -
tank     referenced            15.3G                  -
tank     compressratio         1.00x                  -
tank     mounted               yes                    -
tank     quota                 none                   default
tank     reservation           none                   default
tank     recordsize            1M                     local
tank     mountpoint            /tank                  default
tank     sharenfs              off                    default
tank     checksum              on                     default
tank     compression           off                    default
tank     atime                 on                     default
tank     devices               on                     default
tank     exec                  on                     default
tank     setuid                on                     default
tank     readonly              off                    default
tank     zoned                 off                    default
tank     snapdir               hidden                 default
tank     aclmode               discard                default
tank     aclinherit            restricted             default
tank     createtxg             1                      -
tank     canmount              on                     default
tank     xattr                 on                     default
tank     copies                1                      default
tank     version               5                      -
tank     utf8only              off                    -
tank     normalization         none                   -
tank     casesensitivity       sensitive              -
tank     vscan                 off                    default
tank     nbmand                off                    default
tank     sharesmb              off                    default
tank     refquota              none                   default
tank     refreservation        none                   default
tank     guid                  7333539194275455043    -
tank     primarycache          all                    default
tank     secondarycache        all                    default
tank     usedbysnapshots       128K                   -
tank     usedbydataset         15.3G                  -
tank     usedbychildren        1.31M                  -
tank     usedbyrefreservation  0B                     -
tank     logbias               latency                default
tank     objsetid              54                     -
tank     dedup                 off                    default
tank     mlslabel              none                   default
tank     sync                  standard               default
tank     dnodesize             legacy                 default
tank     refcompressratio      1.00x                  -
tank     written               128K                   -
tank     logicalused           16.0G                  -
tank     logicalreferenced     16.0G                  -
tank     volmode               default                default
tank     filesystem_limit      none                   default
tank     snapshot_limit        none                   default
tank     filesystem_count      none                   default
tank     snapshot_count        none                   default
tank     snapdev               hidden                 default
tank     acltype               off                    default
tank     context               none                   default
tank     fscontext             none                   default
tank     defcontext            none                   default
tank     rootcontext           none                   default
tank     relatime              off                    default
tank     redundant_metadata    all                    default
tank     overlay               on                     default
tank     encryption            off                    default
tank     keylocation           none                   default
tank     keyformat             none                   default
tank     pbkdf2iters           0                      default
tank     special_small_blocks  0                      default
tank@1   type                  snapshot               -
tank@1   creation              Wed May 31 8:53 2023   -
tank@1   used                  128K                   -
tank@1   referenced            15.3G                  -
tank@1   compressratio         1.00x                  -
tank@1   devices               on                     default
tank@1   exec                  on                     default
tank@1   setuid                on                     default
tank@1   createtxg             17                     -
tank@1   xattr                 on                     default
tank@1   version               5                      -
tank@1   utf8only              off                    -
tank@1   normalization         none                   -
tank@1   casesensitivity       sensitive              -
tank@1   nbmand                off                    default
tank@1   guid                  15846016211071391827   -
tank@1   primarycache          all                    default
tank@1   secondarycache        all                    default
tank@1   defer_destroy         off                    -
tank@1   userrefs              0                      -
tank@1   objsetid              896                    -
tank@1   mlslabel              none                   default
tank@1   refcompressratio      1.00x                  -
tank@1   written               15.3G                  -
tank@1   logicalreferenced     16.0G                  -
tank@1   acltype               off                    default
tank@1   context               none                   default
tank@1   fscontext             none                   default
tank@1   defcontext            none                   default
tank@1   rootcontext           none                   default
tank@1   encryption            off                    default
```

rincebrain commented 1 year ago

You could try twiddling /sys/module/zfs/parameters/zfs_fletcher_4_impl from fastest to avx2 - it tries to pick which is fastest based on a microbenchmark at load time, and it's not impossible that the AVX512F implementation is faster briefly but slower in a sustained workload.
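
For reference, a sketch of how to inspect and override the implementation at runtime (the module parameter path is the standard one on Linux):

```
# List the available fletcher_4 implementations; the active one is shown in brackets
cat /sys/module/zfs/parameters/zfs_fletcher_4_impl

# Force the avx2 implementation (write "fastest" to return to the default choice)
echo avx2 > /sys/module/zfs/parameters/zfs_fletcher_4_impl
```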

You could also try setting the environment variable ZFS_SET_PIPE_MAX, but to quote the man page:

     ZFS_SET_PIPE_MAX  Tells zfs to set the maximum pipe size for sends/receives.  Disabled by default on Linux due to an unfixed deadlock in Linux's pipe size handling code.
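
A minimal way to apply it for a single run is to prefix it to the send command, e.g.:

```
ZFS_SET_PIPE_MAX=1 zfs send -L tank@1 | pv > /dev/null
```
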
akarzazi commented 1 year ago

Yes, the microbenchmark shows avx512bw as the chosen (fastest) implementation.

cat /proc/spl/kstat/zfs/fletcher_4_bench

```
cat /proc/spl/kstat/zfs/fletcher_4_bench
0 0 0x01 -1 0 21598536265 15069737193645
implementation   native         byteswap
scalar           5251545049     2656689285
superscalar      7060244229     2973279287
superscalar4     6266636984     3586653212
sse2             11312201134    4270711295
ssse3            11560554373    8832345013
avx2             17953485266    14114608098
avx512f          26607164509    9616000745
avx512bw         27205304640    23023088203
fastest          avx512bw       avx512bw
```

After selecting avx2 explicitly, perf top confirms that the avx2 implementation is indeed used:

21.74%  [kernel]                 [k] fletcher_4_avx2_native
 4.12%  [kernel]                 [k] memcpy_erms
 3.24%  [kernel]                 [k] clear_page_erms

But it did not improve performance. I did not expect the avx512-to-avx2 change to recover the 80% performance loss anyway.

export ZFS_SET_PIPE_MAX=1 did not help either.

Did you try to reproduce the issue?

akarzazi commented 1 year ago

I reproduced this issue in multiple environments and across several versions of ZFS from 2.0.3 to 2.1.11. (see edits)

akarzazi commented 1 year ago

It seems that the issue affects only some HDD models.

❌ WUH721818AL5201 (EAMR) - WD Ultrastar DC HC550, 18 TB - https://documents.westerndigital.com/content/dam/doc-library/en_us/assets/public/western-digital/product/data-center-drives/ultrastar-dc-hc500-series/product-manual-ultrastar-dc-hc550-sas-oem-spec.pdf

✅ WUH721414AL5201 (CMR) - WD Ultrastar DC HC530, 14 TB - https://documents.westerndigital.com/content/dam/doc-library/en_us/assets/public/western-digital/product/data-center-drives/ultrastar-dc-hc500-series/data-sheet-ultrastar-dc-hc530.pdf

✅ ST18000NM005J (CMR) - Seagate Exos X18, 18 TB - https://www.seagate.com/files/www-content/datasheets/pdfs/exos-x18-channel-DS2045-1-2007GB-en_SG.pdf

Here is a comparison benchmark with raidz1 on 4 disks; all drives are in the same computer on the same controller (see hardware details).

zpool create -O recordsize=1m -o ashift=12 tank raidz1 hdd{01..04}

| HDD model | zfs send (MB/s) | Fio read (MB/s) | Fio write (MB/s) |
| --------- | --------------- | --------------- | ---------------- |
| WUH721818AL5201 | ❌ 180 | 600 | 900 |
| WUH721414AL5201 | 715 | 620 | 900 |
| ST18000NM005J | 760 | 680 | 990 |

According to Western Digital, the ❌ WUH721818AL5201 does not use SMR but EAMR. The other drives are clearly documented as CMR by their manufacturers.

What is even stranger is that normal workloads perform well, and even rebuilds run as fast as the drives allow (250 MB/s); zfs send is the only workload that shows a degradation.

There may be something wrong with EAMR. Is anybody experiencing the same issue? Any ideas?

shodanshok commented 1 year ago

Try lowering zfs_pd_bytes_max to 8 MB or less.

If it does not change anything, try disabling NCQ via echo 1 > /sys/block/sd*/device/queue_depth
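
A sketch of how both suggestions can be applied at runtime (sdb..sdk are placeholder device names for the pool members; substitute your own):

```
# Lower zfs_pd_bytes_max (the send/receive prefetch budget, default 52428800) to 8 MB
echo $((8 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_pd_bytes_max

# Set queue_depth=1 on each pool disk to effectively disable NCQ; the loop avoids
# bash's "ambiguous redirect" when the glob matches several files
for d in /sys/block/sd{b..k}/device/queue_depth; do
    echo 1 > "$d"
done
```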

akarzazi commented 1 year ago

@shodanshok Neither of them worked. With NCQ disabled, the zfs send workload was even worse: 70 MB/s.

akarzazi commented 1 year ago

Using smartctl -a, I see a lot of "Correction algorithm invocations" on the WD 18 TB EAMR WUH721818AL5201 models. Each time I run zfs send -L tank@1 | pv > /dev/null, the "Correction algorithm invocations" count increases by more than 1000 for a 16 GB send.

I see the same behavior on the other 10 disks of this model.

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0      10151        225.919           0
write:         0        0         0         0       9014       5834.570           0
verify:        0        0         0         0        155          0.000           0

Workloads other than zfs send increase this value by a little.
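
For reference, a minimal way to keep an eye on this counter while a send is running (a sketch; /dev/sdk is a placeholder for one of the WD devices):

```
# Refresh the SAS error counter log every 5 seconds during the zfs send
watch -n 5 'smartctl -a /dev/sdk | grep -A 6 "Error counter log"'
```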

No errors are reported on the Seagate disks.

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0        197.682           0
write:         0        0         0         0          0       1258.179           0
shodanshok commented 1 year ago

Does the same "Correction algorithm invocations" counter increase during a plain, raw read from the disk? I.e., when running something like dd if=/dev/youreamrdisk of=/dev/null bs=1M count=16384 iflag=direct

akarzazi commented 1 year ago

Yes, it does increase. Here is a sample for a 16 GB file spread across a 4-drive raidz1 array.

Before

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0      12226        245.446           0
write:         0        0         0         0      16575      10377.296           0
verify:        0        0         0         0        288          0.000           0

Run

dd if=/tank/a.dat of=/dev/null bs=1M count=16384 iflag=direct
16384+0 records in
16384+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 61.5346 s, 279 MB/s

After

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0      12815        249.735           0
write:         0        0         0         0      16575      10377.301           0
verify:        0        0         0         0        288          0.000           0

We see nearly 600 invocations for a 4 GB read on the disk. What does this mean?

shodanshok commented 1 year ago

Yes, it does increase. Here is a sample for a 16 GB file spread across a 4-drive raidz1 array.

Ok, so it should not matter for zfs send

We see nearly 600 invocations for a 4 GB read on the disk. What does this mean?

Very little, unfortunately. The vendor can use this (and other) fields in uncommon ways.

Can you try setting zfs_traverse_indirect_prefetch_limit=1024 and restoring zfs_pd_bytes_max=52428800 (ie: its initial value)?
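
A sketch of applying both settings at runtime through the module parameters:

```
# Raise the indirect-block prefetch limit used by send traversal
echo 1024 > /sys/module/zfs/parameters/zfs_traverse_indirect_prefetch_limit

# Restore zfs_pd_bytes_max to its default of 50 MB
echo 52428800 > /sys/module/zfs/parameters/zfs_pd_bytes_max
```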

akarzazi commented 1 year ago

@shodanshok it did not help.

zfs_pd_bytes_max was already 52428800. zfs_traverse_indirect_prefetch_limit was 32; I set it with:

echo 1024 > /sys/module/zfs/parameters/zfs_traverse_indirect_prefetch_limit

Maybe updating the firmware could help, but there is no documentation on how to do it: https://support-en.westerndigital.com/app/answers/detail/a_id/29514

akarzazi commented 6 months ago

Western Digital released a new firmware that fixed the issue.