830 already seems slow for 10 spinning disks, to be honest. What's the hardware involved, in more detail?
You may also find this germane, potentially.
I've tried out `init_on_alloc=0`, but the results were the same.
- Edit grub, set `GRUB_CMDLINE_LINUX_DEFAULT="init_on_alloc=0"`
- Run `update-grub`
- Reboot
- Check the setting was applied:
```
cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.15.0-72-generic root=UUID=4a01f6be-3e53-4134-9339-dee90a38790d ro nomodeset iommu=pt console=tty0 console=ttyS1,115200n8 init_on_alloc=0
```
I added some hardware details (see edit), but I don't think it is hardware-related.
edit: also tried with both `init_on_alloc=0 init_on_free=0`, but no improvement.
I suppose that's on me for not asking the right question.
"What are all of the pieces of hardware physically between the server CPU and the actual disks, including all SAS/SATA controllers and enclosures"?
Also, "what does `zfs get all` on the dataset you're sending say?"
I'm not saying it is or isn't a performance issue with ZFS specifically, just trying to understand the thing that's running.
You could also look at `perf top` while it's running, if it's spending most of its CPU time in system, and see if anything obviously stands out as the big timesink, though that's only going to notice things burning CPU cycles, not things, say, waiting on locks.
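A minimal invocation for that kind of check (a sketch; run it in a second terminal while the send is active):

```
# sample system-wide with call graphs so kernel hot spots show up; press 'q' to quit
perf top -g
```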
The physical machine is hosted in the cloud.
I ran a few commands to extract SAS/SATA info. Let me know if there are better commands to extract the desired information.
```
# lspci | grep SATA
00:11.5 SATA controller: Intel Corporation C620 Series Chipset Family SSATA Controller [AHCI mode] (rev 09)
00:17.0 SATA controller: Intel Corporation C620 Series Chipset Family SATA Controller [AHCI mode] (rev 09)

# lspci | grep SAS
18:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
d8:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
```
```
 16.67%  [kernel]    [k] fletcher_4_avx512f_native
  4.33%  [kernel]    [k] memcpy_erms
  2.53%  [kernel]    [k] clear_page_erms
  1.92%  [kernel]    [k] pipe_write
  1.79%  [kernel]    [k] psi_group_change
  1.72%  [kernel]    [k] try_charge_memcg
  1.72%  [kernel]    [k] anon_pipe_buf_release
  1.71%  [kernel]    [k] _raw_spin_lock_irq
  1.57%  [unknown]   [.] 0000000000000000
  1.44%  [kernel]    [k] obj_cgroup_charge_pages
  1.37%  [kernel]    [k] __memcg_kmem_charge_page
  1.36%  [kernel]    [k] menu_select
  1.34%  [kernel]    [k] _raw_spin_lock_irqsave
  1.13%  [kernel]    [k] page_counter_cancel
  1.04%  [kernel]    [k] _raw_spin_lock
  0.82%  perf        [.] dso__find_symbol
  0.78%  [kernel]    [k] rmqueue_bulk
```
The ZFS options are the default ones:
```
NAME    PROPERTY              VALUE                  SOURCE
tank    type                  filesystem             -
tank    creation              Wed May 31 8:52 2023   -
tank    used                  15.3G                  -
tank    available             125T                   -
tank    referenced            15.3G                  -
tank    compressratio         1.00x                  -
tank    mounted               yes                    -
tank    quota                 none                   default
tank    reservation           none                   default
tank    recordsize            1M                     local
tank    mountpoint            /tank                  default
tank    sharenfs              off                    default
tank    checksum              on                     default
tank    compression           off                    default
tank    atime                 on                     default
tank    devices               on                     default
tank    exec                  on                     default
tank    setuid                on                     default
tank    readonly              off                    default
tank    zoned                 off                    default
tank    snapdir               hidden                 default
tank    aclmode               discard                default
tank    aclinherit            restricted             default
tank    createtxg             1                      -
tank    canmount              on                     default
tank    xattr                 on                     default
tank    copies                1                      default
tank    version               5                      -
tank    utf8only              off                    -
tank    normalization         none                   -
tank    casesensitivity       sensitive              -
tank    vscan                 off                    default
tank    nbmand                off                    default
tank    sharesmb              off                    default
tank    refquota              none                   default
tank    refreservation        none                   default
tank    guid                  7333539194275455043    -
tank    primarycache          all                    default
tank    secondarycache        all                    default
tank    usedbysnapshots       128K                   -
tank    usedbydataset         15.3G                  -
tank    usedbychildren        1.31M                  -
tank    usedbyrefreservation  0B                     -
tank    logbias               latency                default
tank    objsetid              54                     -
tank    dedup                 off                    default
tank    mlslabel              none                   default
tank    sync                  standard               default
tank    dnodesize             legacy                 default
tank    refcompressratio      1.00x                  -
tank    written               128K                   -
tank    logicalused           16.0G                  -
tank    logicalreferenced     16.0G                  -
tank    volmode               default                default
tank    filesystem_limit      none                   default
tank    snapshot_limit        none                   default
tank    filesystem_count      none                   default
tank    snapshot_count        none                   default
tank    snapdev               hidden                 default
tank    acltype               off                    default
tank    context               none                   default
tank    fscontext             none                   default
tank    defcontext            none                   default
tank    rootcontext           none                   default
tank    relatime              off                    default
tank    redundant_metadata    all                    default
tank    overlay               on                     default
tank    encryption            off                    default
tank    keylocation           none                   default
tank    keyformat             none                   default
tank    pbkdf2iters           0                      default
tank    special_small_blocks  0                      default
tank@1  type                  snapshot               -
tank@1  creation              Wed May 31 8:53 2023   -
tank@1  used                  128K                   -
tank@1  referenced            15.3G                  -
tank@1  compressratio         1.00x                  -
tank@1  devices               on                     default
tank@1  exec                  on                     default
tank@1  setuid                on                     default
tank@1  createtxg             17                     -
tank@1  xattr                 on                     default
tank@1  version               5                      -
tank@1  utf8only              off                    -
tank@1  normalization         none                   -
tank@1  casesensitivity       sensitive              -
tank@1  nbmand                off                    default
tank@1  guid                  15846016211071391827   -
tank@1  primarycache          all                    default
tank@1  secondarycache        all                    default
tank@1  defer_destroy         off                    -
tank@1  userrefs              0                      -
tank@1  objsetid              896                    -
tank@1  mlslabel              none                   default
tank@1  refcompressratio      1.00x                  -
tank@1  written               15.3G                  -
tank@1  logicalreferenced     16.0G                  -
tank@1  acltype               off                    default
tank@1  context               none                   default
tank@1  fscontext             none                   default
tank@1  defcontext            none                   default
tank@1  rootcontext           none                   default
tank@1  encryption            off                    default
```
You could try twiddling `/sys/module/zfs/parameters/zfs_fletcher_4_impl` from `fastest` to `avx2` - it tries to pick which is fastest based on a microbenchmark at load time, and it's not impossible that the AVX512F implementation is faster briefly but slower in a sustained workload.
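For example (a sketch, assuming the standard module parameter path):

```
cat /sys/module/zfs/parameters/zfs_fletcher_4_impl           # list available implementations and the selected one
echo avx2 > /sys/module/zfs/parameters/zfs_fletcher_4_impl   # force avx2 instead of 'fastest'
```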
You could also try setting the environment variable `ZFS_SET_PIPE_MAX`, but to quote the man page:

> ZFS_SET_PIPE_MAX  Tells zfs to set the maximum pipe size for sends/recieves. Disabled by default on Linux due to an unfixed deadlock in Linux's pipe size handling code.
Yes, the microbenchmark shows avx512bw as the chosen and fastest implementation.
```
cat /proc/spl/kstat/zfs/fletcher_4_bench
0 0 0x01 -1 0 21598536265 15069737193645
implementation   native         byteswap
scalar           5251545049     2656689285
superscalar      7060244229     2973279287
superscalar4     6266636984     3586653212
sse2             11312201134    4270711295
ssse3            11560554373    8832345013
avx2             17953485266    14114608098
avx512f          26607164509    9616000745
avx512bw         27205304640    23023088203
fastest          avx512bw       avx512bw
```
After explicitly choosing avx2, `perf top` shows that avx2 is indeed used:
21.74% [kernel] [k] fletcher_4_avx2_native
4.12% [kernel] [k] memcpy_erms
3.24% [kernel] [k] clear_page_erms
But it did not improve performance. I did not expect the avx512-to-avx2 change to recover the 80% performance loss anyway.
`export ZFS_SET_PIPE_MAX=1` did not help either.
Did you try to reproduce the issue?
I reproduced this issue in multiple environments and across several versions of ZFS, from 2.0.3 to 2.1.11 (see edits).
It seems that the issue impacts only certain HDD models.
❌WUH721818AL5201 EAMR WD Ultrastar DC HC550 WUH721818AL5201 - 18 TB https://documents.westerndigital.com/content/dam/doc-library/en_us/assets/public/western-digital/product/data-center-drives/ultrastar-dc-hc500-series/product-manual-ultrastar-dc-hc550-sas-oem-spec.pdf
✅WUH721414AL5201 CMR WD Ultrastar DC HC530 WUH721414AL5201 - 14 TB https://documents.westerndigital.com/content/dam/doc-library/en_us/assets/public/western-digital/product/data-center-drives/ultrastar-dc-hc500-series/data-sheet-ultrastar-dc-hc530.pdf
✅ ST18000NM005J CMR Seagate Exos X18 ST18000NM005J - 18TB https://www.seagate.com/files/www-content/datasheets/pdfs/exos-x18-channel-DS2045-1-2007GB-en_SG.pdf
Here is a comparison benchmark with raidz1 on 4 disks; all drives are on the same computer, behind the same controller (see hardware details).
zpool create -O recordsize=1m -o ashift=12 tank raidz1 hdd{01..04}
HDD type | ZFS send (MB/s) | Fio Read (MB/s) | Fio Write (MB/s) |
---|---|---|---|
WUH721818AL5201 | ❌180 | 600 | 900 |
WUH721414AL5201 | 715 | 620 | 900 |
ST18000NM005J | 760 | 680 | 990 |
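For reference, hypothetical fio jobs along these lines would produce comparable sequential numbers (the job parameters here are an assumption, not the exact ones used):

```
# 1M sequential write and read against the mounted pool
fio --name=seqwrite --directory=/tank --size=16G --bs=1M --rw=write --ioengine=psync --end_fsync=1
fio --name=seqread  --directory=/tank --size=16G --bs=1M --rw=read  --ioengine=psync
```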
According to Western Digital, the ❌WUH721818AL5201 does not use SMR but EAMR. The other drives are clearly documented as CMR by the manufacturer.
What is even stranger is that the normal workload performs well, and even rebuilds are as fast as the drive can go (250 MB/s); `zfs send` is the only workload that shows a degradation.
There may be something wrong with EAMR. Is anybody experiencing the same issue? Any ideas?
Try lowering `zfs_pd_bytes_max` to 8 MB or less. If it does not change anything, try disabling NCQ via `echo 1 > /sys/block/sd*/device/queue_depth`.
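A minimal sketch of both suggestions, assuming the standard module parameter paths:

```
# lower the send/traverse prefetch budget to 8 MiB (the default is 52428800, i.e. 50 MiB)
echo 8388608 > /sys/module/zfs/parameters/zfs_pd_bytes_max

# disable NCQ on each member disk (queue depth 1 = no native command queuing)
for q in /sys/block/sd*/device/queue_depth; do echo 1 > "$q"; done
```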
@shodanshok None of them worked.
With NCQ disabled, the `zfs send` workload was even worse: 70 MB/s.
Using `smartctl -a`, I see a lot of "Correction algorithm invocations" on the WD 18 TB EAMR WUH721818AL5201 models.
Each time I run the command `zfs send -L tank@1 | pv > /dev/null`, the "Correction algorithm invocations" count increases by more than 1000 for a 16 GB send.
I see the same behavior on the 10 other disks of this model.
    Error counter log:
               Errors Corrected by           Total   Correction     Gigabytes    Total
                   ECC          rereads/    errors   algorithm      processed    uncorrected
               fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
    read:          0        0         0         0      10151        225.919           0
    write:         0        0         0         0       9014       5834.570           0
    verify:        0        0         0         0        155          0.000           0
Workloads other than `zfs send` increase this value only slightly.
No errors are reported on the Seagate disks.
    Error counter log:
               Errors Corrected by           Total   Correction     Gigabytes    Total
                   ECC          rereads/    errors   algorithm      processed    uncorrected
               fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
    read:          0        0         0         0          0        197.682           0
    write:         0        0         0         0          0       1258.179           0
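A simple way to sample the counter before and after a run (a sketch; `/dev/sdk` stands in for one of the WD drives):

```
smartctl -a /dev/sdk | grep -A 6 'Error counter log'
```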
Does the same "Correction algorithm invocations" counter increase during a plain, raw read from the disk? I.e. when running something like `dd if=/dev/youreamrdisk of=/dev/null bs=1M count=16384 iflag=direct`
Yes, it does increase. Here is a sample for a 16 GB file spread over a 4-drive raidz1 array.
Before

    Error counter log:
               Errors Corrected by           Total   Correction     Gigabytes    Total
                   ECC          rereads/    errors   algorithm      processed    uncorrected
               fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
    read:          0        0         0         0      12226        245.446           0
    write:         0        0         0         0      16575      10377.296           0
    verify:        0        0         0         0        288          0.000           0
Run

    dd if=/tank/a.dat of=/dev/null bs=1M count=16384 iflag=direct
    16384+0 records in
    16384+0 records out
    17179869184 bytes (17 GB, 16 GiB) copied, 61.5346 s, 279 MB/s
After

    Error counter log:
               Errors Corrected by           Total   Correction     Gigabytes    Total
                   ECC          rereads/    errors   algorithm      processed    uncorrected
               fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
    read:          0        0         0         0      12815        249.735           0
    write:         0        0         0         0      16575      10377.301           0
    verify:        0        0         0         0        288          0.000           0
We see nearly 600 invocations for a 4 GB read on the disk. What does this mean?
> Yes, it does increase. Here is a sample for a 16 GB file spread over a 4-drive raidz1 array.

Ok, so it should not matter for `zfs send`.
> We see nearly 600 invocations for a 4 GB read on the disk. What does this mean?

Very little, unfortunately. Vendors can use this (and other) fields in uncommon ways.
Can you try setting `zfs_traverse_indirect_prefetch_limit=1024` and restoring `zfs_pd_bytes_max=52428800` (i.e. its initial value)?
@shodanshok It did not help.
`zfs_pd_bytes_max` was already 52428800.
`zfs_traverse_indirect_prefetch_limit` was 32; I set it with:
echo 1024 > /sys/module/zfs/parameters/zfs_traverse_indirect_prefetch_limit
Maybe updating the firmware can help, but there is no documentation on how to update it: https://support-en.westerndigital.com/app/answers/detail/a_id/29514
Western Digital released a new firmware that fixed the issue.
Edit
It seems the issue affects only specific HDD drives, such as the WD 18 TB WUH721818AL5201. See comments below for details.
The Issue
The performance of `zfs send` drops significantly when using raidz layouts compared to the pool's read/write performance. To highlight the problem I use:
zfs send -L tank@1 | pv > /dev/null
The setup
10 HDD drives: 18 TB @ 260 MB/s
Here are some benchmarks to understand the problem.
Remarks:
- The `zfs receive` does not exhibit the same behavior; speeds are fine.
- This might go unnoticed on NVMe (I achieved 2 GB/s), so this might not be the bottleneck of the replication process.
- draid2 seems to be doing a little (30%) better than raidz2 in `zfs send` (see the sketch below).
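For the draid2 comparison, a pool could be created along these lines (a sketch; the group geometry and disk names are assumptions):

```
# hypothetical 10-disk draid2 layout: parity 2, 8 data disks per group, no spares
zpool create -O recordsize=1m -o ashift=12 tank draid2:8d:10c:0s hdd{01..10}
```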
Reproduce the issue
Pool setup
The test
Context
I use `zfs send` to replicate datasets. We have a few terabytes written and erased every day; it is a simple n-day backup rotation. Until now I used `mirror` setups, but when trying a `raidzx` layout I noticed very low performance with `zfs send` on HDD. Unfortunately, with such low speeds, there is no way we can replicate the data within the time window.
ZFS version
Is the issue specific to my setup? No.
I've reproduced the issue on a fresh install on another machine with a newer version of Ubuntu and ZFS.
(edit) Also reproduced the same performance numbers on a fresh install of an old Debian with an older ZFS.
(edit) Also reproduced the same performance numbers on a fresh install of FreeBSD.
(edit) Also reproduced with ZFS compiled from sources at tag 2.1.11 (following https://uptrace.dev/blog/ubuntu-install-zfs.html).
Summary
I think this is a performance bug.
Do you experience the same issue? Is there any flag to mitigate this?
Hardware details - Disks and controllers
```
# cpu
Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz

# lshw -class disk -class storage
*-sas
     description: Serial Attached SCSI controller
     product: SAS3008 PCI-Express Fusion-MPT SAS-3
     vendor: Broadcom / LSI
     physical id: 0
     bus info: pci@0000:18:00.0
     logical name: scsi0
     version: 02
     width: 64 bits
     clock: 33MHz
     capabilities: sas pm pciexpress msi msix bus_master cap_list rom
     configuration: driver=mpt3sas latency=0
     resources: irq:40 ioport:5000(size=256) memory:dae40000-dae4ffff memory:dae00000-dae3ffff memory:dad00000-dadfffff
   *-disk:0
        description: SCSI Disk
        product: WUH721818AL5201
        vendor: WDC
        physical id: 0.8.0
        bus info: scsi@0:0.8.0
        logical name: /dev/sdk
        version: B680
        serial: 4BHUUJ3V
        size: 16TiB (18TB)
        capacity: 19TiB (21TB)
        capabilities: 7200rpm gpt-1.00 partitioned partitioned:gpt
        configuration: ansiversion=7 guid=61117a9f-275b-f747-9eba-c82658a13ad5 logicalsectorsize=512 sectorsize=4096
...

# fdisk -l
Disk /dev/sdu: 16.37 TiB, 18000207937536 bytes, 35156656128 sectors
Disk model: WUH721818AL5201
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
```