malventano opened 2 years ago
Additional info: 64%, 62%, 61%, 63%, 62%, 61%
Updated to OpenZFS 2.1.5 and increased bandwidth to the backplanes to speed up troubleshooting.
Revisiting this on a new pool on a different set of drives/enclosures (different makes/models). I was able to replicate this issue by rebooting the server at 67% initialized. The pool was initializing at ~8 GB/s prior to the reboot. After the reboot it has been a steady ~300 MB/s for the past 8 hours. Something about resuming a zpool initialization is not playing nicely.
Poking at this issue further, I tried suspending and resuming the initialize and saw the same slow speed. The behavior is odd in that heavy CPU and iowait occur when first resuming the initialize, presumably while the vdev_initialize threads run a loop to count up to where the initialize left off (there is no actual disk IO during this period). After a few minutes the disk IO begins and the initialize resumes, but extremely slowly; watching disk activity, some disks are getting smaller IOs (e.g. 17 IOPS / a few dozen kB/s) while other drives are going faster (e.g. 25 IOPS / 25 MB/s). The faster speeds appear to rotate around to different disks every few seconds. During this time the CPU is fully idle and iowait is negligible. zpool iostat -vw test output confirms that no device latency exceeded 1/2 s for the entire session and most IOs were in the microsecond range, so the slowness is not coming from the drives, as was the case with the drive-firmware issue seen earlier in this thread (it turns out that issue was just coaxing this behavior out more quickly). The behavior feels like the initialize is trying to equalize progress across drives but is getting tripped up, but that's just a guess.
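The rotating per-disk behavior can be spot-checked from a single zpool status snapshot rather than watching iostat. The sketch below only parses text; the "(N% initialized, ...)" annotation matches the general shape of OpenZFS 2.x zpool status output, but the device names and percentages are hypothetical sample data, not captures from this system.

```shell
#!/bin/sh
# Sketch: measure per-disk initialize skew from one status snapshot.
# On a live system the input would come from: zpool status -v test
# The sample lines below are hypothetical.
status='sda  ONLINE  0 0 0  (61.2% initialized, started at Tue Dec 13 10:00:00 2022)
sdb  ONLINE  0 0 0  (63.9% initialized, started at Tue Dec 13 10:00:00 2022)
sdc  ONLINE  0 0 0  (61.0% initialized, started at Tue Dec 13 10:00:00 2022)'

# Pull out the per-device percentages, sort them, and report max - min.
printf '%s\n' "$status" \
  | sed -n 's/.*(\([0-9.]*\)% initialized.*/\1/p' \
  | sort -n \
  | awk 'NR==1 {min=$1} {max=$1} END {printf "skew: %.1f points\n", max-min}'
```

Running this once a minute would show whether the skew between the fastest and slowest devices is growing over time, which would support the "equalizing progress" guess above.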
Attempting to elaborate further, I did the following:
That last cancel attempt resulted in the prompt hanging for nearly an hour. The journalctl entry seen during the hang is pasted below.
...so on the one hand we have zpool initialization slowing to the point of effectively taking an indefinite time to complete if the system reboots or the initialize is otherwise interrupted, and on the other hand we have the resume attempt going sideways if the user tries to cancel the initialize too quickly (a likely outcome should the user notice the heavy resource spike).
These issues appear to be amplified with larger arrays, where the user needs to commit to 100% uptime and zero disruptions until complete, or else they risk an interrupted initialization becoming effectively impossible to complete (multiple months at the slower rate) and/or being forced to reboot and then only being able to stop the initialize after the high-CPU portion of the resume.
Dec 13 19:14:51 BB-8 kernel: INFO: task zpool:572773 blocked for more than 241 seconds.
Dec 13 19:14:51 BB-8 kernel: Tainted: P O 5.15.74-1-pve #1
Dec 13 19:14:51 BB-8 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 13 19:14:51 BB-8 kernel: task:zpool state:D stack: 0 pid:572773 ppid:265552 flags:0x00000000
Dec 13 19:14:51 BB-8 kernel: Call Trace:
Dec 13 19:14:51 BB-8 kernel: <TASK>
Dec 13 19:14:51 BB-8 kernel: __schedule+0x34e/0x1740
Dec 13 19:14:51 BB-8 kernel: ? __unfreeze_partials+0x134/0x180
Dec 13 19:14:51 BB-8 kernel: ? cpumask_next_wrap+0x33/0x90
Dec 13 19:14:51 BB-8 kernel: ? cpumask_next+0x23/0x30
Dec 13 19:14:51 BB-8 kernel: schedule+0x69/0x110
Dec 13 19:14:51 BB-8 kernel: schedule_preempt_disabled+0xe/0x20
Dec 13 19:14:51 BB-8 kernel: __mutex_lock.constprop.0+0x255/0x480
Dec 13 19:14:51 BB-8 kernel: __mutex_lock_slowpath+0x13/0x20
Dec 13 19:14:51 BB-8 kernel: mutex_lock+0x38/0x50
Dec 13 19:14:51 BB-8 kernel: spa_open_common+0x61/0x450 [zfs]
Dec 13 19:14:51 BB-8 kernel: spa_get_stats+0x5b/0x4d0 [zfs]
Dec 13 19:14:51 BB-8 kernel: ? __kmalloc_node+0x166/0x3a0
Dec 13 19:14:51 BB-8 kernel: zfs_ioc_pool_stats+0x39/0x90 [zfs]
Dec 13 19:14:51 BB-8 kernel: zfsdev_ioctl_common+0x764/0x9e0 [zfs]
Dec 13 19:14:51 BB-8 kernel: ? schedule+0x69/0x110
Dec 13 19:14:51 BB-8 kernel: ? _copy_from_user+0x2e/0x70
Dec 13 19:14:51 BB-8 kernel: zfsdev_ioctl+0x57/0xf0 [zfs]
Dec 13 19:14:51 BB-8 kernel: __x64_sys_ioctl+0x92/0xd0
Dec 13 19:14:51 BB-8 kernel: do_syscall_64+0x59/0xc0
Dec 13 19:14:51 BB-8 kernel: ? exit_to_user_mode_prepare+0x90/0x1b0
Dec 13 19:14:51 BB-8 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Dec 13 19:14:51 BB-8 kernel: ? do_syscall_64+0x69/0xc0
Dec 13 19:14:51 BB-8 kernel: ? exit_to_user_mode_prepare+0x37/0x1b0
Dec 13 19:14:51 BB-8 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Dec 13 19:14:51 BB-8 kernel: ? __x64_sys_write+0x1a/0x20
Dec 13 19:14:51 BB-8 kernel: ? do_syscall_64+0x69/0xc0
Dec 13 19:14:51 BB-8 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xcb
Dec 13 19:14:51 BB-8 kernel: RIP: 0033:0x7ff79e0915f7
Dec 13 19:14:51 BB-8 kernel: RSP: 002b:00007ffc0f8150b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Dec 13 19:14:51 BB-8 kernel: RAX: ffffffffffffffda RBX: 000055e9c246c300 RCX: 00007ff79e0915f7
Dec 13 19:14:51 BB-8 kernel: RDX: 00007ffc0f8150e0 RSI: 0000000000005a05 RDI: 0000000000000003
Dec 13 19:14:51 BB-8 kernel: RBP: 00007ffc0f8186d0 R08: 000055e9c269b090 R09: 00007ff79e16ebe0
Dec 13 19:14:51 BB-8 kernel: R10: 0000000000000080 R11: 0000000000000246 R12: 00007ffc0f8150e0
Dec 13 19:14:51 BB-8 kernel: R13: 000055e9c245d570 R14: 0000000000000000 R15: 00007ffc0f8186f4
Dec 13 19:14:51 BB-8 kernel: </TASK>
I took some time to elaborate on this issue while spinning up a new SSD pool. The freshly created pool initialized at >4 GB/s, but if that initialization was interrupted and then resumed (the cause of the interruption appears irrelevant; even a zpool initialize -s will do it), the resumed progress was limited to 600-800 MB/s. The only way to return to 4 GB/s was to destroy and re-create the pool. Watching the previously interrupted init more closely, random drives in the pool go idle for seconds at a time, and progress skews by several percent across drives. I can rule out the hardware as an issue given solid performance during normal operation: zfs recv to this pool runs at a steady 2 GB/s, bursting above 6 GB/s, even with the SSDs in uninitialized/untrimmed form.
System information
Describe the problem you're observing
If a large HDD array initialization is suspended and then resumed, the initialization speed after resume will be extremely slow.
I've reproduced this three times now. My example speeds were 3 GB/s before vs. 60 MB/s after.
Speeds on a wider pool dropped from 8GB/s to 300MB/s.
Reduced speed was observed for days after the resume (presumably indefinitely).
Nothing abnormal in logs, iowaits were fine on all disks, and the zpool was idle (no host activity).
zpool iostat -vl test 1 shows all disks to have equal stats, and 1-second totals range from 0 to ~500 MB/s. Suspending the initialize again, rebooting, and resuming results in the same slow speed.
Destroying and re-creating the zpool with no other changes (not even a reboot) returns initialization speed to 3 GB/s.
The suspend was at 48% on the first attempt; I then recreated the pool and let it reach 60% for the second attempt (to reproduce).
24+ hours later, the init had advanced by 1% on an otherwise idle system.
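To put a number on "1% in 24+ hours" without babysitting the console, one can diff two status snapshots taken a known interval apart. This is only a sketch: the snapshot strings below are hypothetical sample text in the general shape of OpenZFS 2.x zpool status output, and on a live system they would come from zpool status test.

```shell
#!/bin/sh
# Sketch: estimate resumed-initialize rate from two snapshots taken
# interval_s seconds apart. Sample strings are hypothetical; live input:
#   zpool status test | grep initialized
snap1='(61.00% initialized, started at Tue Dec 13 10:00:00 2022)'
snap2='(61.04% initialized, started at Tue Dec 13 10:00:00 2022)'
interval_s=3600

# Extract the percentage from a status line.
pct() { printf '%s\n' "$1" | sed -n 's/.*(\([0-9.]*\)% initialized.*/\1/p'; }
p1=$(pct "$snap1")
p2=$(pct "$snap2")

# ~0.04 %/hour means the remaining ~39% would take roughly 40 more days.
awk -v a="$p1" -v b="$p2" -v t="$interval_s" \
  'BEGIN { printf "%.2f %%/hour\n", (b - a) * 3600 / t }'
```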
I first stumbled upon this bug when an initialize was halted due to operator error (an HBA cable was pulled) and I noted slow progress after issuing a clear, but I then reproduced it twice by manually suspending/resuming an initialize after some time.
Note that a separate HDD firmware issue complicated earlier findings, but I have since isolated that and reproduced this on a different set/model of drives and enclosures.
Describe how to reproduce the problem
zpool create test -o ashift=12 -O compression=on -O atime=off -O xattr=sa -O recordsize=1M raidz3 ...
zpool initialize test
(allow to progress for a few hours)
zpool initialize -s test
(optional reboot)
zpool initialize test
(note that after several hours, array write speed will be significantly slower)
Include any warning/errors/backtraces from the system logs
None to report yet. I'm leaving the system in its current state (very slow init, still at 61%) in case any replies suggest steps to investigate while the issue exists. Otherwise I'll destroy/re-create with smaller HDD counts to reproduce additional, simpler configs for further investigation.
Notes
This may be related to https://github.com/openzfs/zfs/issues/11948