malventano opened 2 years ago
Additional info: 64%, 62%, 61%, 63%, 62%, 61%
Updated to OpenZFS 2.1.5 and increased bandwidth to the backplanes to speed up troubleshooting.
Revisiting this on a new pool on a different set of drives/enclosures (different makes/models). I was able to replicate this issue by rebooting the server at 67% initialized. The pool was initializing at ~8 GB/s prior to the reboot. After the reboot it has been a steady ~300 MB/s for the past 8 hours. Something about resuming a zpool initialization is not playing nicely.
Poking at this issue further, I tried suspending and resuming the initialize and saw the same slow speed. The behavior is odd in that heavy CPU and iowait occur when first resuming the initialize, presumably while the vdev_initialize threads run a loop to count up to where the initialize left off (there is no actual disk IO during this period). After a few minutes the disk IO begins and the initialize resumes, but extremely slowly; watching disk activity, some disks are getting smaller IOs (e.g. 17 IOPS / a few dozen kB/s) while other drives are going faster (e.g. 25 IOPS / 25 MB/s). The faster speeds appear to rotate around to different disks every few seconds. During this time the CPU is fully idle and iowait is negligible. zpool iostat -vw test output confirms that no device latency exceeded 1/2 s for the entire session and most IOs were in the microsecond range, so the slowness is not coming from the drives, as was the case with the drive-firmware issue seen earlier in this thread (it turns out that issue was just coaxing this behavior out more quickly). The behavior feels like the initialize is trying to equalize progress across drives but is getting tripped up, but that's just a guess.
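The rotating per-disk behavior can be spot-checked from a single zpool status snapshot rather than watching iostat. The sketch below only parses text; the "(N% initialized, ...)" annotation matches the general shape of OpenZFS 2.x zpool status output, but the device names and percentages are hypothetical sample data, not captures from this system.

```shell
#!/bin/sh
# Sketch: measure per-disk initialize skew from one status snapshot.
# On a live system the input would come from: zpool status -v test
# The sample lines below are hypothetical.
status='sda  ONLINE  0 0 0  (61.2% initialized, started at Tue Dec 13 10:00:00 2022)
sdb  ONLINE  0 0 0  (63.9% initialized, started at Tue Dec 13 10:00:00 2022)
sdc  ONLINE  0 0 0  (61.0% initialized, started at Tue Dec 13 10:00:00 2022)'

# Pull out the per-device percentages, sort them, and report max - min.
printf '%s\n' "$status" \
  | sed -n 's/.*(\([0-9.]*\)% initialized.*/\1/p' \
  | sort -n \
  | awk 'NR==1 {min=$1} {max=$1} END {printf "skew: %.1f points\n", max-min}'
```

Running this once a minute would show whether the skew between the fastest and slowest devices is growing over time, which would support the "equalizing progress" guess above.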
Attempting to elaborate further, I did the following:
That last cancel attempt resulted in the prompt hanging for nearly an hour. The journalctl entry seen during the hang is pasted below.
...so on the one hand we have zpool initialization slowing to the point of effectively taking an indefinite time to complete if the system reboots or the initialize is otherwise interrupted, and on the other hand we have the resume attempt going sideways if the user tries to cancel the initialize too quickly (a likely outcome should the user notice the heavy resource spike).
These issues appear to be amplified with larger arrays, where the user needs to commit to 100% uptime and zero disruptions until complete, or else they risk an interrupted initialization becoming effectively impossible to complete (multiple months at the slower rate) and/or being forced to reboot and then only being able to stop the initialize after the high-CPU portion of the resume.
Dec 13 19:14:51 BB-8 kernel: INFO: task zpool:572773 blocked for more than 241 seconds.
Dec 13 19:14:51 BB-8 kernel: Tainted: P O 5.15.74-1-pve #1
Dec 13 19:14:51 BB-8 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 13 19:14:51 BB-8 kernel: task:zpool state:D stack: 0 pid:572773 ppid:265552 flags:0x00000000
Dec 13 19:14:51 BB-8 kernel: Call Trace:
Dec 13 19:14:51 BB-8 kernel: <TASK>
Dec 13 19:14:51 BB-8 kernel: __schedule+0x34e/0x1740
Dec 13 19:14:51 BB-8 kernel: ? __unfreeze_partials+0x134/0x180
Dec 13 19:14:51 BB-8 kernel: ? cpumask_next_wrap+0x33/0x90
Dec 13 19:14:51 BB-8 kernel: ? cpumask_next+0x23/0x30
Dec 13 19:14:51 BB-8 kernel: schedule+0x69/0x110
Dec 13 19:14:51 BB-8 kernel: schedule_preempt_disabled+0xe/0x20
Dec 13 19:14:51 BB-8 kernel: __mutex_lock.constprop.0+0x255/0x480
Dec 13 19:14:51 BB-8 kernel: __mutex_lock_slowpath+0x13/0x20
Dec 13 19:14:51 BB-8 kernel: mutex_lock+0x38/0x50
Dec 13 19:14:51 BB-8 kernel: spa_open_common+0x61/0x450 [zfs]
Dec 13 19:14:51 BB-8 kernel: spa_get_stats+0x5b/0x4d0 [zfs]
Dec 13 19:14:51 BB-8 kernel: ? __kmalloc_node+0x166/0x3a0
Dec 13 19:14:51 BB-8 kernel: zfs_ioc_pool_stats+0x39/0x90 [zfs]
Dec 13 19:14:51 BB-8 kernel: zfsdev_ioctl_common+0x764/0x9e0 [zfs]
Dec 13 19:14:51 BB-8 kernel: ? schedule+0x69/0x110
Dec 13 19:14:51 BB-8 kernel: ? _copy_from_user+0x2e/0x70
Dec 13 19:14:51 BB-8 kernel: zfsdev_ioctl+0x57/0xf0 [zfs]
Dec 13 19:14:51 BB-8 kernel: __x64_sys_ioctl+0x92/0xd0
Dec 13 19:14:51 BB-8 kernel: do_syscall_64+0x59/0xc0
Dec 13 19:14:51 BB-8 kernel: ? exit_to_user_mode_prepare+0x90/0x1b0
Dec 13 19:14:51 BB-8 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Dec 13 19:14:51 BB-8 kernel: ? do_syscall_64+0x69/0xc0
Dec 13 19:14:51 BB-8 kernel: ? exit_to_user_mode_prepare+0x37/0x1b0
Dec 13 19:14:51 BB-8 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Dec 13 19:14:51 BB-8 kernel: ? __x64_sys_write+0x1a/0x20
Dec 13 19:14:51 BB-8 kernel: ? do_syscall_64+0x69/0xc0
Dec 13 19:14:51 BB-8 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xcb
Dec 13 19:14:51 BB-8 kernel: RIP: 0033:0x7ff79e0915f7
Dec 13 19:14:51 BB-8 kernel: RSP: 002b:00007ffc0f8150b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Dec 13 19:14:51 BB-8 kernel: RAX: ffffffffffffffda RBX: 000055e9c246c300 RCX: 00007ff79e0915f7
Dec 13 19:14:51 BB-8 kernel: RDX: 00007ffc0f8150e0 RSI: 0000000000005a05 RDI: 0000000000000003
Dec 13 19:14:51 BB-8 kernel: RBP: 00007ffc0f8186d0 R08: 000055e9c269b090 R09: 00007ff79e16ebe0
Dec 13 19:14:51 BB-8 kernel: R10: 0000000000000080 R11: 0000000000000246 R12: 00007ffc0f8150e0
Dec 13 19:14:51 BB-8 kernel: R13: 000055e9c245d570 R14: 0000000000000000 R15: 00007ffc0f8186f4
Dec 13 19:14:51 BB-8 kernel: </TASK>
I took some time to elaborate on this issue while spinning up a new SSD pool. The freshly created pool initialized at >4 GB/s, but if that initialization was interrupted and then resumed (the cause of the interruption appears irrelevant; even a zpool initialize -s will do it), the resumed progress was limited to 600-800 MB/s. The only way to return to 4 GB/s was to destroy and re-create the pool. Watching the previously interrupted init more closely, random drives in the pool go idle for seconds at a time, and progress skews by several percent across drives. I can rule out the hardware as an issue given solid performance during normal operation: zfs recv to this pool runs at a steady 2 GB/s, bursting above 6 GB/s, even with the SSDs in uninitialized/untrimmed form.
System information
Describe the problem you're observing
If a large HDD array initialization is suspended and then resumed, the initialization speed after resume will be extremely slow.
I've reproduced this three times now. My example speeds were 3 GB/s before vs. 60 MB/s after.
Speeds on a wider pool dropped from 8GB/s to 300MB/s.
Reduced speed was observed for days after the resume (presumably indefinitely).
Nothing abnormal in logs, iowaits were fine on all disks, and the zpool was idle (no host activity).
zpool iostat -vl test 1 shows all disks to have equal stats, and 1-second totals range from 0 to ~500 MB/s. Suspending the initialize again, rebooting, and resuming results in the same slow speed.
Destroying and re-creating the zpool with no other changes (not even a reboot) returns initialization speed to 3 GB/s.
The suspend was at 48% on the first attempt; I then recreated the pool and let it reach 60% for the second attempt (to reproduce).
24+ hours later, the init had advanced by 1% on an otherwise idle system.
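To put a number on "1% in 24+ hours" without babysitting the console, one can diff two status snapshots taken a known interval apart. This is only a sketch: the snapshot strings below are hypothetical sample text in the general shape of OpenZFS 2.x zpool status output, and on a live system they would come from zpool status test.

```shell
#!/bin/sh
# Sketch: estimate resumed-initialize rate from two snapshots taken
# interval_s seconds apart. Sample strings are hypothetical; live input:
#   zpool status test | grep initialized
snap1='(61.00% initialized, started at Tue Dec 13 10:00:00 2022)'
snap2='(61.04% initialized, started at Tue Dec 13 10:00:00 2022)'
interval_s=3600

# Extract the percentage from a status line.
pct() { printf '%s\n' "$1" | sed -n 's/.*(\([0-9.]*\)% initialized.*/\1/p'; }
p1=$(pct "$snap1")
p2=$(pct "$snap2")

# ~0.04 %/hour means the remaining ~39% would take roughly 40 more days.
awk -v a="$p1" -v b="$p2" -v t="$interval_s" \
  'BEGIN { printf "%.2f %%/hour\n", (b - a) * 3600 / t }'
```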
I first stumbled upon this bug when an initialize was halted due to operator error (an HBA cable was pulled) and I noted slow progress after issuing a clear, but I then reproduced it twice by manually suspending/resuming an initialize after some time.
Note that a separate HDD firmware issue complicated earlier findings, but I have since isolated that and reproduced this on a different set/model of drives and enclosures.
Describe how to reproduce the problem
zpool create test -o ashift=12 -O compression=on -O atime=off -O xattr=sa -O recordsize=1M raidz3 ...
zpool initialize test
(allow to progress for a few hours)
zpool initialize -s test
(optional reboot)
zpool initialize test
(note that after several hours, array write speed will be significantly slower)
Include any warning/errors/backtraces from the system logs
None to report yet. I'm leaving the system in its current state (very slow init, still at 61%) in case any replies suggest steps to investigate while the issue exists. Otherwise I'll destroy/re-create with smaller HDD counts to reproduce additional, simpler configs for further investigation.
Notes
This may be related to https://github.com/openzfs/zfs/issues/11948