openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

zpool initialize speed is significantly slower after initialize suspend / resume #13700

Open malventano opened 2 years ago

malventano commented 2 years ago

System information

Type                  Version/Name
Distribution Name     Proxmox VE
Distribution Version  7.3-3
Kernel Version        5.15.74-1-pve
Architecture          x86_64
OpenZFS Version       2.1.6-pve1

Describe the problem you're observing

If a large HDD array initialization is suspended and then resumed, the initialization speed after resume will be extremely slow.

Describe how to reproduce the problem

zpool create test -o ashift=12 -O compression=on -O atime=off -O xattr=sa -O recordsize=1M raidz3 ...
zpool initialize test       (allow to progress for a few hours)
zpool initialize -s test    (optional reboot)
zpool initialize test       (note that after several hours, array write speed will be significantly slower)
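
A quick way to quantify the "significantly slower" observation while reproducing (standard zpool commands; "test" is the pool name from the steps above):

zpool status test        # shows per-vdev initialize state and percent complete
zpool iostat -v test 5   # per-vdev write bandwidth, refreshed every 5 seconds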

Include any warning/errors/backtraces from the system logs

None to report yet. I'm leaving the system in its current state (very slow init, still at 61%) in case any replies suggest further troubleshooting steps while the issue is present. Otherwise I'll destroy/recreate the pool with smaller HDD counts to reproduce the issue on additional/simpler configs for further investigation.

Notes

This may be related to https://github.com/openzfs/zfs/issues/11948

malventano commented 2 years ago

Additional info:

malventano commented 2 years ago

Updated to OpenZFS 2.1.5 and increased bandwidth to the backplanes to speed up troubleshooting.

malventano commented 1 year ago

Revisiting this on a new pool with a different set of drives/enclosures (different makes/models). I was able to replicate this issue by rebooting the server at 67% initialized. The pool was initializing at ~8 GB/s prior to the reboot; after the reboot it has held steady at ~300 MB/s for the past 8 hours. Something about resuming a zpool initialization is not playing nicely.
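
For reference, the before/after rates quoted here can be sampled as a pool-wide average (standard zpool iostat usage; the first report is an average since import, so the second report is the one to read):

zpool iostat test 60 2   # second report = average bandwidth over the last 60 seconds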

malventano commented 1 year ago

Poking at this issue further, I tried suspending/resuming the initialize a few more times and saw the same slow speed. The behavior is odd: lots of CPU + iowait occurs when first resuming the initialize, presumably while the vdev_initialize threads run some loop to count up to where the initialize left off (there is no actual disk IO during this period). After a few minutes the disk IO begins and the initialize resumes, but extremely slowly. Watching disk activity, some disks receive smaller IOs (e.g. 17 IOPS / a few dozen kB/s) while other drives go faster (e.g. 25 IOPS / 25 MB/s), and the faster speeds appear to rotate around to different disks every few seconds. During this time the CPU is fully idle and iowait is negligible. zpool iostat -vw test output confirms that no device latencies exceeded 1/2 s (for the entire session) and most IOs were in the microsecond range, so the slowness is not coming from the drives, as was the case with the FW-issue drives seen earlier in this thread (it turns out that issue was just coaxing this behavior out more quickly). The behavior feels like something about the initialize is trying to equalize progress across drives but is getting tripped up, but that's just a guess.
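
The per-device numbers above can be reproduced with the latency histograms already mentioned plus ordinary per-disk statistics (iostat from sysstat is an assumption here; the comment only says "watching disk activity"):

zpool iostat -vw test   # per-device latency histograms for the session
iostat -x 1             # per-disk IOPS and kB/s, to spot the slow vs. "rotating" fast drives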

Attempting to investigate further, I did the following:

That last cancel attempt left the command prompt hung for nearly an hour. The journalctl entry seen during the hang is pasted below.

...so on one hand, zpool initialization slows to the point of effectively never completing if the system reboots or the initialize is otherwise interrupted; on the other hand, something about the resume goes sideways if the user attempts to cancel the initialize too quickly (a likely reaction should the user notice the heavy resource spike).

These issues appear to be amplified on larger arrays, where the user must commit to 100% uptime and zero disruptions until the initialize completes, or else risk an interrupted initialization becoming effectively impossible to finish (multiple months at the slower rate) and/or being forced to reboot and then only being able to stop the initialize after the high-CPU portion of the resume has passed.
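
For anyone else hitting the cancel hang, the sequence implied above is roughly the following (a sketch based on this report, not a verified workaround):

zpool initialize test      # resume after the interruption
# wait for the CPU/iowait spike to subside and for disk IO to actually start
zpool initialize -c test   # cancelling before that point is what hung the zpool command here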

Dec 13 19:14:51 BB-8 kernel: INFO: task zpool:572773 blocked for more than 241 seconds.
Dec 13 19:14:51 BB-8 kernel:       Tainted: P           O      5.15.74-1-pve #1
Dec 13 19:14:51 BB-8 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 13 19:14:51 BB-8 kernel: task:zpool           state:D stack:    0 pid:572773 ppid:265552 flags:0x00000000
Dec 13 19:14:51 BB-8 kernel: Call Trace:
Dec 13 19:14:51 BB-8 kernel:  <TASK>
Dec 13 19:14:51 BB-8 kernel:  __schedule+0x34e/0x1740
Dec 13 19:14:51 BB-8 kernel:  ? __unfreeze_partials+0x134/0x180
Dec 13 19:14:51 BB-8 kernel:  ? cpumask_next_wrap+0x33/0x90
Dec 13 19:14:51 BB-8 kernel:  ? cpumask_next+0x23/0x30
Dec 13 19:14:51 BB-8 kernel:  schedule+0x69/0x110
Dec 13 19:14:51 BB-8 kernel:  schedule_preempt_disabled+0xe/0x20
Dec 13 19:14:51 BB-8 kernel:  __mutex_lock.constprop.0+0x255/0x480
Dec 13 19:14:51 BB-8 kernel:  __mutex_lock_slowpath+0x13/0x20
Dec 13 19:14:51 BB-8 kernel:  mutex_lock+0x38/0x50
Dec 13 19:14:51 BB-8 kernel:  spa_open_common+0x61/0x450 [zfs]
Dec 13 19:14:51 BB-8 kernel:  spa_get_stats+0x5b/0x4d0 [zfs]
Dec 13 19:14:51 BB-8 kernel:  ? __kmalloc_node+0x166/0x3a0
Dec 13 19:14:51 BB-8 kernel:  zfs_ioc_pool_stats+0x39/0x90 [zfs]
Dec 13 19:14:51 BB-8 kernel:  zfsdev_ioctl_common+0x764/0x9e0 [zfs]
Dec 13 19:14:51 BB-8 kernel:  ? schedule+0x69/0x110
Dec 13 19:14:51 BB-8 kernel:  ? _copy_from_user+0x2e/0x70
Dec 13 19:14:51 BB-8 kernel:  zfsdev_ioctl+0x57/0xf0 [zfs]
Dec 13 19:14:51 BB-8 kernel:  __x64_sys_ioctl+0x92/0xd0
Dec 13 19:14:51 BB-8 kernel:  do_syscall_64+0x59/0xc0
Dec 13 19:14:51 BB-8 kernel:  ? exit_to_user_mode_prepare+0x90/0x1b0
Dec 13 19:14:51 BB-8 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Dec 13 19:14:51 BB-8 kernel:  ? do_syscall_64+0x69/0xc0
Dec 13 19:14:51 BB-8 kernel:  ? exit_to_user_mode_prepare+0x37/0x1b0
Dec 13 19:14:51 BB-8 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Dec 13 19:14:51 BB-8 kernel:  ? __x64_sys_write+0x1a/0x20
Dec 13 19:14:51 BB-8 kernel:  ? do_syscall_64+0x69/0xc0
Dec 13 19:14:51 BB-8 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Dec 13 19:14:51 BB-8 kernel: RIP: 0033:0x7ff79e0915f7
Dec 13 19:14:51 BB-8 kernel: RSP: 002b:00007ffc0f8150b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Dec 13 19:14:51 BB-8 kernel: RAX: ffffffffffffffda RBX: 000055e9c246c300 RCX: 00007ff79e0915f7
Dec 13 19:14:51 BB-8 kernel: RDX: 00007ffc0f8150e0 RSI: 0000000000005a05 RDI: 0000000000000003
Dec 13 19:14:51 BB-8 kernel: RBP: 00007ffc0f8186d0 R08: 000055e9c269b090 R09: 00007ff79e16ebe0
Dec 13 19:14:51 BB-8 kernel: R10: 0000000000000080 R11: 0000000000000246 R12: 00007ffc0f8150e0
Dec 13 19:14:51 BB-8 kernel: R13: 000055e9c245d570 R14: 0000000000000000 R15: 00007ffc0f8186f4
Dec 13 19:14:51 BB-8 kernel:  </TASK>

malventano commented 1 year ago

I took some time to investigate this issue further while spinning up a new SSD pool. The freshly created pool initialized at >4 GB/s, but if that initialization was interrupted and then resumed (the cause of the interruption appears irrelevant; even a zpool initialize -s will do it), the resumed progress was limited to 600-800 MB/s. The only way to return to 4 GB/s was to destroy and re-create the pool. Watching the previously interrupted init more closely, random drives in the pool were going idle for seconds at a time, and progress was skewing by several percent across the drives. I can rule out the hardware, given solid performance during normal operation: zfs recv to this pool runs at a steady 2 GB/s, bursting to >6 GB/s, even with the SSDs in uninitialized/untrimmed form.
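
A rough way to observe the idle drives and per-vdev skew described above (the exact zpool status wording may vary between OpenZFS releases):

zpool status test | grep -i initialized   # per-vdev progress; the skew shows up as diverging percentages
watch -n 10 zpool status test             # drives sitting idle stop advancing relative to the others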