openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

~30% performance degradation for ZFS vs. non-ZFS for large file transfer #14346

Open kyle0r opened 1 year ago

kyle0r commented 1 year ago

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Proxmox (Debian) |
| Distribution Version | 7.3-3 (Debian 10 bullseye) |
| Kernel Version | 5.15.74-1-pve |
| Architecture | x86_64 |
| OpenZFS Version | zfs-2.1.6-pve1 |

Describe the problem you're observing

I was migrating XFS disk partitions from full disk (no ZFS) to XFS raw disk images stored on ZFS datasets... both virtual disks were provisioned to a KVM guest via virtio_blk.

I spotted some strange performance issues and ended up doing a bunch of testing/research to try and figure it out.

In my testing I'm witnessing a fairly remarkable and repeatable ~30% performance degradation pattern visible in the netdata graphs, and transfer runtime. IO size remains constant but IOPS and IO bandwidth drop off significantly when compared to tests without ZFS.

This ~30% degradation applies to ZFS single disk, mirror or striped pool. I tested all 3 configs.

Scenario: My testing primarily measures the transfer (seq write) and checksumming (seq read) of a 2.61TiB snapraid parity file between two disks. It's a single large file stored on the XFS file system. The tests run inside a kvm. The physical disks are 5TB 2.5" 5600 RPM SMR Seagate Barracudas. For the ZFS tests, the XFS file system is stored on a raw disk image on a zfs dataset and provisioned to the kvm via virtio_blk.

I'm suspicious the root cause of the degradation could be related to #9130, because that issue reproduced in test #4 - but that might also be a red herring.

Here are some graphs for illustration of the degradation:

write to xfs partition - no zfs: image

write to xfs partition stored on zfs dataset: image

read from xfs partition - no zfs: image

checksum read from partition stored on zfs dataset: image

I've published my related research here, with all the details, raw results and graphs. It feels like too much information to re-host in this issue.

Overall conclusion(s)

  1. Ironically Test #1 was the best performing OpenZFS result and all attempts to improve the results were unsuccessful. 😢
  2. There is a fairly remarkable and repeatable performance degradation pattern visible in the netdata graphs for the OpenZFS tests. IO size remains constant but IOPS and IO bandwidth drop off significantly when compared to test #3 without ZFS.
  3. For write bw, the 2x striped zpool in test #10 was still slower than the single-disk non-zfs test #3.
  4. Intel 900P slog doesn’t help to stabilise or mitigate the issue.
  5. Test #3 demonstrates the kvm virtio_blk can handle at least 121 and 125 MiB/s seq writes and reads on these disks, i.e. kvm and virtio_blk overhead is unlikely to be causing the performance degradation or bottlenecks.
  6. Test #15 demonstrates that virtio_scsi is not faster than virtio_blk and likely has more overhead.
  7. These sequential IO tests have demonstrated that for this hardware and workload OpenZFS has an overhead/cost of ~30% IO bw performance vs. the non-ZFS tests. This ~30% degradation applies to ZFS single disk, mirror or striped pool.
  8. The known OpenZFS issue #9130 “task txg_sync blocked for more than 120 seconds” was reproduced in Test #4. Having met this issue in the past, I’m suspicious the root cause of #9130 may be related to/and or causing the IO degradation observed in these tests?
  9. The striped pool test #10 demonstrated fairly consistent IO bandwidth on both disks during the read and write tests. The obvious degradation in the single vdev tests was not visible in the graphs - however overall the performance was still ~30% under the physical maximums of the disks. Question: why does the IO degradation pattern seem to disappear in this test but not others? Question: why is there a ~30% overhead for ZFS?
  10. Dataset compression on or off doesn’t appear to have a negative impact on performance. In fact testing suggests for this sequential workload compression probably helps performance.
  11. Dataset encryption on or off doesn’t appear to have a negative impact on performance.
  12. Dataset checksum on or off doesn’t appear to have a significant impact on performance.
  13. zvol performance shared a familiar symmetry with the dataset tests AND incurs a ~4.6x multiplier (~360% increase) in system load avg. This is consistent with my previous testing of zvols and the known OpenZFS issue #11407.
  14. For this workload/test there doesn't appear to have been any significant change/benefit in upgrading the hypervisor to the latest packages/kernels: pve 7.1-10 with kernel 5.13.19-4-pve and zfs-2.1.2-pve1 vs. pve 7.3-3 with kernel 5.15.74-1-pve and zfs-2.1.6-pve1.
  15. There doesn't appear to have been any significant benefit to changing the zfs recordsize from the default 128K to 256K to match the snapraid default parity block size. For this workload it seems ZFS performs better when recordsize is left at the default.
  16. Test #1 (parity file 1 of 3) and Test #17 (parity file 2 of 3) suggest the issue is not specific to one file.

test result summary

image

image

Describe how to reproduce the problem

Transfer (e.g. rsync) a >2TiB file to a virtual disk backed by a raw disk image on a zfs dataset and provisioned via virtio_blk. Details of my hardware, config, test cases and commands can be found in my research here.
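For illustration, a minimal sketch of the kind of setup and transfer involved - the pool, dataset, image and device names below are hypothetical placeholders; the exact commands I used are in the linked research:

# hypervisor: single-disk pool + dataset holding the raw disk image
zpool create tank /dev/disk/by-id/<disk-id>
zfs create tank/images
qemu-img create -f raw /tank/images/vm-disk.raw 4T
# attach /tank/images/vm-disk.raw to the guest as a virtio_blk disk via the hypervisor,
# then inside the guest: format with XFS, mount, and run the transfer
mkfs.xfs /dev/vdb && mount /dev/vdb /mnt/dst
time rsync --progress /mnt/src/parity.file /mnt/dst/   # seq write test
time sha256sum /mnt/dst/parity.file                    # seq read (checksum) test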

Include any warning/errors/backtraces from the system logs

Test #4 reproduced OpenZFS issue #9130 “task txg_sync blocked for more than 120 seconds”. Here are the related logs:

Nov 26 08:48:18 viper kernel: INFO: task txg_sync:447490 blocked for more than 120 seconds.
Nov 26 08:48:18 viper kernel:       Tainted: P           O      5.13.19-4-pve #1
Nov 26 08:48:18 viper kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 26 08:48:18 viper kernel: task:txg_sync        state:D stack:    0 pid:447490 ppid:     2 flags:0x00004000
Nov 26 08:48:18 viper kernel: Call Trace:
Nov 26 08:48:18 viper kernel:  <TASK>
Nov 26 08:48:18 viper kernel:  __schedule+0x2fa/0x910
Nov 26 08:48:18 viper kernel:  schedule+0x4f/0xc0
Nov 26 08:48:18 viper kernel:  schedule_timeout+0x8a/0x140
Nov 26 08:48:18 viper kernel:  ? __bpf_trace_tick_stop+0x10/0x10
Nov 26 08:48:18 viper kernel:  io_schedule_timeout+0x51/0x80
Nov 26 08:48:18 viper kernel:  __cv_timedwait_common+0x131/0x170 [spl]
Nov 26 08:48:18 viper kernel:  ? wait_woken+0x80/0x80
Nov 26 08:48:18 viper kernel:  __cv_timedwait_io+0x19/0x20 [spl]
Nov 26 08:48:18 viper kernel:  zio_wait+0x133/0x2c0 [zfs]
Nov 26 08:48:18 viper kernel:  dsl_pool_sync+0xcc/0x4f0 [zfs]
Nov 26 08:48:18 viper kernel:  spa_sync+0x55a/0xff0 [zfs]
Nov 26 08:48:18 viper kernel:  ? spa_txg_history_init_io+0x106/0x110 [zfs]
Nov 26 08:48:18 viper kernel:  txg_sync_thread+0x2d3/0x460 [zfs]
Nov 26 08:48:18 viper kernel:  ? txg_init+0x260/0x260 [zfs]
Nov 26 08:48:18 viper kernel:  thread_generic_wrapper+0x79/0x90 [spl]
Nov 26 08:48:18 viper kernel:  ? __thread_exit+0x20/0x20 [spl]
Nov 26 08:48:18 viper kernel:  kthread+0x12b/0x150
Nov 26 08:48:18 viper kernel:  ? set_kthread_struct+0x50/0x50
Nov 26 08:48:18 viper kernel:  ret_from_fork+0x22/0x30
Nov 26 08:48:18 viper kernel:  </TASK>
Nov 26 08:58:22 viper kernel: INFO: task txg_sync:447490 blocked for more than 120 seconds.
Nov 26 08:58:22 viper kernel:       Tainted: P           O      5.13.19-4-pve #1
Nov 26 08:58:22 viper kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 26 08:58:22 viper kernel: task:txg_sync        state:D stack:    0 pid:447490 ppid:     2 flags:0x00004000
Nov 26 08:58:22 viper kernel: Call Trace:
Nov 26 08:58:22 viper kernel:  <TASK>
Nov 26 08:58:22 viper kernel:  __schedule+0x2fa/0x910
Nov 26 08:58:22 viper kernel:  schedule+0x4f/0xc0
Nov 26 08:58:22 viper kernel:  schedule_timeout+0x8a/0x140
Nov 26 08:58:22 viper kernel:  ? __bpf_trace_tick_stop+0x10/0x10
Nov 26 08:58:22 viper kernel:  io_schedule_timeout+0x51/0x80
Nov 26 08:58:22 viper kernel:  __cv_timedwait_common+0x131/0x170 [spl]
Nov 26 08:58:22 viper kernel:  ? wait_woken+0x80/0x80
Nov 26 08:58:22 viper kernel:  __cv_timedwait_io+0x19/0x20 [spl]
Nov 26 08:58:22 viper kernel:  zio_wait+0x133/0x2c0 [zfs]
Nov 26 08:58:22 viper kernel:  dsl_pool_sync+0xcc/0x4f0 [zfs]
Nov 26 08:58:22 viper kernel:  spa_sync+0x55a/0xff0 [zfs]
Nov 26 08:58:22 viper kernel:  ? spa_txg_history_init_io+0x106/0x110 [zfs]
Nov 26 08:58:22 viper kernel:  txg_sync_thread+0x2d3/0x460 [zfs]
Nov 26 08:58:22 viper kernel:  ? txg_init+0x260/0x260 [zfs]
Nov 26 08:58:22 viper kernel:  thread_generic_wrapper+0x79/0x90 [spl]
Nov 26 08:58:22 viper kernel:  ? __thread_exit+0x20/0x20 [spl]
Nov 26 08:58:22 viper kernel:  kthread+0x12b/0x150
Nov 26 08:58:22 viper kernel:  ? set_kthread_struct+0x50/0x50
Nov 26 08:58:22 viper kernel:  ret_from_fork+0x22/0x30
Nov 26 08:58:22 viper kernel:  </TASK>
Nov 26 09:10:27 viper kernel: INFO: task txg_sync:447490 blocked for more than 120 seconds.
Nov 26 09:10:27 viper kernel:       Tainted: P           O      5.13.19-4-pve #1
Nov 26 09:10:27 viper kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 26 09:10:27 viper kernel: task:txg_sync        state:D stack:    0 pid:447490 ppid:     2 flags:0x00004000
Nov 26 09:10:27 viper kernel: Call Trace:
Nov 26 09:10:27 viper kernel:  <TASK>
Nov 26 09:10:27 viper kernel:  __schedule+0x2fa/0x910
Nov 26 09:10:27 viper kernel:  schedule+0x4f/0xc0
Nov 26 09:10:27 viper kernel:  schedule_timeout+0x8a/0x140
Nov 26 09:10:27 viper kernel:  ? __bpf_trace_tick_stop+0x10/0x10
Nov 26 09:10:27 viper kernel:  io_schedule_timeout+0x51/0x80
Nov 26 09:10:27 viper kernel:  __cv_timedwait_common+0x131/0x170 [spl]
Nov 26 09:10:27 viper kernel:  ? wait_woken+0x80/0x80
Nov 26 09:10:27 viper kernel:  __cv_timedwait_io+0x19/0x20 [spl]
Nov 26 09:10:27 viper kernel:  zio_wait+0x133/0x2c0 [zfs]
Nov 26 09:10:27 viper kernel:  dsl_pool_sync+0xcc/0x4f0 [zfs]
Nov 26 09:10:27 viper kernel:  spa_sync+0x55a/0xff0 [zfs]
Nov 26 09:10:27 viper kernel:  ? spa_txg_history_init_io+0x106/0x110 [zfs]
Nov 26 09:10:27 viper kernel:  txg_sync_thread+0x2d3/0x460 [zfs]
Nov 26 09:10:27 viper kernel:  ? txg_init+0x260/0x260 [zfs]
Nov 26 09:10:27 viper kernel:  thread_generic_wrapper+0x79/0x90 [spl]
Nov 26 09:10:27 viper kernel:  ? __thread_exit+0x20/0x20 [spl]
Nov 26 09:10:27 viper kernel:  kthread+0x12b/0x150
Nov 26 09:10:27 viper kernel:  ? set_kthread_struct+0x50/0x50
Nov 26 09:10:27 viper kernel:  ret_from_fork+0x22/0x30
Nov 26 09:10:27 viper kernel:  </TASK>
Nov 26 09:22:32 viper kernel: INFO: task txg_sync:447490 blocked for more than 120 seconds.
Nov 26 09:22:32 viper kernel:       Tainted: P           O      5.13.19-4-pve #1
Nov 26 09:22:32 viper kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 26 09:22:32 viper kernel: task:txg_sync        state:D stack:    0 pid:447490 ppid:     2 flags:0x00004000
Nov 26 09:22:32 viper kernel: Call Trace:
Nov 26 09:22:32 viper kernel:  <TASK>
Nov 26 09:22:32 viper kernel:  __schedule+0x2fa/0x910
Nov 26 09:22:32 viper kernel:  schedule+0x4f/0xc0
Nov 26 09:22:32 viper kernel:  schedule_timeout+0x8a/0x140
Nov 26 09:22:32 viper kernel:  ? __bpf_trace_tick_stop+0x10/0x10
Nov 26 09:22:32 viper kernel:  io_schedule_timeout+0x51/0x80
Nov 26 09:22:32 viper kernel:  __cv_timedwait_common+0x131/0x170 [spl]
Nov 26 09:22:32 viper kernel:  ? wait_woken+0x80/0x80
Nov 26 09:22:32 viper kernel:  __cv_timedwait_io+0x19/0x20 [spl]
Nov 26 09:22:32 viper kernel:  zio_wait+0x133/0x2c0 [zfs]
Nov 26 09:22:32 viper kernel:  dsl_pool_sync+0xcc/0x4f0 [zfs]
Nov 26 09:22:32 viper kernel:  spa_sync+0x55a/0xff0 [zfs]
Nov 26 09:22:32 viper kernel:  ? spa_txg_history_init_io+0x106/0x110 [zfs]
Nov 26 09:22:32 viper kernel:  txg_sync_thread+0x2d3/0x460 [zfs]
Nov 26 09:22:32 viper kernel:  ? txg_init+0x260/0x260 [zfs]
Nov 26 09:22:32 viper kernel:  thread_generic_wrapper+0x79/0x90 [spl]
Nov 26 09:22:32 viper kernel:  ? __thread_exit+0x20/0x20 [spl]
Nov 26 09:22:32 viper kernel:  kthread+0x12b/0x150
Nov 26 09:22:32 viper kernel:  ? set_kthread_struct+0x50/0x50
Nov 26 09:22:32 viper kernel:  ret_from_fork+0x22/0x30
Nov 26 09:22:32 viper kernel:  </TASK>
Nov 26 09:48:43 viper kernel: INFO: task txg_sync:447490 blocked for more than 120 seconds.
Nov 26 09:48:43 viper kernel:       Tainted: P           O      5.13.19-4-pve #1
Nov 26 09:48:43 viper kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 26 09:48:43 viper kernel: task:txg_sync        state:D stack:    0 pid:447490 ppid:     2 flags:0x00004000
Nov 26 09:48:43 viper kernel: Call Trace:
Nov 26 09:48:43 viper kernel:  <TASK>
Nov 26 09:48:43 viper kernel:  __schedule+0x2fa/0x910
Nov 26 09:48:43 viper kernel:  schedule+0x4f/0xc0
Nov 26 09:48:43 viper kernel:  schedule_timeout+0x8a/0x140
Nov 26 09:48:43 viper kernel:  ? __bpf_trace_tick_stop+0x10/0x10
Nov 26 09:48:43 viper kernel:  io_schedule_timeout+0x51/0x80
Nov 26 09:48:43 viper kernel:  __cv_timedwait_common+0x131/0x170 [spl]
Nov 26 09:48:43 viper kernel:  ? wait_woken+0x80/0x80
Nov 26 09:48:43 viper kernel:  __cv_timedwait_io+0x19/0x20 [spl]
Nov 26 09:48:43 viper kernel:  zio_wait+0x133/0x2c0 [zfs]
Nov 26 09:48:43 viper kernel:  dsl_pool_sync+0xcc/0x4f0 [zfs]
Nov 26 09:48:43 viper kernel:  spa_sync+0x55a/0xff0 [zfs]
Nov 26 09:48:43 viper kernel:  ? spa_txg_history_init_io+0x106/0x110 [zfs]
Nov 26 09:48:43 viper kernel:  txg_sync_thread+0x2d3/0x460 [zfs]
Nov 26 09:48:43 viper kernel:  ? txg_init+0x260/0x260 [zfs]
Nov 26 09:48:43 viper kernel:  thread_generic_wrapper+0x79/0x90 [spl]
Nov 26 09:48:43 viper kernel:  ? __thread_exit+0x20/0x20 [spl]
Nov 26 09:48:43 viper kernel:  kthread+0x12b/0x150
Nov 26 09:48:43 viper kernel:  ? set_kthread_struct+0x50/0x50
Nov 26 09:48:43 viper kernel:  ret_from_fork+0x22/0x30
Nov 26 09:48:43 viper kernel:  </TASK>
Nov 26 10:04:49 viper kernel: INFO: task txg_sync:447490 blocked for more than 120 seconds.
Nov 26 10:04:49 viper kernel:       Tainted: P           O      5.13.19-4-pve #1
Nov 26 10:04:49 viper kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 26 10:04:49 viper kernel: task:txg_sync        state:D stack:    0 pid:447490 ppid:     2 flags:0x00004000
Nov 26 10:04:49 viper kernel: Call Trace:
Nov 26 10:04:49 viper kernel:  <TASK>
Nov 26 10:04:49 viper kernel:  __schedule+0x2fa/0x910
Nov 26 10:04:49 viper kernel:  schedule+0x4f/0xc0
Nov 26 10:04:49 viper kernel:  schedule_timeout+0x8a/0x140
Nov 26 10:04:49 viper kernel:  ? __bpf_trace_tick_stop+0x10/0x10
Nov 26 10:04:49 viper kernel:  io_schedule_timeout+0x51/0x80
Nov 26 10:04:49 viper kernel:  __cv_timedwait_common+0x131/0x170 [spl]
Nov 26 10:04:49 viper kernel:  ? wait_woken+0x80/0x80
Nov 26 10:04:49 viper kernel:  __cv_timedwait_io+0x19/0x20 [spl]
Nov 26 10:04:49 viper kernel:  zio_wait+0x133/0x2c0 [zfs]
Nov 26 10:04:49 viper kernel:  dsl_pool_sync+0xcc/0x4f0 [zfs]
Nov 26 10:04:49 viper kernel:  spa_sync+0x55a/0xff0 [zfs]
Nov 26 10:04:49 viper kernel:  ? spa_txg_history_init_io+0x106/0x110 [zfs]
Nov 26 10:04:49 viper kernel:  txg_sync_thread+0x2d3/0x460 [zfs]
Nov 26 10:04:49 viper kernel:  ? txg_init+0x260/0x260 [zfs]
Nov 26 10:04:49 viper kernel:  thread_generic_wrapper+0x79/0x90 [spl]
Nov 26 10:04:49 viper kernel:  ? __thread_exit+0x20/0x20 [spl]
Nov 26 10:04:49 viper kernel:  kthread+0x12b/0x150
Nov 26 10:04:49 viper kernel:  ? set_kthread_struct+0x50/0x50
Nov 26 10:04:49 viper kernel:  ret_from_fork+0x22/0x30
Nov 26 10:04:49 viper kernel:  </TASK>
Nov 26 11:47:32 viper kernel: INFO: task txg_sync:447490 blocked for more than 120 seconds.
Nov 26 11:47:32 viper kernel:       Tainted: P           O      5.13.19-4-pve #1
Nov 26 11:47:32 viper kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 26 11:47:32 viper kernel: task:txg_sync        state:D stack:    0 pid:447490 ppid:     2 flags:0x00004000
Nov 26 11:47:32 viper kernel: Call Trace:
Nov 26 11:47:32 viper kernel:  <TASK>
Nov 26 11:47:32 viper kernel:  __schedule+0x2fa/0x910
Nov 26 11:47:32 viper kernel:  schedule+0x4f/0xc0
Nov 26 11:47:32 viper kernel:  schedule_timeout+0x8a/0x140
Nov 26 11:47:32 viper kernel:  ? __bpf_trace_tick_stop+0x10/0x10
Nov 26 11:47:32 viper kernel:  io_schedule_timeout+0x51/0x80
Nov 26 11:47:32 viper kernel:  __cv_timedwait_common+0x131/0x170 [spl]
Nov 26 11:47:32 viper kernel:  ? wait_woken+0x80/0x80
Nov 26 11:47:32 viper kernel:  __cv_timedwait_io+0x19/0x20 [spl]
Nov 26 11:47:32 viper kernel:  zio_wait+0x133/0x2c0 [zfs]
Nov 26 11:47:32 viper kernel:  dsl_pool_sync+0xcc/0x4f0 [zfs]
Nov 26 11:47:32 viper kernel:  spa_sync+0x55a/0xff0 [zfs]
Nov 26 11:47:32 viper kernel:  ? spa_txg_history_init_io+0x106/0x110 [zfs]
Nov 26 11:47:32 viper kernel:  txg_sync_thread+0x2d3/0x460 [zfs]
Nov 26 11:47:32 viper kernel:  ? txg_init+0x260/0x260 [zfs]
Nov 26 11:47:32 viper kernel:  thread_generic_wrapper+0x79/0x90 [spl]
Nov 26 11:47:32 viper kernel:  ? __thread_exit+0x20/0x20 [spl]
Nov 26 11:47:32 viper kernel:  kthread+0x12b/0x150
Nov 26 11:47:32 viper kernel:  ? set_kthread_struct+0x50/0x50
Nov 26 11:47:32 viper kernel:  ret_from_fork+0x22/0x30
Nov 26 11:47:32 viper kernel:  </TASK>

For discussion/illustration - IO Flow without and with OpenZFS

image

ryao commented 1 year ago

Does running echo 0 | sudo tee /sys/module/zfs/parameters/zfs_dmu_offset_next_sync make a difference?

kyle0r commented 1 year ago

I will re-test and let you know @ryao. Thanks for the suggestion.

ZupoLlask commented 1 year ago

Hi @kyle0r, do you have any updates to share with us? Thanks

kyle0r commented 1 year ago

Thank you for the zfs_dmu_offset_next_sync=0 suggestion. I have tested it.

Unfortunately the suggestion didn't move the needle in the desired direction...

The performance degradation pattern was still present (if a little different) and then dropped off a cliff.

image

Before and after the change:

root@viper:~# cat /sys/module/zfs/parameters/zfs_dmu_offset_next_sync
1
root@viper:~# echo 0 | sudo tee /sys/module/zfs/parameters/zfs_dmu_offset_next_sync
0
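(Side note: a change made via /sys like this only lasts until the module is reloaded or the host reboots. If it had helped and should persist, the usual approach on a Debian/Proxmox host would be something like the sketch below - paths assumed, adjust as needed.)

echo "options zfs zfs_dmu_offset_next_sync=0" >> /etc/modprobe.d/zfs.conf
update-initramfs -u   # so the option is also applied when the module loads at early boot
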
kyle0r commented 1 year ago

To continue trying to figure this out (process of elimination), I've made an investment in a new drive (due to arrive next week): a Seagate Exos Enterprise 10E2400, introduced 2017-Q4? This is a 12Gb/s SAS drive, 2.4TB, 10K RPM, with a 16GB flash cache, and CMR rather than SMR like the 5TB Barracuda drives I tested in post 1.

Naturally the Exos drive is in a different performance/quality class, but nonetheless it will be interesting to see how it compares in the re-runs of test 1 and test 3 - and whether the new drive displays the ~30% degradation or not.

With my hardware setup I don't expect to have issues mixing SATA and SAS, and in the worst case I can probably move drives around to ensure that 4 bays (single cable on the passive backplane) are used for SAS drives. Kudos to my friend Alasdair for suggesting I double check this aspect.

I'll write an update once the drive arrives and I've performed some testing.

Some reflections since post 1

Q: Why do the ZFS tests sometimes experience an ~80% seq write performance drop and never recover? See the graph from my last comment, and the graph in test 4. I wonder if this is similar to the issues detected by ServeTheHome with WD RED SMR drives, when the Z-RAID resilvering took waaaaaay longer than expected (~9 days vs. ~14 hours). Article here. Vid here.
image

A: I don't know yet but I'd love to see some OpenZFS devs chime in here. Could it be related to OpenZFS issue #9130? Could OpenZFS detect and do something to heal these issues as they occur?

Q: What are the impacts of the 30% degradation on my use cases?

A: I'd assume that for day-to-day relatively small IO workloads the issue might be obfuscated. However, for larger or long-running sequential workloads I think the issue will make itself visible and slow things down. Consider larger (longer) snapraid parity syncs and scrubs. I do 5% nightly scrubs to keep on top of bitrot and exercise the disks to proactively detect disk reliability issues - 5% of the ~15 TB array takes ~30 mins and disk throughput ranges from 370-930 MiB/s. This approach keeps the oldest scrubbed block under 30 days old. I'd strongly assume there would be noteworthy performance degradation in the nightly scrubs and larger syncs. Consider also disk rebuilds/replacement - one can assume this operation would be at least 30% slower than with the current disk setup and might also fall victim to the 80% performance drop-off. From my testing shared in post 1, the current 2.61 TiB parity file took ~6h 16m to rsync without ZFS (test 3) and ~7h 26m with ZFS (test 1 - the fastest ZFS test), which is an ~18.4% increase in runtime for ZFS test 1. In general, disk rebuilds would very likely take longer than this rsync testing due to the nature of restoring many non-contiguous blocks, meaning the ZFS performance penalty would cost an even greater amount of extra time.
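(Rough cross-check of those scrub numbers: 5% of ~15 TB is ~750 GiB, and ~750 GiB in ~30 minutes works out to ~425 MiB/s aggregate across the array, which sits comfortably inside the 370-930 MiB/s range quoted above.)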

Q: Could I get a performance boost by not using SMR drives for the data disks? Given they are already migrated to ZFS...

A: I'm going with the assumption that if the 10K RPM drive doesn't exhibit the same issues as the SMR drives, then there will be performance gains to be had just by upgrading the drives to non-SMR. There is a cost-prohibitive issue: there don't appear to be any 2.5" high-capacity non-SMR spindle-based disks - so SSD would be the next choice.

A note on SMR drives

I've known for a while that SMR drives are generally a bad idea, but back in 2017 options for 2.5" 5TB disks were limited and I wasn't using ZFS back then. The cost per TB and density per drive bay/slot of the Barracudas has always been very attractive for long term storage.
Fast forward to now: I'm using ZFS (single disk pools) and a 10 GbE network (which justifies faster drives), and I've done more research on SMR drives specifically in combination with ZFS - it seems like a huge no-no (great summary post here with many good links). I guess I'm fortunate that I don't have any Z-RAID pools (yet) and chose to keep things simple.

I certainly like the idea of following Jim Salter's (@jimsalterjrs) advice on using striped mirrors and the benefit of how well performance scales in that configuration. Need to solve my open issues first!

The future is probably flash drives anyway right?

I'd welcome suggestions on 5TB+ SATA SSDs that work well with ZFS. I'm researching options for SATA SSDs to replace the 5TB Barracudas. Right now I'm thinking it's a good long term goal to cycle out the SMR spindles (HDD) for flash (SSD) to eliminate this 30% degradation issue (assuming right now it's related to ZFS + SMR drives), and also take advantage of the other benefits of flash media vs. spindles. My use cases are typically file server and sequential workloads, and when I need lots of small IOPS I already have Optane on hand.

To mitigate the performance degradation I've documented here, I guess I could consider smaller than 5TB SSDs and use more of them, with the downside that it would reduce the overall storage capacity of the chassis. I'm not sure that would be more cost effective than investing in the larger SSDs.

It's not clear when SATA-based SSDs will become legacy (if they aren't already) and manufacturers will stop launching new drives. It would be really nice (for my use cases at least) to see cost effective, fast and reliable ~5TB SSDs before SATA SSDs become extinct.

For my next chassis, which I'm currently pricing, I'll likely be going with NVMe U.2 support and shipping the current SATA/SAS chassis to co-lo hosting to form my online off-site backup. It's unclear when high capacity U.2 drives will become affordable for my use cases, so for now I'll be looking at SATA drives and then upgrade to U.2 as the costs come down.

High capacity SATA SSDs on my radar so far:

| Introduced | Current price | Product | Observations |
| --- | --- | --- | --- |
| 2020-Q2? | ~435 EUR | Samsung 870 QVO SATA 8TB | could be a bad idea (cache drop off issues?) |
| 2021-Q2? | ~690 EUR | Samsung OEM Datacenter SSD PM893 7.68TB | |
| 2020-Q2? | ~950 EUR | Kingston DC500R Data Center Series Read-Centric SSD - 0.6DWPD 7.68TB, SED, SATA | |
| 2018-Q3? | ~960 EUR | Solidigm (Intel) SSD D3-S4510 7.68TB, 2.5", SATA | |

I also need to be careful and/or cognizant of SSDs that support Data Set Management TRIM and Deterministic Read Zero after TRIM (RZAT), per the LSI HBA KB article. I know that my rpool (boot and root ZFS mirror pool) Crucial MX500 SSDs cannot trim when attached to the LSI HBA because of this factor.

vfs.zfs.vdev.trim_max_active is probably an interesting tunable to remember - the current default seems a little overkill.
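(For reference, vfs.zfs.vdev.trim_max_active is the FreeBSD sysctl spelling; on Linux the equivalent appears to be the zfs_vdev_trim_max_active module parameter, e.g.:

cat /sys/module/zfs/parameters/zfs_vdev_trim_max_active
echo 1 | tee /sys/module/zfs/parameters/zfs_vdev_trim_max_active

The value 1 here is just an illustration, not a recommendation.)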

ryao commented 1 year ago

I cannot comment on the other things right now, but I strongly suspect that FreeNAS 11.3's ZFS driver was missing the sequential resilver code that had been merged into ZFSOnLinux a few years earlier. That should help drives to resilver faster and should reduce the impact that SMR has.

kyle0r commented 1 year ago

SATA SMR vs. SAS CMR ... *FIGHT*

Disks on test:
src: 2.5" Seagate Barracuda (SMR) - 6Gb/s SATA drive, 5TB 5600 RPM (per post 1). dst: 2.5" Seagate Exos Enterprise 10E2400 (CMR) - 12Gb/s SAS drive, 2.4TB 10K RPM with a 16GB flash cache (this post).

For each test, the XFS file system was provisioned with 2000 GiB, which is roughly 10% less than the maximum capacity of the disk. Note that this is smaller than in the post 1 tests because of the smaller disk size, but everything else about the tests remains the same. It is expected that the src to dst rsync jobs will fail due to lack of space for the 2.61TiB snapraid parity file. Nevertheless, the test approach should still provide good comparisons, as the problems detected will have presented themselves before the tests run out of space.

The same src disk and snapraid parity file were used in the test re-runs per post 1 tests.

Recap on what's being compared

  1. From post 1 SATA SMR vs. this post SAS CMR tests #1 and #3 - with-ZFS and without-ZFS respectively.
  2. SAS CMR tests #1 and #3 - with-ZFS and without-ZFS respectively.

re-run of test #3 (without zfs)

A bit of an academic test at this point, but it doesn't hurt for consistency. This re-run also provides a good baseline comparison for the re-run of test #1. The SATA SMR drives in post 1 test #3 performed within the manufacturer's expected thresholds. The same is expected for the SAS CMR drive.

In this re-run with the CMR SAS drive there was nothing remarkable vs. post 1 test #3. Smooth and consistent IOPS and throughput within expected manufacturer thresholds.

src to dst rsync (seq write test of the dst disk)

The src to dst write IO was stable, with an avg transfer rate of 129.97 MiB/s for the 1.95 TiB transferred, which took ~4h 22m (space ran out as expected because the dst disk is smaller than the src disk and file).
The write job runtime was 15741 seconds to transfer 2045926.75 MiB (1.95 TiB).
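As a quick cross-check of those numbers: 2045926.75 MiB / 15741 s ≈ 130 MiB/s, which lines up with the reported average transfer rate.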

Some graphs from the hypervisor:

system

image

src read

image

image

image

dst write

image

image

image

Observations

  1. The MiB/s performance for the rsync job is increased by ~4.4% vs. the re-run of test #1. So this non-zfs test was slightly faster.
  2. The MiB/s throughput is almost identical to the average read speed of the src disk in post 1 test #3.
  3. The write performance of the dst disk was bottlenecked by the src disk (as expected).

dst checksum (seq read test of the dst disk)

image

227.7 MiB/s average.
Read runtime was 8984 seconds (~2h 29m) for 2045926.75 MiB (1.95 TiB).

Some graphs from the hypervisor:

system

image

dst read

image

image

image

Observations

  1. This test vs. re-run test #1 sees an ~6.7% increase in performance.

re-run of test #1 (with zfs)

Here is what the sas-test pool looked like: image

src to dst rsync (seq write test of the dst disk)

21633-trollface-thumbs-up-15percent

The src to dst write IO was stable, with an avg transfer rate of 124.5 MiB/s for the 1.95 TiB parity file, which took ~4h 33m to transfer (space ran out as expected because the dst disk is smaller than the src disk and file). The write job runtime was 16435 seconds to transfer 2045926.75 MiB (1.95 TiB).

Some graphs from the hypervisor:

System disk IO image

src disk image

image

image

dst disk image

image

image

Observations

  1. The MiB/s throughput is almost identical to the average read speed of the src disk in post 1 test #3.
  2. The MiB/s throughput is very close to the average write speed of the dst disk in the re-run of test #3 - slower by ~4%.
  3. The write performance of the dst disk was bottlenecked by the src disk (as expected).
  4. Key point: With this src/dst combo write IO test, the OpenZFS performance degradation pattern that the SATA SMR drives exhibited was not present on this SAS CMR drive. I do wonder whether things would look different if the src disk were as fast as or faster than the dst disk. I have made a note to test this again once I have a large capacity SSD to act as the src disk.

dst checksum (seq read test of the dst disk)

image

image

This read job doesn't look as consistent as the re-run of test #3 - and seems similar to the dst checksum IO pattern from post 1 test #1 and other zfs tests. 213.4 MiB/s average. Read runtime was 9586 seconds (~2h 40m) for 2045926.75 MiB (1.95 TiB).

image

zoomed in on the end of the previous graph (note the y-axis scale differs on the following (1) graph(s) vs. the other graphs):

image

image

image

image

image

ZFS ARC related graphs for the read job

image

image

Observations

  1. This test is ~10 minutes slower than test #3 which is a ~6.2% decrease in transfer throughput.
  2. The slowest MiB/s dips in the graphs are also lower than in test #3
  3. There is a clear correlation: as the 'average time for I/O requests issued to the device being served' increases, IOPS decrease and throughput also decreases, while the average I/O size remains constant (see the relation sketched below).
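For context on why those curves move together (a general storage relation, not anything ZFS specific): throughput = IOPS × average I/O size, and with a roughly constant number of outstanding I/Os, IOPS ≈ outstanding I/Os ÷ average service time. So with the I/O size constant, any increase in per-request service time that isn't compensated by a deeper queue shows up directly as lower IOPS and lower MiB/s.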

Zoomed in, it's clearer to see (note the y-axis scale differs on the following (3) graph(s) vs. the other graphs):

image

image

image

Summary of SMR vs. CMR from research in this issue

It seems that OpenZFS certainly suffers degraded performance when reading from and writing to my SATA SMR disks, sometimes with catastrophic write drops (e.g. post 6).

OpenZFS performs much better when writing to a SAS CMR disk. By my estimation from testing:
Relating to SEQ WRITE TESTS: My SATA SMR disk suffers ~30% performance degradation vs. the baseline test without ZFS 🤬
My SAS CMR disk is within ~4% of the baseline test without ZFS. 🥳

Summary of the tests in this post

However, the ZFS read I/O pattern on the SAS CMR is worrying, and really worrying on the SATA SMR disks. What worries me the most is that the pattern seems to be the same - just more pronounced on the SMR disks.

Here is a side by side view of the dst checksum jobs from this post:

image

What is up with the lack of ZFS I/O consistency per this post re-run of test #1 dst checksum job? OpenZFS is ~10 minutes slower in this test (~6.2% decrease in transfer throughput). The SAS CMR disk in the test is about as good as money can buy, and the hardware it runs on is of the same generation as the disk - the server specs are decent enough. Should OpenZFS be performing much closer to the raw disk speed per test #3?
Test #3 in this post demonstrates that the SAS CMR disk performs at around its manufacturer's published performance maximums without ZFS, but performance sees a ~6% degradation when ZFS is introduced.

I could forgive ZFS a few percent of performance degradation, but ~6%? And with that IO pattern? Something feels off to me.

I thought SMR might be the smoking gun to explain all the performance issues; it does seem to be the smoking gun for the write IO issues... but what about read IO?

How can I help further diagnose this IO weirdness? This is a home lab/office setup - so I can be somewhat flexible on what I'd be willing to try out.

While nothing is 100% - I think this test approach rules out that something is wrong with the test system or methodology, but I'm open to criticism on anything I've shared here. I can provide more details if requested.

Going back to SMR test results from post 1

I think my overall conclusion(s) from post 1 are still valid.

I think my research reinforces the point about avoiding SMR disks when using ZFS or at the very least be aware of the pitfalls. However I'd also ask the OpenZFS developers to consider if my research could be useful in helping ZFS be more compatible with SMR drives.

I'd like to see the OpenZFS developers weigh in here. What is the Linux kernel code doing (or not doing) to maintain reasonably consistent SMR drive performance (post 1 test #3), or put the other way, what is the ZFS code doing (or not doing) to experience the degraded performance and sometimes catastrophic write IO drop off without recovery on my SMR drives? (e.g. in post 5) - could there be a relation to OpenZFS issue #9130?

image

i.e. Why does ZFS run into this drop off issue and why can't it detect this and heal/recover? Why does the kernel/non-ZFS NOT test suffer the same issues?

Keep in mind that these tests are doing seq read/writes to an empty single-disk pool in a relatively standard hypervisor/kvm setup. No fragmentation or the like.

What is next?

Research wise I think I've exhausted most avenues for now.

My reflection is that ZFS is still awesome - the benefits are very, very valuable, BUT there are some open questions for me on IO performance consistency and stability. I'll certainly be much more cognizant of this and more critical about disk choice in the future.

My 2.5" SATA SMR drives (most from ~2017) have played there data backup and archive role well. This research has proven that there are sub optimal aspects and pitfalls in using SMR drives with ZFS. As evidenced in this issue they are not performant with ZFS and personally I wouldn't want to use them with Z-RAID. I'd imagine they would be OK in striped mirror pool but wouldn't achieve their manufacturer published performance figures - expect ~30% degradation.

It feels like a pretty safe bet that SATA or U.2 SSDs will be my next storage medium - at least for a chassis that I want to perform well on a 10+ GbE network and work optimally with ZFS.

I think I'll pause my snapraid parity migration to ZFS for now and see if anything changes in the next year or so. My 5% nightly scrub strategy should detect any parity bit rot within 30 days. What are my risks? Any data recovered from parity will be checksum verified. My understanding of snapraid internals is that parity blocks that become corrupt would result in recovered files that are corrupt BUT this would be detected because of the independent checksums that snapraid maintains. This would limit the corrupt blocks in the recovered data to the blocks that were corrupt in parity. I think the chances of a data disk having corrupt blocks or a disk failure are relatively high, the same goes for parity, but the chances of simultaneous problems with data disks/blocks and parity disks/blocks are relatively low. It should be impossible to experience silent corruption. In the case of Murphy's Law - I've got a second copy of all blocks as a backup.

I'll leave this issue open for now to see if anyone else wants to increase the test result sample size and/or if the OpenZFS developers want to look at this.

Should I open a new issue to look more closely at the weird read I/O performance pattern?

ryao commented 1 year ago

My suggestion is to set recordsize=1M, which should mitigate the negative effects of SMR by minimizing the amount of RMW needed to do writes. Also, if you do not need atime updates, which you likely do not, set atime=off.
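For anyone following along, a minimal sketch of applying both suggestions (the dataset name is a hypothetical placeholder):

zfs set recordsize=1M tank/parity
zfs set atime=off tank/parity
zfs get recordsize,atime tank/parity

Note that recordsize only affects newly written blocks, so an existing file needs to be rewritten (e.g. copied again) before the larger records take effect.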

kyle0r commented 1 year ago

@ryao wrote:

My suggestion is to set recordsize=1M, which should mitigate the negative effects of SMR by minimizing the amount of RMW needed to do writes. Also, if you do not need atime updates, which you likely do not, set atime=off.

Acknowledged - I'll give both options a test on one of the SATA SMR drives and see if it helps. I did try testing with recordsize=256K (the same as the snapraid parity default block size) but that didn't seem to help. AFAIK for snapraid parity drives which store a single large file - atime=off should be fine.

FYI, I am familiar with RMW topics per here. I'm not an expert in understanding the behaviour 100% yet though.

Any comments on the weird read IO on the SAS drive?

ryao commented 1 year ago

It is hard to say without data on the number of outstanding IOs to go with that chart. That said, I feel like it is somewhat related to this:

https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/ZIO%20Scheduler.html

There were also some fixes done recently in master to improve prefetch behavior that might help.

kyle0r commented 1 year ago

Ok - I'll have a read 👍 thank you.

number of outstanding IOs to go with that chart

Does this help? - here are all the graphs that netdata provides for individual disks for the test #1 dst checksum seq read job for the SAS CMR disk:

image

image

image

image

image

kyle0r commented 1 year ago

The re-test with recordsize=1M atime=off on the SATA SMR didn't help. Catastrophic performance drop off per graph:

image

I will re-test with recordsize=default (which seemed more stable in the past) and atime=off.
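(For the record, resetting a property back to its inherited/default value can be done with zfs inherit, e.g. zfs inherit recordsize tank/parity - dataset name hypothetical.)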

kyle0r commented 1 year ago

atime=off doesn't seem to help.

image

jxdking commented 1 year ago

For zfs, I would stay away from SMR drives. I still remember I had trouble doing "zpool replace" on a single-SMR-disk pool. It never completed and degraded the pool! I ended up just dd'ing it to a CMR drive.

severgun commented 1 year ago

It feels like a pretty safe bet that SATA or U.2 SSDs will be my next storage medium - at least for a chassis that I want to perform well on a 10+ GbE network and work optimally with ZFS.

I think ZFS is not suitable for SSDs at all. Keep in mind the insane write amplification. Almost always you will hear "just buy an expensive SSD". But this is not a solution; it just delays the problem.

amotin commented 1 year ago

I think ZFS is not suitable for SSDs at all. Keep in mind the insane write amplification.

You can think whatever you want. ZFS has a lot of functionality that comes at a certain cost. But it is not insane. On the other side, ZFS tries to write sequentially, aggregate small I/Os into bigger ones, and support TRIM/UNMAP -- it does what it can to help SSDs. And it can be fast too -- I've given presentations where ZFS does 20GB/s of read/write or 30GB/s of scrub. Just don't ask it to do the impossible, like multiple millions of random 4KB writes -- functionality does have a cost and overhead.

Almost always you will hear "just buy an expensive SSD".

There is a reason why we have a separate hardware qualification team -- even with enterprise market devices there are plenty of issues. It goes 10x in cheap consumer market. Don't buy cheap at least, even though it is not a guarantee.

severgun commented 1 year ago

ZFS has a lot of functionality that comes at a certain cost.

And users still plan to build ZFS on SSDs without knowing that cost. There are no warnings in the docs about that.

aggregate small I/Os into bigger ones

By writing them into a ZIL that is placed on the same SSD? This feature is about performance, not about helping SSD endurance.

functionality does have a cost and overhead.

Yes. But sometimes the cost becomes unbearable.

amotin commented 1 year ago

By writing them into a ZIL that is placed on the same SSD? This feature is about performance, not about helping SSD endurance.

ZIL is used only when an application calls fsync(). Many applications do not require fsync(). If fsync() is needed for data safety in a specific application -- there is a cost. And even then, if writes are big enough and there is no SLOG, then data blocks are written only once and just referenced by the ZIL. If sync writes are a critical part of the workload and the main pool devices can't sustain them, then add a SLOG that can, like NVDIMM, a write-optimized SSD, etc. Conversely, if you don't care, you can always set sync=disabled for a specific dataset, disabling the ZIL completely.
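To make those last two options concrete, a rough sketch (device and dataset names are hypothetical placeholders):

# add a dedicated SLOG device to absorb sync writes
zpool add tank log /dev/disk/by-id/nvme-write-optimized-ssd
# or accept that recently "synced" data may be lost on crash/power failure
# (pool consistency itself is not affected) and skip the ZIL for one dataset
zfs set sync=disabled tank/scratch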