openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

zfs 2.2.4 on Linux: RIP: e030:LZ4_uncompress_unknownOutputSize+0x4c4/0x780 [zfs] #16384

Closed JKDingwall closed 1 month ago

JKDingwall commented 1 month ago

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 22.04 / Jammy
Kernel Version 6.8.0-40
Architecture amd64
OpenZFS Version 2.2.4

Describe the problem you're observing

The system is a virtual machine host with block devices allocated as zvols. During guest startup, several instances of the following trace are logged and I/O becomes unresponsive. There are also some earlier reports of hung tasks, but I don't know whether those are related. The configuration is a new build of our existing environment with both zfs (was 2.1.15) and the kernel (was 5.15.0) updated. I did see https://github.com/openzfs/zfs/issues/12775, but that thread has been quiet for 18+ months.

Describe how to reproduce the problem

Provision zvols for virtual machine (Xen) block storage and then try to boot the guests. For legacy reasons the zpool is initially created with zfs 0.8.3(?) and then upgraded with `zpool upgrade` during the first boot. All zpool features are enabled. We are using a local build of the kernel in order to patch in AUFS support. During VM start the dom0 memory is also ballooning down, and it seems the zfs_arc_max (un)tunable can end up equal to total system RAM. Is that a problem?

# cat zfs_arc_max
37038726144

# xl list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0 64280     6     r-----     472.0

# xl mem-set 0 $((16 * 1024))

# xl list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0 16384     6     r-----     530.7

# cat zfs_arc_max
17179869184

16 * 1024^3 = 17179869184
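The arithmetic can be checked outside ZFS; a minimal sketch (pure Python, values copied from the transcript above):

```python
# Values observed above: after ballooning dom0 down to 16 GiB,
# zfs_arc_max tracked dom0 memory exactly.
observed_arc_max = 17179869184     # bytes, from `cat zfs_arc_max`
dom0_mem_mib = 16384               # MiB, from `xl list` after mem-set

dom0_mem_bytes = dom0_mem_mib * 1024**2
assert observed_arc_max == dom0_mem_bytes

# The ARC ceiling now equals ALL of dom0's memory, leaving no
# headroom for the kernel, Xen tooling, or other allocations.
print(observed_arc_max)  # 17179869184
```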

Include any warning/errors/backtraces from the system logs

On the console:

[  397.444580] watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [zvol_tq-0:37189]
[  397.452715] watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [zvol_tq-0:36729]
[  401.439551] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [zvol_tq-0:373]
[  401.447829] watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [zvol_tq-0:36724]
[  401.455202] watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [zvol_tq-0:37190]

Kernel log:

[  423.923192] RIP: e030:LZ4_uncompress_unknownOutputSize+0x4c4/0x780 [zfs]
[  423.923700] Code: 00 4d 89 f4 4d 39 de 0f 83 22 ff ff ff 48 83 fe 0f 0f 86 fb 01 00 00 48 8b 0b 48 83 c2 20 48 83 c3 20 48 89 4a e0 48 8b 4b e8 <48> 89 4a e8 48 8b 4b f0 48 89 4a f0 48 8b 4b f8 48 89 4a f8 4c 39
[  423.923705] RSP: e02b:ffffc9004f3ab658 EFLAGS: 00000286
[  423.923711] RAX: ffffc9004bcea150 RBX: ffffc900716f7d39 RCX: 0000000000000000
[  423.923715] RDX: ffffc900716f7db9 RSI: 0000000000000080 RDI: 0000000000000080
[  423.923719] RBP: ffffc9004f3ab6d0 R08: ffffc9004bce5004 R09: ffffc9004bcf1b24
[  423.923722] R10: ffffc9007170b000 R11: ffffc9007170afc0 R12: ffffc900716f7dd8
[  423.923726] R13: ffffc900716f7d99 R14: ffffc900716f7dd8 R15: ffffc900716eb000
[  423.923736]  ? abd_iterate_func+0x11e/0x1b0 [zfs]
[  423.924317]  ? __pfx_abd_copy_to_buf_off_cb+0x10/0x10 [zfs]
[  423.924739]  lz4_decompress_zfs+0x3d/0x60 [zfs]
[  423.926300]  zio_decompress_data_buf+0xa1/0x140 [zfs]
[  423.926741]  ? abd_borrow_buf_copy+0x88/0x90 [zfs]
[  423.927204]  zio_decompress_data+0x51/0x90 [zfs]
[  423.927640]  arc_buf_fill+0x116/0xd40 [zfs]
[  423.928102]  arc_buf_alloc_impl.isra.0+0x24c/0x330 [zfs]
[  423.928520]  arc_read+0x1488/0x1790 [zfs]
[  423.928949]  ? __pfx_dbuf_read_done+0x10/0x10 [zfs]
[  423.929424]  dbuf_read_impl.constprop.0+0x490/0x850 [zfs]
[  423.929859]  dbuf_read+0xf4/0x640 [zfs]
[  423.930329]  dbuf_hold_impl+0x4a8/0x830 [zfs]
[  423.930783]  dbuf_hold_impl+0x32c/0x830 [zfs]
[  423.932672]  ? dmu_zfetch_prepare+0x550/0xf30 [zfs]
[  423.933132]  dbuf_hold+0x33/0x70 [zfs]
[  423.933694]  dmu_buf_hold_array_by_dnode+0x155/0x6b0 [zfs]
[  423.934142]  ? spl_kvmalloc+0x84/0xc0 [spl]
[  423.934177]  dmu_read_uio_dnode+0x5a/0x150 [zfs]
[  423.936790]  ? update_io_ticks+0x73/0x80
[  423.936804]  zvol_read+0x127/0x430 [zfs]
[  423.937235]  zvol_read_task+0x12/0x30 [zfs]
[  423.937663]  taskq_thread+0x27f/0x4c0 [spl]
[  423.937696]  ? __pfx_default_wake_function+0x10/0x10
[  423.937709]  ? __pfx_zvol_read_task+0x10/0x10 [zfs]
[  423.938168]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[  423.938197]  kthread+0xef/0x120
[  423.938205]  ? __pfx_kthread+0x10/0x10
[  423.938211]  ret_from_fork+0x44/0x70
[  423.938218]  ? __pfx_kthread+0x10/0x10
[  423.938223]  ret_from_fork_asm+0x1b/0x30
[  423.938233]  </TASK>
[  423.938236] ---[ end trace 0000000000000000 ]---

# zpool get all
NAME       PROPERTY                       VALUE                          SOURCE
diskconvm  size                           1.74T                          -
diskconvm  capacity                       4%                             -
diskconvm  altroot                        -                              default
diskconvm  health                         ONLINE                         -
diskconvm  guid                           12693268832595388421           -
diskconvm  version                        -                              default
diskconvm  bootfs                         -                              default
diskconvm  delegation                     on                             default
diskconvm  autoreplace                    off                            default
diskconvm  cachefile                      -                              default
diskconvm  failmode                       wait                           default
diskconvm  listsnapshots                  on                             local
diskconvm  autoexpand                     on                             local
diskconvm  dedupratio                     1.00x                          -
diskconvm  free                           1.66T                          -
diskconvm  allocated                      82.1G                          -
diskconvm  readonly                       off                            -
diskconvm  ashift                         12                             local
diskconvm  comment                        -                              default
diskconvm  expandsize                     -                              -
diskconvm  freeing                        0                              -
diskconvm  fragmentation                  0%                             -
diskconvm  leaked                         0                              -
diskconvm  multihost                      off                            default
diskconvm  checkpoint                     -                              -
diskconvm  load_guid                      16636724333282575354           -
diskconvm  autotrim                       off                            default
diskconvm  compatibility                  off                            default
diskconvm  bcloneused                     0                              -
diskconvm  bclonesaved                    0                              -
diskconvm  bcloneratio                    1.00x                          -
diskconvm  feature@async_destroy          enabled                        local
diskconvm  feature@empty_bpobj            active                         local
diskconvm  feature@lz4_compress           active                         local
diskconvm  feature@multi_vdev_crash_dump  enabled                        local
diskconvm  feature@spacemap_histogram     active                         local
diskconvm  feature@enabled_txg            active                         local
diskconvm  feature@hole_birth             active                         local
diskconvm  feature@extensible_dataset     active                         local
diskconvm  feature@embedded_data          active                         local
diskconvm  feature@bookmarks              enabled                        local
diskconvm  feature@filesystem_limits      enabled                        local
diskconvm  feature@large_blocks           enabled                        local
diskconvm  feature@large_dnode            enabled                        local
diskconvm  feature@sha512                 enabled                        local
diskconvm  feature@skein                  enabled                        local
diskconvm  feature@edonr                  enabled                        local
diskconvm  feature@userobj_accounting     active                         local
diskconvm  feature@encryption             enabled                        local
diskconvm  feature@project_quota          active                         local
diskconvm  feature@device_removal         enabled                        local
diskconvm  feature@obsolete_counts        enabled                        local
diskconvm  feature@zpool_checkpoint       enabled                        local
diskconvm  feature@spacemap_v2            active                         local
diskconvm  feature@allocation_classes     enabled                        local
diskconvm  feature@resilver_defer         enabled                        local
diskconvm  feature@bookmark_v2            enabled                        local
diskconvm  feature@redaction_bookmarks    enabled                        local
diskconvm  feature@redacted_datasets      enabled                        local
diskconvm  feature@bookmark_written       enabled                        local
diskconvm  feature@log_spacemap           active                         local
diskconvm  feature@livelist               active                         local
diskconvm  feature@device_rebuild         enabled                        local
diskconvm  feature@zstd_compress          enabled                        local
diskconvm  feature@draid                  enabled                        local
diskconvm  feature@zilsaxattr             active                         local
diskconvm  feature@head_errlog            active                         local
diskconvm  feature@blake3                 enabled                        local
diskconvm  feature@block_cloning          enabled                        local
diskconvm  feature@vdev_zaps_v2           active                         local

# zpool status
  pool: diskconvm
 state: ONLINE
config:

        NAME                                                                    STATE     READ WRITE CKSUM
        diskconvm                                                               ONLINE       0     0     0
          mirror-0                                                              ONLINE       0     0     0
            disk/by-partid/ata-SAMSUNG_MZ7L31T9HBLT-00A07_S6ESNE0T510485-part6  ONLINE       0     0     0
            disk/by-partid/ata-SAMSUNG_MZ7L31T9HBLT-00A07_S6ESNE0T510513-part6  ONLINE       0     0     0

errors: No known data errors

# uname -a
Linux <hostname> 6.8.0-40-generic #40~22.04.1 SMP PREEMPT_DYNAMIC Mon Jul 22 18:19:19 UTC  x86_64 x86_64 x86_64 GNU/Linux

# zfs version
zfs-2.2.4-5_g674a6de37a
zfs-kmod-2.2.4-5_g674a6de37a
amotin commented 1 month ago

I don't see how this can be related to lz4 or to CPUs getting stuck, but since you mentioned heavy ballooning I have to mention that ZFS is currently quite reluctant to react to the kernel's memory pressure. See zfs_arc_shrinker_limit and https://github.com/openzfs/zfs/pull/16197 . I wonder, if the system is over-provisioned that much, whether it could cause some swapping on the VM host side, which would look like random CPU freezes to the guest?
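For reference, zfs_arc_shrinker_limit can be inspected like any other ZFS module parameter under the standard Linux sysfs path; a small sketch (the helper name is mine, not a ZFS API):

```python
from pathlib import Path

# Standard location for Linux module parameters.
PARAMS = Path("/sys/module/zfs/parameters")

def read_zfs_param(name):
    """Return a ZFS module parameter as a string, or None if the
    zfs module is not loaded (or the parameter does not exist)."""
    p = PARAMS / name
    return p.read_text().strip() if p.exists() else None

# zfs_arc_shrinker_limit caps how many pages a single shrinker
# invocation may reclaim from the ARC, which is one reason the ARC
# can be slow to shrink under ballooning-induced memory pressure.
for param in ("zfs_arc_shrinker_limit", "zfs_arc_max", "zfs_arc_min"):
    print(param, "=", read_zfs_param(param))
```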

JKDingwall commented 1 month ago

With some further investigation it seems that trying to do anything 'clever' with the zfs_arc_{min,max} tunables results in an unhappy system. I'd guess the common error in LZ4_uncompress_unknownOutputSize could be the result of a memory allocation not being checked for success, but I've not looked into that further. Leaving those tunables alone, the behaviour seems stable as the dom0 memory balloons. I've patched our zfs build to discourage tinkering:

diff --git a/module/zfs/arc.c b/module/zfs/arc.c
index 195364013..a7fb0c449 100644
--- a/module/zfs/arc.c
+++ b/module/zfs/arc.c
@@ -10662,10 +10662,10 @@ EXPORT_SYMBOL(arc_add_prune_callback);
 EXPORT_SYMBOL(arc_remove_prune_callback);

 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, min, param_set_arc_min,
-       spl_param_get_u64, ZMOD_RW, "Minimum ARC size in bytes");
+       spl_param_get_u64, ZMOD_RD, "Minimum ARC size in bytes");

 ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, max, param_set_arc_max,
-       spl_param_get_u64, ZMOD_RW, "Maximum ARC size in bytes");
+       spl_param_get_u64, ZMOD_RD, "Maximum ARC size in bytes");

 ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, meta_balance, UINT, ZMOD_RW,
        "Balance between metadata and data on ghost hits.");

For the time being I'd say this is a 'user error' problem and the behaviour with defaults is good.