openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.43k stars 1.73k forks source link

Random LZ4_uncompress_unknownOutputSize erro #12775

Open nim-odoo opened 2 years ago

nim-odoo commented 2 years ago

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 20.04
Kernel Version 5.4.0-90-generic
Architecture x86_64
OpenZFS Version 0.8.3-1ubuntu12.12

Describe the problem you're observing

On our system, we observe kernel errors due to LZ4_uncompress_unknownOutputSize. It happens on very rare occasions: it happened 3 times within a 6 weeks period on 3 different servers (we have a total of 20 bare metal servers, provider A). It hasn't happened (yet) on the virtualized servers (24 servers, provider B).

The main issue is that the complete zpool becomes unavailable: the services using it cannot be restarted, and there is no way to unmount/remount the datasets. Even a soft reboot doesn't succeed: the only solution is the hard reboot.

Therefore, the question is: are we observing the cause or the consequence of a problem? How can this situation of LZ4_uncompress_unknownOutputSize be triggered? Moreover, is there a way to recover the zpool without the hard reboot? It seems that when kernel traceback occurs, our only solution is the hard reboot (we already faced it with zfs send / receive).

Describe how to reproduce the problem

It happens randomly on very rare occasions. We don't have a way to reproduce it.

Include any warning/errors/backtraces from the system logs

BUG: kernel NULL pointer dereference, address: 000000000000000f
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0 
Oops: 0002 [#1] SMP NOPTI
CPU: 9 PID: 1258910 Comm: postgres Tainted: P           O      5.4.0-84-generic #94-Ubuntu
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./E3C246D4U2-2T, BIOS L2.02P 10/19/2020
RIP: 0010:LZ4_uncompress_unknownOutputSize+0x158/0x270 [zfs]
Code: 73 11 0f b6 34 11 40 88 34 13 48 83 c2 01 49 39 d0 75 ef 49 39 fd 75 a0 b8 ff ff ff ff 5b 41 5c 41 5d 41 5e 41 5f 5d c3 b8 ff <00> 00 00 ba 0f 00 00 00 4d 39 d0 0f 83 ed fe ff ff 3d ff 
00 00 00
RSP: 0018:ffffb39ac1ceb850 EFLAGS: 00010246
RAX: 000000000000000f RBX: ffffb39b3f45224d RCX: ffffb39b3f452245
RDX: 000000000000000f RSI: 0000000000000001 RDI: ffffb39b3f45224c
RBP: ffffb39ac1ceb878 R08: ffff951de75c81ef R09: 00000000000000f4
R10: ffff951de75cfa23 R11: ffffb39b3f471ff8 R12: ffffb39b3f452000
R13: ffffb39b3f472000 R14: ffff951de75cfa1b R15: ffffb39b3f471ff7
FS:  00007f3f4bbe6ec0(0000) GS:ffff952c7f440000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000000f CR3: 0000000cabf5a001 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 lz4_decompress_zfs+0x2b/0x30 [zfs]
 zio_decompress_data_buf+0x3c/0x70 [zfs]
 zio_decompress_data+0x49/0xb0 [zfs]
 arc_buf_fill+0x259/0x4a0 [zfs]
 arc_buf_alloc_impl.isra.0+0x161/0x250 [zfs]
 arc_read+0xea8/0x1100 [zfs]
 ? dbuf_rele_and_unlock+0x4e0/0x4e0 [zfs]
 ? atomic_dec_return+0x9/0x10 [zfs]
 dbuf_read_impl+0x1d2/0x610 [zfs]
 dbuf_read+0xcb/0x5f0 [zfs]
 dmu_buf_hold_array_by_dnode+0x113/0x490 [zfs]
 dmu_read_uio_dnode+0x49/0xf0 [zfs]
 ? zrl_add_impl+0x31/0xb0 [zfs]
 dmu_read_uio_dbuf+0x47/0x60 [zfs]
 zfs_read+0x117/0x300 [zfs]
 zpl_read_common_iovec+0x99/0xe0 [zfs]
 zpl_iter_read_common+0xa8/0x100 [zfs]
 zpl_iter_read+0x58/0x80 [zfs]
 new_sync_read+0x122/0x1b0
 __vfs_read+0x29/0x40
 vfs_read+0xab/0x160
 ksys_pread64+0x6d/0xa0
 __x64_sys_pread64+0x1e/0x20
 do_syscall_64+0x57/0x190
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f3f4f46ebca
Code: ff ff ff eb c5 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5e c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
RSP: 002b:00007ffdb1fedb38 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
RAX: ffffffffffffffda RBX: 00007f3f141fc080 RCX: 00007f3f4f46ebca
RDX: 0000000000002000 RSI: 00007f3f141fc080 RDI: 000000000000002b
RBP: 00007ffdb1fedb80 R08: 000000000a00000d R09: 0000000000000000
R10: 00000000000a0000 R11: 0000000000000246 R12: 00007f3f4694f5f0
R13: 00000000000a0000 R14: 0000000000002000 R15: 0000000000000001
Modules linked in: ufs msdos xfs xt_recent nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt nf_log_ipv4 nf_log_common ipt_REJECT nf_reject_ipv4 xt_LOG xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables iptable_filter bpfilter zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) isofs nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_ssif kvm_intel kvm joydev input_leds ipmi_si ipmi_devintf ipmi_msghandler video acpi_tad acpi_pad sch_fq_codel msr ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear ast i2c_algo_bit drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect crct10dif_pclmul sysimgblt crc32_pclmul fb_sys_fops hid_generic ghash_clmulni_intel uas aesni_intel crypto_simd raid1 usbhid ixgbe cryptd ahci nvme xfrm_algo drm glue_helper usb_storage
 hid dca mdio libahci nvme_core
CR2: 000000000000000f
---[ end trace c1034398822c7327 ]---
RIP: 0010:LZ4_uncompress_unknownOutputSize+0x158/0x270 [zfs]
Code: 73 11 0f b6 34 11 40 88 34 13 48 83 c2 01 49 39 d0 75 ef 49 39 fd 75 a0 b8 ff ff ff ff 5b 41 5c 41 5d 41 5e 41 5f 5d c3 b8 ff <00> 00 00 ba 0f 00 00 00 4d 39 d0 0f 83 ed fe ff ff 3d ff 00 00 00
RSP: 0018:ffffb39ac1ceb850 EFLAGS: 00010246
RAX: 000000000000000f RBX: ffffb39b3f45224d RCX: ffffb39b3f452245
RDX: 000000000000000f RSI: 0000000000000001 RDI: ffffb39b3f45224c
RBP: ffffb39ac1ceb878 R08: ffff951de75c81ef R09: 00000000000000f4
R10: ffff951de75cfa23 R11: ffffb39b3f471ff8 R12: ffffb39b3f452000
R13: ffffb39b3f472000 R14: ffff951de75cfa1b R15: ffffb39b3f471ff7
FS:  00007f3f4bbe6ec0(0000) GS:ffff952c7f440000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000000f CR3: 0000000cabf5a001 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
nim-odoo commented 2 years ago

cc @odony @julienlegros

rincebrain commented 2 years ago

I'm guessing wildly, but what does zpool get all [pool] return for the pool(s) on one of those systems?

odony commented 2 years ago

I'm guessing wildly, but what does zpool get all [pool] return for the pool(s) on one of those systems?

One of the systems has this:

# zpool get all dpool
NAME   PROPERTY                       VALUE                          SOURCE
dpool  size                           1.69T                          -
dpool  capacity                       5%                             -
dpool  altroot                        -                              default
dpool  health                         ONLINE                         -
dpool  guid                           15648689621491837198           -
dpool  version                        -                              default
dpool  bootfs                         -                              default
dpool  delegation                     on                             default
dpool  autoreplace                    off                            default
dpool  cachefile                      -                              default
dpool  failmode                       wait                           default
dpool  listsnapshots                  off                            default
dpool  autoexpand                     on                             local
dpool  dedupditto                     0                              default
dpool  dedupratio                     1.00x                          -
dpool  free                           1.59T                          -
dpool  allocated                      101G                           -
dpool  readonly                       off                            -
dpool  ashift                         12                             local
dpool  comment                        -                              default
dpool  expandsize                     -                              -
dpool  freeing                        0                              -
dpool  fragmentation                  14%                            -
dpool  leaked                         0                              -
dpool  multihost                      off                            default
dpool  checkpoint                     -                              -
dpool  load_guid                      7007957781465625946            -
dpool  autotrim                       on                             local
dpool  feature@async_destroy          enabled                        local
dpool  feature@empty_bpobj            active                         local
dpool  feature@lz4_compress           active                         local
dpool  feature@multi_vdev_crash_dump  enabled                        local
dpool  feature@spacemap_histogram     active                         local
dpool  feature@enabled_txg            active                         local
dpool  feature@hole_birth             active                         local
dpool  feature@extensible_dataset     active                         local
dpool  feature@embedded_data          active                         local
dpool  feature@bookmarks              enabled                        local
dpool  feature@filesystem_limits      enabled                        local
dpool  feature@large_blocks           enabled                        local
dpool  feature@large_dnode            enabled                        local
dpool  feature@sha512                 enabled                        local
dpool  feature@skein                  enabled                        local
dpool  feature@edonr                  enabled                        local
dpool  feature@userobj_accounting     active                         local
dpool  feature@encryption             active                         local
dpool  feature@project_quota          active                         local
dpool  feature@device_removal         enabled                        local
dpool  feature@obsolete_counts        enabled                        local
dpool  feature@zpool_checkpoint       enabled                        local
dpool  feature@spacemap_v2            active                         local
dpool  feature@allocation_classes     enabled                        local
dpool  feature@resilver_defer         enabled                        local
dpool  feature@bookmark_v2            enabled                        local
rincebrain commented 2 years ago

dpool feature@encryption active local

I hate being right (or at least, not proven wrong).

The only bug I know of in recent memory that can cause what looks for all the world like "hey, this buffer suddenly turned into a NULL pointer when I was midway through operating on it, after it passed various points where it should have crashed first if it were NULL" is #11679, which requires encryption. (The bug, as I currently understand it, basically involves the reference count on a buffer being screwed up, so two places think they have the only pointer to the buffer, and then (among other possible fun outcomes) if one of them frees or destroys it, Bad Times Ensue.)

I do not have any other proof of this being the case, so please don't take this as definitive, but that would be my "thing to investigate next" if I were picking bets at the moment, since I haven't seen any other reports of this turn up.

nim-odoo commented 2 years ago

Thanks a lot for your insight, we'll follow #11679. To be complete, we noticed a slightly different trace in the 3 incidents.

Incident 1:

invalid opcode: 0000 [#1] SMP NOPTI
CPU: 11 PID: 3831033 Comm: postgres Tainted: P           O      5.4.0-84-generic #94-Ubuntu
RIP: 0010:LZ4_uncompress_unknownOutputSize+0x3e/0x270 [zfs]
...
Call Trace:
  lz4_decompress_zfs+0x2b/0x30 [zfs]
  zio_decompress_data_buf+0x3c/0x70 [zfs]
  zio_decompress_data+0x49/0xb0 [zfs]
  arc_buf_fill+0x259/0x4a0 [zfs]
  arc_buf_alloc_impl.isra.0+0x161/0x250 [zfs]
  arc_read+0xea8/0x1100 [zfs]
  ? dbuf_rele_and_unlock+0x4e0/0x4e0 [zfs]
  ? atomic_dec_return+0x9/0x10 [zfs]
  dbuf_read_impl+0x1d2/0x610 [zfs]
  ? atomic_inc+0xd/0x20 [zfs]
  dbuf_read+0xcb/0x5f0 [zfs]
  dmu_buf_hold_array_by_dnode+0x113/0x490 [zfs]
  dmu_read_uio_dnode+0x49/0xf0 [zfs]
  ? zrl_add_impl+0x31/0xb0 [zfs]
  dmu_read_uio_dbuf+0x47/0x60 [zfs]
  zfs_read+0x117/0x300 [zfs]
  zpl_read_common_iovec+0x99/0xe0 [zfs]
  zpl_iter_read_common+0xa8/0x100 [zfs]
  zpl_iter_read+0x58/0x80 [zfs]
  new_sync_read+0x122/0x1b0
  __vfs_read+0x29/0x40
  vfs_read+0xab/0x160
  ksys_pread64+0x6d/0xa0
  __x64_sys_pread64+0x1e/0x20
  do_syscall_64+0x57/0x190
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Incident 2:

BUG: unable to handle page fault for address: 0000000008d03690
CPU: 8 PID: 2824511 Comm: postgres Tainted: P           O      5.4.0-84-generic #94-Ubuntu
RIP: 0010:0x8d03690
Code: Bad RIP value.
Call Trace:
 ? LZ4_uncompress_unknownOutputSize+0x102/0x270 [zfs]
 lz4_decompress_zfs+0x2b/0x30 [zfs]
 zio_decompress_data_buf+0x3c/0x70 [zfs]
 zio_decompress_data+0x49/0xb0 [zfs]
 arc_buf_fill+0x259/0x4a0 [zfs]
 arc_buf_alloc_impl.isra.0+0x161/0x250 [zfs]
 arc_read+0xea8/0x1100 [zfs]
 ? dbuf_rele_and_unlock+0x4e0/0x4e0 [zfs]
 ? atomic_dec_return+0x9/0x10 [zfs]
 dbuf_read_impl+0x1d2/0x610 [zfs]
 ? atomic_inc+0xd/0x20 [zfs]
 dbuf_read+0xcb/0x5f0 [zfs]
 dmu_buf_hold_array_by_dnode+0x113/0x490 [zfs]
 dmu_read_uio_dnode+0x49/0xf0 [zfs]
 ? zrl_add_impl+0x31/0xb0 [zfs]
 dmu_read_uio_dbuf+0x47/0x60 [zfs]
 zfs_read+0x117/0x300 [zfs]
 zpl_read_common_iovec+0x99/0xe0 [zfs]
 zpl_iter_read_common+0xa8/0x100 [zfs]
 zpl_iter_read+0x58/0x80 [zfs]
 new_sync_read+0x122/0x1b0
 __vfs_read+0x29/0x40
 vfs_read+0xab/0x160
 ksys_pread64+0x6d/0xa0
 __x64_sys_pread64+0x1e/0x20
 do_syscall_64+0x57/0x190
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Incident 3:

BUG: kernel NULL pointer dereference, address: 000000000000000f
CPU: 9 PID: 1258910 Comm: postgres Tainted: P           O      5.4.0-84-generic #94-Ubuntu
RIP: 0010:LZ4_uncompress_unknownOutputSize+0x158/0x270 [zfs]
Call Trace:
  lz4_decompress_zfs+0x2b/0x30 [zfs]
  zio_decompress_data_buf+0x3c/0x70 [zfs]
  ...
 (stack trace in first message)
nim-odoo commented 2 years ago

Ah, 4th incident on the same machine than incident 2. The BUG match (unable to handle page fault for address)

kernel tried to execute NX-protected page - exploit attempt? (uid: 115)
BUG: unable to handle page fault for address: ffffbdc80096b888
Oops: 0011 [#1] SMP NOPTI
CPU: 8 PID: 4103458 Comm: postgres Tainted: P           O      5.4.0-90-generic #101-Ubuntu
Call Trace:
 ? lz4_decompress_zfs+0x2b/0x30 [zfs]
 ? zio_decompress_data_buf+0x3c/0x70 [zfs]
 ? zio_decompress_data+0x49/0xb0 [zfs]
 ? arc_buf_fill+0x259/0x4a0 [zfs]
 ? arc_buf_alloc_impl.isra.0+0x161/0x250 [zfs]
 ? arc_read+0xea8/0x1100 [zfs]
 ? dbuf_rele_and_unlock+0x4e0/0x4e0 [zfs]
 ? atomic_dec_return+0x9/0x10 [zfs]
 ? dbuf_read_impl+0x1d2/0x610 [zfs]
 ? dbuf_read+0xcb/0x5f0 [zfs]
 ? dmu_buf_hold_array_by_dnode+0x113/0x490 [zfs]
 ? dmu_read_uio_dnode+0x49/0xf0 [zfs]
 ? zrl_add_impl+0x31/0xb0 [zfs]
 ? dmu_read_uio_dbuf+0x47/0x60 [zfs]
 ? zfs_read+0x117/0x300 [zfs]
 ? zpl_read_common_iovec+0x99/0xe0 [zfs]
 ? zpl_iter_read_common+0xa8/0x100 [zfs]
 ? zpl_iter_read+0x58/0x80 [zfs]
 ? new_sync_read+0x122/0x1b0
 ? __vfs_read+0x29/0x40
 ? vfs_read+0xab/0x160
 ? ksys_pread64+0x6d/0xa0
 ? __x64_sys_pread64+0x1e/0x20
 ? do_syscall_64+0x57/0x190
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
sbidoul commented 2 years ago

We have hit the same issue today.

Type Version/Name
Distribution Name Ubuntu
Distribution Version 20.04.3 LTS
Kernel Version 5.4.0-91-generic #102-Ubuntu
Architecture x86_64
OpenZFS Version zfs-0.8.3-1ubuntu12.13

One noticeable operation that was running at that moment (and froze) was a recv of an unencrypted dataset.

general protection fault: 0000 [#1] SMP NOPTI
CPU: 13 PID: 1129912 Comm: zfs Tainted: P           O      5.4.0-91-generic #102-Ubuntu
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./E3C246D4U2-2T, BIOS L2.02K 12/18/2019
RIP: 0010:LZ4_uncompress_unknownOutputSize+0x3e/0x270 [zfs]
Code: 57 41 56 41 55 4c 8d 2c 0e 41 54 53 4c 39 d7 0f 83 32 02 00 00 48 89 f8 49 89 f4 4d 8d 5d f8 48 89 f7 4d 8d 72 f8 4d 8d 7d f7 <4c> 8d 40 01 0f b6 00 41 89 c1 c1 e8 04 89 c2 83 f8 0f 0f 84 00 01
RSP: 0018:ffffa8bb345676a8 EFLAGS: 00010297
RAX: ffff953bed6cae09 RBX: ffff953bed6ca20a RCX: ffff953bed6ca202
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff953bed6ca207
RBP: ffffa8bb345676d0 R08: ffff953bed6cae05 R09: 0000000000000001
R10: ffff953bed6cae1c R11: ffff953bed6ca3f8 R12: ffff953bed6ca200
R13: ffff953bed6ca400 R14: ffff953bed6cae14 R15: ffff953bed6ca3f7
FS:  00007f450f2577c0(0000) GS:ffff9547ff540000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f450ed80000 CR3: 0000000abcce2002 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 lz4_decompress_zfs+0x2b/0x30 [zfs]
 zio_decompress_data_buf+0x3c/0x70 [zfs]
 zio_decompress_data+0x49/0xb0 [zfs]
 zio_decompress+0x5a/0xd0 [zfs]
 zio_pop_transforms+0x74/0x90 [zfs]
 zio_done+0x23e/0xe20 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 ? zio_ready+0x25e/0x3f0 [zfs]
 ? abd_return_buf_copy+0x31/0x50 [zfs]
 zio_nowait+0x9e/0x120 [zfs]
 arc_read+0x8cb/0x1100 [zfs]
 ? dbuf_rele_and_unlock+0x4e0/0x4e0 [zfs]
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 dbuf_read_impl+0x1d2/0x610 [zfs]
 ? zio_null+0x2f/0x40 [zfs]
 dbuf_read+0xcb/0x5f0 [zfs]
 ? dmu_buf_hold_noread+0xc5/0x120 [zfs]
 dmu_buf_hold+0x62/0x90 [zfs]
 zap_lockdir+0x4f/0xb0 [zfs]
 zap_lookup_norm+0x5a/0xc0 [zfs]
 zap_lookup+0x16/0x20 [zfs]
 dsl_prop_get_dd+0x192/0x230 [zfs]
 dsl_prop_get_ds+0x172/0x210 [zfs]
 ? __list_add+0x17/0x40 [zfs]
 ? redundant_metadata_changed_cb+0x20/0x20 [zfs]
 dsl_prop_get_int_ds+0x1e/0x20 [zfs]
 dsl_prop_register+0x41/0x160 [zfs]
 dmu_objset_open_impl+0x966/0x9f0 [zfs]
 dmu_objset_from_ds+0xec/0x180 [zfs]
 dmu_objset_hold_flags+0x84/0xd0 [zfs]
 dmu_objset_hold+0x16/0x20 [zfs]
 zfs_ioc_objset_stats+0x31/0x80 [zfs]
 zfsdev_ioctl+0x5c8/0x690 [zfs]
 do_vfs_ioctl+0x407/0x670
 ? do_user_addr_fault+0x216/0x450
 ksys_ioctl+0x67/0x90
 __x64_sys_ioctl+0x1a/0x20
 do_syscall_64+0x57/0x190
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f450f84b50b
Code: 0f 1e fa 48 8b 05 85 39 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 55 39 0d 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe94486ee8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffe94486f10 RCX: 00007f450f84b50b
RDX: 00007ffe94486f10 RSI: 0000000000005a12 RDI: 0000000000000003
RBP: 00007ffe94486f00 R08: 00007f450ed80010 R09: 0000000000000000
R10: 0000000000000022 R11: 0000000000000246 R12: 0000561440267150
R13: 00007ffe94486f10 R14: 0000561440267150 R15: 00007ffe9448aa80
Modules linked in: xt_multiport binfmt_misc macvlan nf_tables nfnetlink vhost_vsock vmw_vsock_virtio_transport_common vsock vhost uas usb_storage nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) ipmi_ssif zcommon(PO) znvpair(PO) spl(O) kvm_intel kvm input_leds joydev ipmi_si ipmi_devintf ipmi_msghandler video acpi_pad acpi_tad 8021q garp mrp stp llc nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt nf_log_ipv4 nf_log_common ipt_REJECT nf_reject_ipv4 xt_LOG xt_limit xt_addrtype xt_tcpudp sch_fq_codel xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables iptable_filter bpfilter msr ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear hid_generic usbhid hid raid1 i2c_algo_bit crct10dif_pclmul drm_vram_helper crc32_pclmul ttm ghash_clmulni_intel drm_kms_helper aesni_intel syscopyarea sysfillrect
 sysimgblt crypto_simd ixgbe fb_sys_fops cryptd glue_helper nvme xfrm_algo drm dca ahci nvme_core mdio libahci
---[ end trace ec4d2057154f3d6d ]---
RIP: 0010:LZ4_uncompress_unknownOutputSize+0x3e/0x270 [zfs]
Code: 57 41 56 41 55 4c 8d 2c 0e 41 54 53 4c 39 d7 0f 83 32 02 00 00 48 89 f8 49 89 f4 4d 8d 5d f8 48 89 f7 4d 8d 72 f8 4d 8d 7d f7 <4c> 8d 40 01 0f b6 00 41 89 c1 c1 e8 04 89 c2 83 f8 0f 0f 84 00 01
RSP: 0018:ffffa8bb345676a8 EFLAGS: 00010297
RAX: ffff953bed6cae09 RBX: ffff953bed6ca20a RCX: ffff953bed6ca202
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff953bed6ca207
RBP: ffffa8bb345676d0 R08: ffff953bed6cae05 R09: 0000000000000001
R10: ffff953bed6cae1c R11: ffff953bed6ca3f8 R12: ffff953bed6ca200
R13: ffff953bed6ca400 R14: ffff953bed6cae14 R15: ffff953bed6ca3f7
FS:  00007f450f2577c0(0000) GS:ffff9547ff540000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f450ed80000 CR3: 0000000abcce2002 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
rincebrain commented 2 years ago

We have hit the same issue today.

Type Version/Name Distribution Name Ubuntu Distribution Version 20.04.3 LTS Kernel Version 5.4.0-91-generic #102-Ubuntu Architecture x86_64 OpenZFS Version zfs-0.8.3-1ubuntu12.13 One noticeable operation that was running at that moment (and froze) was a recv of an unencrypted dataset.

general protection fault: 0000 [#1] SMP NOPTI
CPU: 13 PID: 1129912 Comm: zfs Tainted: P           O      5.4.0-91-generic #102-Ubuntu
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./E3C246D4U2-2T, BIOS L2.02K 12/18/2019
RIP: 0010:LZ4_uncompress_unknownOutputSize+0x3e/0x270 [zfs]
Code: 57 41 56 41 55 4c 8d 2c 0e 41 54 53 4c 39 d7 0f 83 32 02 00 00 48 89 f8 49 89 f4 4d 8d 5d f8 48 89 f7 4d 8d 72 f8 4d 8d 7d f7 <4c> 8d 40 01 0f b6 00 41 89 c1 c1 e8 04 89 c2 83 f8 0f 0f 84 00 01
RSP: 0018:ffffa8bb345676a8 EFLAGS: 00010297
RAX: ffff953bed6cae09 RBX: ffff953bed6ca20a RCX: ffff953bed6ca202
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff953bed6ca207
RBP: ffffa8bb345676d0 R08: ffff953bed6cae05 R09: 0000000000000001
R10: ffff953bed6cae1c R11: ffff953bed6ca3f8 R12: ffff953bed6ca200
R13: ffff953bed6ca400 R14: ffff953bed6cae14 R15: ffff953bed6ca3f7
FS:  00007f450f2577c0(0000) GS:ffff9547ff540000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f450ed80000 CR3: 0000000abcce2002 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 lz4_decompress_zfs+0x2b/0x30 [zfs]
 zio_decompress_data_buf+0x3c/0x70 [zfs]
 zio_decompress_data+0x49/0xb0 [zfs]
 zio_decompress+0x5a/0xd0 [zfs]
 zio_pop_transforms+0x74/0x90 [zfs]
 zio_done+0x23e/0xe20 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 ? zio_ready+0x25e/0x3f0 [zfs]
 ? abd_return_buf_copy+0x31/0x50 [zfs]
 zio_nowait+0x9e/0x120 [zfs]
 arc_read+0x8cb/0x1100 [zfs]
 ? dbuf_rele_and_unlock+0x4e0/0x4e0 [zfs]
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 dbuf_read_impl+0x1d2/0x610 [zfs]
 ? zio_null+0x2f/0x40 [zfs]
 dbuf_read+0xcb/0x5f0 [zfs]
 ? dmu_buf_hold_noread+0xc5/0x120 [zfs]
 dmu_buf_hold+0x62/0x90 [zfs]
 zap_lockdir+0x4f/0xb0 [zfs]
 zap_lookup_norm+0x5a/0xc0 [zfs]
 zap_lookup+0x16/0x20 [zfs]
 dsl_prop_get_dd+0x192/0x230 [zfs]
 dsl_prop_get_ds+0x172/0x210 [zfs]
 ? __list_add+0x17/0x40 [zfs]
 ? redundant_metadata_changed_cb+0x20/0x20 [zfs]
 dsl_prop_get_int_ds+0x1e/0x20 [zfs]
 dsl_prop_register+0x41/0x160 [zfs]
 dmu_objset_open_impl+0x966/0x9f0 [zfs]
 dmu_objset_from_ds+0xec/0x180 [zfs]
 dmu_objset_hold_flags+0x84/0xd0 [zfs]
 dmu_objset_hold+0x16/0x20 [zfs]
 zfs_ioc_objset_stats+0x31/0x80 [zfs]
 zfsdev_ioctl+0x5c8/0x690 [zfs]
 do_vfs_ioctl+0x407/0x670
 ? do_user_addr_fault+0x216/0x450
 ksys_ioctl+0x67/0x90
 __x64_sys_ioctl+0x1a/0x20
 do_syscall_64+0x57/0x190
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f450f84b50b
Code: 0f 1e fa 48 8b 05 85 39 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 55 39 0d 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe94486ee8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffe94486f10 RCX: 00007f450f84b50b
RDX: 00007ffe94486f10 RSI: 0000000000005a12 RDI: 0000000000000003
RBP: 00007ffe94486f00 R08: 00007f450ed80010 R09: 0000000000000000
R10: 0000000000000022 R11: 0000000000000246 R12: 0000561440267150
R13: 00007ffe94486f10 R14: 0000561440267150 R15: 00007ffe9448aa80
Modules linked in: xt_multiport binfmt_misc macvlan nf_tables nfnetlink vhost_vsock vmw_vsock_virtio_transport_common vsock vhost uas usb_storage nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) ipmi_ssif zcommon(PO) znvpair(PO) spl(O) kvm_intel kvm input_leds joydev ipmi_si ipmi_devintf ipmi_msghandler video acpi_pad acpi_tad 8021q garp mrp stp llc nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt nf_log_ipv4 nf_log_common ipt_REJECT nf_reject_ipv4 xt_LOG xt_limit xt_addrtype xt_tcpudp sch_fq_codel xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables iptable_filter bpfilter msr ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear hid_generic usbhid hid raid1 i2c_algo_bit crct10dif_pclmul drm_vram_helper crc32_pclmul ttm ghash_clmulni_intel drm_kms_helper aesni_intel syscopyarea sysfillrect
 sysimgblt crypto_simd ixgbe fb_sys_fops cryptd glue_helper nvme xfrm_algo drm dca ahci nvme_core mdio libahci
---[ end trace ec4d2057154f3d6d ]---
RIP: 0010:LZ4_uncompress_unknownOutputSize+0x3e/0x270 [zfs]
Code: 57 41 56 41 55 4c 8d 2c 0e 41 54 53 4c 39 d7 0f 83 32 02 00 00 48 89 f8 49 89 f4 4d 8d 5d f8 48 89 f7 4d 8d 72 f8 4d 8d 7d f7 <4c> 8d 40 01 0f b6 00 41 89 c1 c1 e8 04 89 c2 83 f8 0f 0f 84 00 01
RSP: 0018:ffffa8bb345676a8 EFLAGS: 00010297
RAX: ffff953bed6cae09 RBX: ffff953bed6ca20a RCX: ffff953bed6ca202
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff953bed6ca207
RBP: ffffa8bb345676d0 R08: ffff953bed6cae05 R09: 0000000000000001
R10: ffff953bed6cae1c R11: ffff953bed6ca3f8 R12: ffff953bed6ca200
R13: ffff953bed6ca400 R14: ffff953bed6cae14 R15: ffff953bed6ca3f7
FS:  00007f450f2577c0(0000) GS:ffff9547ff540000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f450ed80000 CR3: 0000000abcce2002 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Curious. Does it repeat every time if you receive that same dataset?

sbidoul commented 2 years ago

Curious. Does it repeat every time if you receive that same dataset?

No. It has been running every 15 minutes for months, and it happened only once so far.

nim-odoo commented 2 years ago

Hello,

I would like to re-launch the discussion here. We have moved forward with our ZFS deployment on Ubuntu 20.04 / 22.04, and we met several tracebacks leading to a freeze of the zpool.

Incident 5

zfs-0.8.3-1ubuntu12.14
zfs-kmod-0.8.3-1ubuntu12.14
general protection fault: 0000 [#1] SMP NOPTI
CPU: 2 PID: 35075 Comm: z_rd_int Tainted: P           O      5.4.0-110-generic #124-Ubuntu
Hardware name: To Be Filled By O.E.M. E3C252D4U-2T/E3C252D4U-2T/OVH, BIOS 1.31.OV02 01/28/2022
RIP: 0010:fletcher_4_avx512f_native+0x3e/0x90 [zcommon]
Code: 53 48 89 f3 e8 a3 ff ff ff 62 d1 fe 48 6f 45 00 62 d1 fe 48 6f 4d 01 62 d1 fe 48 6f 55 02 62 d1 fe 48 6f 5d 03 4c 39 e3 73 27 <62> f2 7d 48 35 23 62 f1 fd 48 d4 c4 62 f1 f5 48 d4 c8 
62 f1 ed 48
RSP: 0018:ffffc12d825f38c8 EFLAGS: 00010087
RAX: 0000000000000000 RBX: 6000e22604d0f000 RCX: ffff9d6bb5bf8c00
RDX: 00000000ffffffff RSI: 6000e22604d0f000 RDI: ffff9d791e294000
RBP: ffffc12d825f38e0 R08: ffffc12d825f39e0 R09: 000001a5940ae000
R10: ffff9d79567a3d18 R11: 0000000000000000 R12: 6000e22604d13000
R13: ffffc12d825f3a00 R14: 6000e22604d0f000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff9d795f880000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f04badc8024 CR3: 000000029fe68001 CR4: 0000000000760ee0
PKRU: 55555554
Call Trace:
 abd_fletcher_4_iter+0x6a/0x90 [zcommon]
 ? abd_fletcher_4_simd2scalar+0x60/0x60 [zcommon]
 abd_iterate_func+0x6b/0xc0 [zfs]
 abd_fletcher_4_impl+0x39/0x50 [zfs]
 abd_fletcher_4_native+0x52/0x70 [zfs]
 ? alloc_pages_current+0x87/0xe0
 ? check_preempt_curr+0x7a/0x90
 ? ttwu_do_wakeup+0x1e/0x150
 zio_checksum_error_impl+0x120/0x6a0 [zfs]
 ? abd_fletcher_4_impl+0x50/0x50 [zfs]
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 ? __switch_to_asm+0x40/0x70
 ? ktime_get_raw_ts64+0x3e/0xd0
 ? vdev_queue_class_to_issue+0xc0/0xf0 [zfs]
 zio_checksum_error+0x6c/0xd0 [zfs]
 zio_checksum_verify+0x3b/0x150 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 ? zio_wait_for_children+0x94/0xc0 [zfs]
 ? zio_vdev_io_assess+0x26/0x280 [zfs]
 zio_execute+0x93/0xe0 [zfs]
 taskq_thread+0x245/0x430 [spl]
 ? wake_up_q+0x70/0x70
 ? zio_execute_stack_check.constprop.0+0x10/0x10 [zfs]
 kthread+0x104/0x140
 ? task_done+0x90/0x90 [spl]
 ? kthread_park+0x90/0x90
 ret_from_fork+0x1f/0x40
Modules linked in: xt_recent ufs msdos xfs xt_nat veth xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat br_netfilter bridge stp llc aufs overlay zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) isofs nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_ssif kvm_intel input_leds joydev kvm ipmi_si acpi_tad video sch_fq_codel nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt nf_log_ipv4 nf_log_common ipt_REJECT nf_reject_ipv4 xt_LOG xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables iptable_filter bpfilter ipmi_devintf ipmi_msghandler msr ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear ast i2c_algo_bit drm_vram_helper ttm drm_kms_helper crct10dif_pclmul crc32_pclmul syscopyarea ghash_clmulni_intel sysfillrect sysimgblt
 hid_generic raid1 aesni_intel fb_sys_fops crypto_simd usbhid cryptd drm i40e glue_helper ahci hid nvme libahci nvme_core
---[ end trace 32b9c8af5a059aff ]---
RIP: 0010:fletcher_4_avx512f_native+0x3e/0x90 [zcommon]
nim-odoo commented 2 years ago

Incident 6

zfs-2.1.4-0ubuntu0.1
zfs-kmod-2.1.4-0ubuntu0.1
general protection fault, probably for non-canonical address 0xdead0000000000fc: 0000 [#2] SMP NOPTI
CPU: 4 PID: 3578865 Comm: z_wr_int_0 Tainted: P      D    O      5.15.0-46-generic #49-Ubuntu
Hardware name: To Be Filled By O.E.M. E3C252D4U-2T/E3C252D4U-2T/OVH, BIOS 1.31.OV02 01/28/2022
RIP: 0010:abd_gang_get_offset+0x42/0x90 [zfs]
Code: e8 23 f6 ff ff 48 85 c0 74 36 4d 8b 2c 24 48 89 c6 eb 17 49 29 c5 48 89 df 4d 89 2c 24 e8 36 f6 ff ff 48 89 c6 48 85 c0 74 16 <8b> 46 04 4c 39 e8 76 e1 5b 48 89 f0 41 5c 41 5d 5d c3
 cc cc cc cc
RSP: 0018:ffffa82d89e67ba8 EFLAGS: 00010282
RAX: dead0000000000f8 RBX: ffff8b69b2fc0868 RCX: 000000189888d000
RDX: 0000000000000000 RSI: dead0000000000f8 RDI: ffff8b69b2fc0878
RBP: ffffa82d89e67bc0 R08: 00000000000000c0 R09: ffff8b72b73d66d8
R10: 0000000000000000 R11: 0000000000000000 R12: ffffa82d89e67bd0
R13: 0000000000000000 R14: 0000000000000000 R15: ffff8b69f5fcf020
FS:  0000000000000000(0000) GS:ffff8b765fb00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055c5e17c7c28 CR3: 0000000106882003 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
 <TASK>
 abd_nr_pages_off+0x80/0xf0 [zfs]
 __vdev_disk_physio.constprop.0+0x1a1/0x300 [zfs]
 ? __cond_resched+0x1a/0x50
 ? down_read+0x13/0xa0
 ? RW_WRITE_HELD.constprop.0+0x9/0x30 [zfs]
 vdev_disk_io_start+0x95/0x360 [zfs]
 zio_vdev_io_start+0xf2/0x2d0 [zfs]
 ? vdev_queue_pending_add+0x63/0xa0 [zfs]
 zio_execute+0x8f/0xe0 [zfs]
 vdev_queue_io_done+0x9c/0x160 [zfs]
 zio_vdev_io_done+0x8e/0x1e0 [zfs]
 zio_execute+0x8f/0xe0 [zfs]
 taskq_thread+0x21e/0x400 [spl]
 ? wake_up_q+0x90/0x90
 ? zio_execute_stack_check.constprop.0+0x20/0x20 [zfs]
 ? param_set_taskq_kick+0xf0/0xf0 [spl]
 kthread+0x127/0x150
 ? set_kthread_struct+0x50/0x50
 ret_from_fork+0x1f/0x30
 </TASK>
Modules linked in: veth tls xt_nat nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink xfrm_user xfrm_algo br_netfilter bridge stp llc overlay isofs ip6t_REJECT nf_reject_ipv6 xt_hl ip6_tables ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nft_counter nf_tables nfnetlink nls_iso8859_1 ipmi_ssif zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) kvm_intel znvpair(PO) spl(O) kvm joydev input_leds efi_pstore acpi_ipmi ipmi_si video acpi_tad sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_devintf ipmi_msghandler msr ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper hid_generic syscopyarea sysfillrect sysimgblt crct10dif_pclmul
 fb_sys_fops crc32_pclmul usbhid ghash_clmulni_intel cec raid1 hid aesni_intel rc_core nvme crypto_simd ahci xhci_pci drm i40e cryptd libahci nvme_core xhci_pci_renesas
---[ end trace a93c18a4ea079033 ]---
RIP: 0010:fletcher_4_avx512f_native+0x3e/0xa0 [zcommon]
nim-odoo commented 2 years ago

Incident 7

zfs-0.8.3-1ubuntu12.14
zfs-kmod-0.8.3-1ubuntu12.14
BUG: unable to handle page fault for address: ffffffff40b8d4ae
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 44140e067 P4D 44140e067 PUD 0 
Oops: 0010 [#1] SMP NOPTI
CPU: 9 PID: 3102353 Comm: postgres Tainted: P           O      5.4.0-120-generic #136-Ubuntu
Hardware name: ASUSTeK COMPUTER INC. 3105-S21031001/P11C-M-10G-2T Series, BIOS 3105-S21031001 03/10/2021
RIP: 0010:LZ4_uncompress_unknownOutputSize+0x3e/0x270 [zfs]
Code: 57 41 56 41 55 4c 8d 2c 0e 41 54 53 4c 39 d7 0f 83 32 02 00 00 48 89 f8 49 89 f4 4d 8d 5d f8 48 89 f7 4d 8d 72 f8 4d 8d 7d f7 <4c> 8d 40 01 0f b6 00 41 89 c1 c1 e8 04 89 c2 83 f8 0f 0f
 84 00 01
RSP: 0018:ffffa2c30c043850 EFLAGS: 00010297
RAX: ffffa2c32d9fb5f5 RBX: ffffa2c37ddd9bab RCX: ffffa2c37ddd9ba3
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffa2c37ddd9ba8
RBP: ffffa2c30c043878 R08: ffffa2c32d9fb5d0 R09: 0000000000000001
R10: ffffa2c32da03abc R11: ffffa2c37ddf1ff8 R12: ffffa2c37ddd2000
R13: ffffa2c37ddf2000 R14: ffffa2c32da03ab4 R15: ffffa2c37ddf1ff7
FS:  00007f78618eaec0(0000) GS:ffff8b2e6ea40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff40b8d4ae CR3: 0000000b1efc4006 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 lz4_decompress_zfs+0x2b/0x30 [zfs]
 zio_decompress_data_buf+0x3c/0x70 [zfs]
 zio_decompress_data+0x49/0xb0 [zfs]
 arc_buf_fill+0x259/0x4a0 [zfs]
 arc_buf_alloc_impl.isra.0+0x161/0x250 [zfs]
 arc_read+0xea8/0x1100 [zfs]
 ? dbuf_rele_and_unlock+0x4e0/0x4e0 [zfs]
 ? atomic_dec_return+0x9/0x10 [zfs]
 dbuf_read_impl+0x1d2/0x610 [zfs]
 dbuf_read+0xcb/0x5f0 [zfs]
 dmu_buf_hold_array_by_dnode+0x113/0x490 [zfs]
 dmu_read_uio_dnode+0x49/0xf0 [zfs]
 ? zrl_add_impl+0x31/0xb0 [zfs]
 dmu_read_uio_dbuf+0x47/0x60 [zfs]
 zfs_read+0x117/0x300 [zfs]
 zpl_read_common_iovec+0x99/0xe0 [zfs]
 zpl_iter_read_common+0xa8/0x100 [zfs]
 zpl_iter_read+0x58/0x80 [zfs]
 new_sync_read+0x122/0x1b0
 __vfs_read+0x29/0x40
 vfs_read+0xab/0x160
 ksys_pread64+0x6d/0xa0
 __x64_sys_pread64+0x1e/0x20
 do_syscall_64+0x57/0x190
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f7865170c2a
Code: ff ff ff eb c5 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5e c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
RSP: 002b:00007fffdc6b0088 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
RAX: ffffffffffffffda RBX: 00007f7827242080 RCX: 00007f7865170c2a
RDX: 0000000000002000 RSI: 00007f7827242080 RDI: 0000000000000012
RBP: 00007fffdc6b00d0 R08: 000000000a00000d R09: 0000000000000000
R10: 000000000002c000 R11: 0000000000000246 R12: 00007f785c641fb0
R13: 000000000002c000 R14: 0000000000002000 R15: 0000000000000001
Modules linked in: ufs msdos xfs xt_recent zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) isofs nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_ssif kvm_intel kvm joydev input_leds ipmi_si video acpi_tad sch_fq_codel nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt nf_log_ipv4 nf_log_common ipt_REJECT nf_reject_ipv4 xt_LOG xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables iptable_filter bpfilter ipmi_devintf ipmi_msghandler msr ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear ast i2c_algo_bit drm_vram_helper ttm hid_generic drm_kms_helper raid1 ixgbe syscopyarea crct10dif_pclmul sysfillrect crc32_pclmul sysimgblt usbhid ghash_clmulni_intel fb_sys_fops aesni_intel uas crypto_simd xfrm_algo cryptd drm hid nvme glue_helper ahci dca usb_storage
 mdio libahci nvme_core
CR2: ffffffff40b8d4ae
---[ end trace aa592810eb96689b ]---
RIP: 0010:LZ4_uncompress_unknownOutputSize+0x3e/0x270 [zfs]
Code: 57 41 56 41 55 4c 8d 2c 0e 41 54 53 4c 39 d7 0f 83 32 02 00 00 48 89 f8 49 89 f4 4d 8d 5d f8 48 89 f7 4d 8d 72 f8 4d 8d 7d f7 <4c> 8d 40 01 0f b6 00 41 89 c1 c1 e8 04 89 c2 83 f8 0f 0f 84 00 01
RSP: 0018:ffffa2c30c043850 EFLAGS: 00010297
RAX: ffffa2c32d9fb5f5 RBX: ffffa2c37ddd9bab RCX: ffffa2c37ddd9ba3
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffa2c37ddd9ba8
RBP: ffffa2c30c043878 R08: ffffa2c32d9fb5d0 R09: 0000000000000001
R10: ffffa2c32da03abc R11: ffffa2c37ddf1ff8 R12: ffffa2c37ddd2000
R13: ffffa2c37ddf2000 R14: ffffa2c32da03ab4 R15: ffffa2c37ddf1ff7
FS:  00007f78618eaec0(0000) GS:ffff8b2e6ea40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff40b8d4ae CR3: 0000000b1efc4006 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
nim-odoo commented 2 years ago

In the end, I don't know how / if these are related to #11679. We use an encrypted dataset, but we do not use the zfs send / receive feature.

Although these crashes are uncommon, the increased number of machines involved make it look appear more and more frequently.

nim-odoo commented 1 year ago

Incident 8

zfs-0.8.3-1ubuntu12.14
zfs-kmod-0.8.3-1ubuntu12.14
BUG: unable to handle page fault for address: ffffffff001e0426
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 53900e067 P4D 53900e067 PUD 0 
Oops: 0010 [#1] SMP NOPTI
CPU: 13 PID: 301321 Comm: postgres Tainted: P           O      5.4.0-120-generic #136-Ubuntu
Hardware name: ASUSTeK COMPUTER INC. 3105-S21031001/P11C-M-10G-2T Series, BIOS 3105-S21031001 03/10/2021
RIP: 0010:LZ4_uncompress_unknownOutputSize+0x3e/0x270 [zfs]
Code: 57 41 56 41 55 4c 8d 2c 0e 41 54 53 4c 39 d7 0f 83 32 02 00 00 48 89 f8 49 89 f4 4d 8d 5d f8 48 89 f7 4d 8d 72 f8 4d 8d 7d f7 <4c> 8d 40 01 0f b6 00 41 89 c1 c1 e8 04 89 c2 83 f8 0f 0f
 84 00 01
RSP: 0018:ffffb68e8c153850 EFLAGS: 00010297
RAX: ffffb68ef47dd697 RBX: ffffb68f0051df6c RCX: ffffb68f0051df64
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffb68f0051df68
RBP: ffffb68e8c153878 R08: ffffb68ef47dd692 R09: 0000000000000000
R10: ffffb68ef47e67eb R11: ffffb68f00535ff8 R12: ffffb68f00516000
R13: ffffb68f00536000 R14: ffffb68ef47e67e3 R15: ffffb68f00535ff7
FS:  00007fadf68caec0(0000) GS:ffff933aeeb40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff001e0426 CR3: 0000000465702001 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 lz4_decompress_zfs+0x2b/0x30 [zfs]
 zio_decompress_data_buf+0x3c/0x70 [zfs]
 zio_decompress_data+0x49/0xb0 [zfs]
 arc_buf_fill+0x259/0x4a0 [zfs]
 arc_buf_alloc_impl.isra.0+0x161/0x250 [zfs]
 arc_read+0xea8/0x1100 [zfs]
 ? dbuf_rele_and_unlock+0x4e0/0x4e0 [zfs]
 ? atomic_dec_return+0x9/0x10 [zfs]
 dbuf_read_impl+0x1d2/0x610 [zfs]
 dbuf_read+0xcb/0x5f0 [zfs]
 dmu_buf_hold_array_by_dnode+0x113/0x490 [zfs]
 dmu_read_uio_dnode+0x49/0xf0 [zfs]
 ? zrl_add_impl+0x31/0xb0 [zfs]
 dmu_read_uio_dbuf+0x47/0x60 [zfs]
 zfs_read+0x117/0x300 [zfs]
 zpl_read_common_iovec+0x99/0xe0 [zfs]
 zpl_iter_read_common+0xa8/0x100 [zfs]
 zpl_iter_read+0x58/0x80 [zfs]
 new_sync_read+0x122/0x1b0
 __vfs_read+0x29/0x40
 vfs_read+0xab/0x160
 ksys_pread64+0x6d/0xa0
 __x64_sys_pread64+0x1e/0x20
 do_syscall_64+0x57/0x190
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fadfa150bca
Code: ff ff ff eb c5 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5e c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
RSP: 002b:00007ffe4bdc6868 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
RAX: ffffffffffffffda RBX: 00007fadeaf22080 RCX: 00007fadfa150bca
RDX: 0000000000002000 RSI: 00007fadeaf22080 RDI: 0000000000000011
RBP: 00007ffe4bdc68b0 R08: 000000000a00000d R09: 0000000000000000
R10: 0000000000030000 R11: 0000000000000246 R12: 00007fadf164e7e0
R13: 0000000000030000 R14: 0000000000002000 R15: 0000000000000001
Modules linked in: ufs msdos xfs xt_multiport isofs nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) ipmi_ssif zcommon(PO) znvpair(PO) spl(O) kvm_intel kvm joydev input_leds ipmi_si video acpi_tad sch_fq_codel nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt nf_log_ipv4 nf_log_common ipt_REJECT nf_reject_ipv4 xt_LOG xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables iptable_filter bpfilter ipmi_devintf ipmi_msghandler msr ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear ast i2c_algo_bit drm_vram_helper crct10dif_pclmul ttm crc32_pclmul ghash_clmulni_intel hid_generic raid1 ixgbe drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt usbhid uas xfrm_algo fb_sys_fops crypto_simd cryptd usb_storage hid ahci drm glue_helper dca nvme
 mdio libahci nvme_core
CR2: ffffffff001e0426
---[ end trace 0be7c89bce6193b6 ]---
RIP: 0010:LZ4_uncompress_unknownOutputSize+0x3e/0x270 [zfs]
nim-odoo commented 1 year ago

Incident 9

This one occurs the most frequently. Only reference I can find is https://github.com/openzfs/zfs/issues/10401

zfs-2.1.4-0ubuntu0.1
zfs-kmod-2.1.4-0ubuntu0.1
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: 0000 [#1] SMP NOPTI
CPU: 15 PID: 15851 Comm: z_wr_int_1 Tainted: P           O      5.15.0-48-generic #54-Ubuntu
Hardware name: To Be Filled By O.E.M. E3C252D4U-2T/E3C252D4U-2T/OVH, BIOS 1.31.OV02 01/28/2022
RIP: 0010:abd_is_gang+0x0/0x10 [zfs]
Code: 00 00 c3 cc cc cc cc 0f 1f 00 8b 07 83 e0 01 c3 cc cc cc cc 66 0f 1f 44 00 00 8b 07 c1 e8 05 83 e0 01 c3 cc cc cc cc 0f 1f 00 <8b> 07 c1 e8 06 83 e0 01 c3 cc cc cc cc 0f 1f 00 0f 1f 
44 00 00 55
RSP: 0018:ffffa17b821cfb80 EFLAGS: 00010206
RAX: 0000000000000000 RBX: 0000000000004000 RCX: ffffffffc0bb0470
RDX: 0000000000004000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffa17b821cfc00 R08: ffffa17b821cfc10 R09: 0000000000000002
R10: 0000000000000001 R11: ffff8bec5d21cda8 R12: 0000000000000000
R13: ffff8be3c5834000 R14: ffffa17b821cfc10 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8bf25fdc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000006f0810005 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
 <TASK>
 ? abd_iterate_func+0x6b/0x100 [zfs]
 ? abd_advance_abd_iter+0x90/0x90 [zfs]
 abd_copy_to_buf_off+0x3c/0x60 [zfs]
 abd_copy_to_buf+0xe/0x20 [zfs]
 arc_buf_fill+0x1a6/0x4d0 [zfs]
 arc_buf_alloc_impl+0x1a2/0x230 [zfs]
 arc_read_done+0x160/0x540 [zfs]
 zio_done+0x3aa/0xe60 [zfs]
 zio_execute+0x8f/0xe0 [zfs]
 taskq_thread+0x21e/0x400 [spl]
 ? wake_up_q+0x90/0x90
 ? zio_execute_stack_check.constprop.0+0x20/0x20 [zfs]
 ? param_set_taskq_kick+0xf0/0xf0 [spl]
 kthread+0x127/0x150
 ? set_kthread_struct+0x50/0x50
 ret_from_fork+0x1f/0x30
 </TASK>
Modules linked in: xt_nat veth nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink xfrm_user xfrm_algo br_netfilter bridge stp llc overlay tls zfs(PO) zunicode(PO) zzstd(O) zlua(O) zav
l(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) isofs ip6t_REJECT nf_reject_ipv6 xt_hl ip6_tables ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_def
rag_ipv4 nft_compat nft_counter nf_tables nfnetlink nls_iso8859_1 ipmi_ssif kvm_intel kvm joydev input_leds acpi_ipmi ipmi_si video acpi_tad dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel ipmi_devintf ipmi_msghandler msr 
efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear ast i2c_algo_bit drm_vram_helper hid_generic
 drm_ttm_helper ttm drm_kms_helper usbhid crct10dif_pclmul syscopyarea crc32_pclmul
 raid1 sysfillrect ghash_clmulni_intel hid sysimgblt fb_sys_fops aesni_intel cec crypto_simd nvme rc_core ahci xhci_pci drm cryptd i40e nvme_core libahci xhci_pci_renesas
CR2: 0000000000000000
---[ end trace 3b74dfdcc49b02ac ]---
RIP: 0010:abd_is_gang+0x0/0x10 [zfs]
nim-odoo commented 1 year ago

Incident 10

This one didn't cause the system to hang, apparently led to a corrupted file. Occurrences: 3

zfs-2.1.4-0ubuntu0.1
zfs-kmod-2.1.4-0ubuntu0.1
VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 2)
VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
PANIC at arc.c:3844:arc_buf_destroy()
PANIC at arc.c:3844:arc_buf_destroy()
Showing stack for process 1162989
CPU: 3 PID: 1162989 Comm: python3 Tainted: P           O      5.15.0-48-generic #54-Ubuntu
Showing stack for process 11026
Hardware name: To Be Filled By O.E.M. E3C252D4U-2T/E3C252D4U-2T/OVH, BIOS 1.31.OV02 01/28/2022
Call Trace:
 <TASK>
 show_stack+0x52/0x5c
 dump_stack_lvl+0x4a/0x63
 dump_stack+0x10/0x16
 spl_dumpstack+0x29/0x2f [spl]
 spl_panic+0xd1/0xe9 [spl]
 ? zfs_ereport_post+0xd7/0x1a0 [zfs]
 ? arc_untransform+0x74/0x90 [zfs]
 ? spl_kmem_cache_free+0xcd/0x140 [spl]
 ? __cond_resched+0x1a/0x50
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 ? aggsum_add+0x1b8/0x1d0 [zfs]
 arc_buf_destroy+0xed/0xf0 [zfs]
 dbuf_destroy+0x31/0x3d0 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 dbuf_rele_and_unlock+0x98/0x540 [zfs]
 ? __cond_resched+0x1a/0x50
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 ? __cond_resched+0x1a/0x50
 ? down_read+0x13/0xa0
 ? wmsum_add+0xe/0x20 [zfs]
 ? dbuf_read+0x1e6/0x5b0 [zfs]
 dnode_rele_and_unlock+0xd2/0xf0 [zfs]
 dnode_rele+0x39/0x50 [zfs]
 dmu_bonus_hold+0x66/0x90 [zfs]
 sa_buf_hold+0xe/0x20 [zfs]
 zfs_zget+0x5b/0x280 [zfs]
 ? zfs_match_find.constprop.0+0x76/0x100 [zfs]
 zfs_dirent_lock+0x424/0x570 [zfs]
 zfs_dirlook+0x91/0x2d0 [zfs]
 zfs_lookup+0x25e/0x410 [zfs]
 zpl_lookup+0xcf/0x230 [zfs]
 ? get_acl+0x18/0x50
 __lookup_slow+0x85/0x150
 walk_component+0x145/0x1c0
 ? path_init+0x2c0/0x3f0
 path_lookupat+0x6e/0x1c0
 ? avl_find+0x68/0xd0 [zavl]
 filename_lookup+0xcf/0x1d0
 ? __check_object_size+0x1d/0x30
 ? strncpy_from_user+0x44/0x150
 ? getname_flags.part.0+0x4c/0x1b0
 user_path_at_empty+0x3f/0x60
 vfs_statx+0x7a/0x130
 __do_sys_newfstatat+0x36/0x70
 ? __secure_computing+0xa9/0x120
 ? syscall_trace_enter.constprop.0+0xa7/0x1c0
 __x64_sys_newfstatat+0x1c/0x30
 do_syscall_64+0x59/0xc0
 ? exit_to_user_mode_loop+0x10d/0x160
 ? exit_to_user_mode_prepare+0x37/0xb0
 ? syscall_exit_to_user_mode+0x27/0x50
 ? __x64_sys_close+0x11/0x50
 ? do_syscall_64+0x69/0xc0
 ? exit_to_user_mode_prepare+0x37/0xb0
 ? syscall_exit_to_user_mode+0x27/0x50
 ? do_syscall_64+0x69/0xc0
 ? exc_page_fault+0x89/0x170
 entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x7f1937dfeeee
Code: 48 89 f2 b9 00 01 00 00 48 89 fe bf 9c ff ff ff e9 07 00 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 41 89 ca b8 06 01 00 00 0f 05 <3d> 00 f0 ff ff 77 0b 31 c0 c3 0f 1f 84 00 00 00 00 00 48 8b 15 09
RSP: 002b:00007fff14cf79a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000106
RAX: ffffffffffffffda RBX: 000055e201b98a40 RCX: 00007f1937dfeeee
RDX: 00007fff14cf7a40 RSI: 00007f1937015860 RDI: 00000000ffffff9c
RBP: 000055e201e2c340 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff14cf7a40
R13: 00007f1937c87f10 R14: 00000000ffffff9c R15: 000055e201b98bf0
 </TASK>
CPU: 2 PID: 11026 Comm: z_wr_int_2 Tainted: P           O      5.15.0-48-generic #54-Ubuntu
Hardware name: To Be Filled By O.E.M. E3C252D4U-2T/E3C252D4U-2T/OVH, BIOS 1.31.OV02 01/28/2022
Call Trace:
 <TASK>
 show_stack+0x52/0x5c
 dump_stack_lvl+0x4a/0x63
 dump_stack+0x10/0x16
 spl_dumpstack+0x29/0x2f [spl]
 spl_panic+0xd1/0xe9 [spl]
 ? spl_kmem_cache_free+0xcd/0x140 [spl]
 ? kmem_cache_free+0x26c/0x290
 ? percpu_counter_add+0xf/0x20 [spl]
 ? percpu_counter_dec+0x10/0x20 [spl]
 ? spl_kmem_free_impl+0x29/0x40 [spl]
 arc_buf_destroy+0xed/0xf0 [zfs]
 dbuf_issue_final_prefetch_done+0x30/0x40 [zfs]
 arc_read_done+0x247/0x540 [zfs]
 zio_done+0x3aa/0xe60 [zfs]
 zio_execute+0x8f/0xe0 [zfs]
 taskq_thread+0x21e/0x400 [spl]
 ? wake_up_q+0x90/0x90
 ? zio_execute_stack_check.constprop.0+0x20/0x20 [zfs]
 ? param_set_taskq_kick+0xf0/0xf0 [spl]
 kthread+0x127/0x150
 ? set_kthread_struct+0x50/0x50
 ret_from_fork+0x1f/0x30
 </TASK>

Few seconds later:

INFO: task z_wr_int_2:11026 blocked for more than 241 seconds.
      Tainted: P           O      5.15.0-48-generic #54-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:z_wr_int_2      state:D stack:    0 pid:11026 ppid:     2 flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0x23d/0x5a0
 schedule+0x4e/0xc0
 spl_panic+0xe7/0xe9 [spl]
 ? spl_kmem_cache_free+0xcd/0x140 [spl]
 ? kmem_cache_free+0x26c/0x290
 ? percpu_counter_add+0xf/0x20 [spl]
 ? percpu_counter_dec+0x10/0x20 [spl]
 ? spl_kmem_free_impl+0x29/0x40 [spl]
 arc_buf_destroy+0xed/0xf0 [zfs]
 dbuf_issue_final_prefetch_done+0x30/0x40 [zfs]
 arc_read_done+0x247/0x540 [zfs]
 zio_done+0x3aa/0xe60 [zfs]
 zio_execute+0x8f/0xe0 [zfs]
 taskq_thread+0x21e/0x400 [spl]
 ? wake_up_q+0x90/0x90
 ? zio_execute_stack_check.constprop.0+0x20/0x20 [zfs]
 ? param_set_taskq_kick+0xf0/0xf0 [spl]
 kthread+0x127/0x150
 ? set_kthread_struct+0x50/0x50
 ret_from_fork+0x1f/0x30
 </TASK>
INFO: task python3:1162989 blocked for more than 241 seconds.
      Tainted: P           O      5.15.0-48-generic #54-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:python3         state:D stack:    0 pid:1162989 ppid:1162987 flags:0x00004222
Call Trace:
 <TASK>
 __schedule+0x23d/0x5a0
 schedule+0x4e/0xc0
 spl_panic+0xe7/0xe9 [spl]
 ? zfs_ereport_post+0xd7/0x1a0 [zfs]
 ? arc_untransform+0x74/0x90 [zfs]
 ? spl_kmem_cache_free+0xcd/0x140 [spl]
 ? __cond_resched+0x1a/0x50
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 ? aggsum_add+0x1b8/0x1d0 [zfs]
 arc_buf_destroy+0xed/0xf0 [zfs]
 dbuf_destroy+0x31/0x3d0 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 dbuf_rele_and_unlock+0x98/0x540 [zfs]
 ? __cond_resched+0x1a/0x50
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 ? __cond_resched+0x1a/0x50
 ? down_read+0x13/0xa0
 ? wmsum_add+0xe/0x20 [zfs]
 ? dbuf_read+0x1e6/0x5b0 [zfs]
 dnode_rele_and_unlock+0xd2/0xf0 [zfs]
 dnode_rele+0x39/0x50 [zfs]
 dmu_bonus_hold+0x66/0x90 [zfs]
 sa_buf_hold+0xe/0x20 [zfs]
 zfs_zget+0x5b/0x280 [zfs]
 ? zfs_match_find.constprop.0+0x76/0x100 [zfs]
 zfs_dirent_lock+0x424/0x570 [zfs]
 zfs_dirlook+0x91/0x2d0 [zfs]
 zfs_lookup+0x25e/0x410 [zfs]
 zpl_lookup+0xcf/0x230 [zfs]
 ? get_acl+0x18/0x50
 __lookup_slow+0x85/0x150
 walk_component+0x145/0x1c0
 ? path_init+0x2c0/0x3f0
 path_lookupat+0x6e/0x1c0
 ? avl_find+0x68/0xd0 [zavl]
 filename_lookup+0xcf/0x1d0
 ? __check_object_size+0x1d/0x30
 ? strncpy_from_user+0x44/0x150
 ? getname_flags.part.0+0x4c/0x1b0
 user_path_at_empty+0x3f/0x60
 vfs_statx+0x7a/0x130
 __do_sys_newfstatat+0x36/0x70
 ? __secure_computing+0xa9/0x120
 ? syscall_trace_enter.constprop.0+0xa7/0x1c0
 __x64_sys_newfstatat+0x1c/0x30
 do_syscall_64+0x59/0xc0
 ? exit_to_user_mode_loop+0x10d/0x160
 ? exit_to_user_mode_prepare+0x37/0xb0
 ? syscall_exit_to_user_mode+0x27/0x50
 ? __x64_sys_close+0x11/0x50
 ? do_syscall_64+0x69/0xc0
 ? exit_to_user_mode_prepare+0x37/0xb0
 ? syscall_exit_to_user_mode+0x27/0x50
 ? do_syscall_64+0x69/0xc0
 ? exc_page_fault+0x89/0x170
 entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x7f1937dfeeee
RSP: 002b:00007fff14cf79a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000106
RAX: ffffffffffffffda RBX: 000055e201b98a40 RCX: 00007f1937dfeeee
RDX: 00007fff14cf7a40 RSI: 00007f1937015860 RDI: 00000000ffffff9c
RBP: 000055e201e2c340 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff14cf7a40
R13: 00007f1937c87f10 R14: 00000000ffffff9c R15: 000055e201b98bf0
 </TASK>
nim-odoo commented 1 year ago

Incident 11

Occurrences: 1

zfs-2.1.4-0ubuntu0.1
zfs-kmod-2.1.4-0ubuntu0.1
RSP: 0018:ffffb551cf997528 EFLAGS: 00010282
RAX: e224efda58e958d5 RBX: 0000000000004000 RCX: e224efda58e978d5
RDX: 0000000000363148 RSI: 0000000000042c20 RDI: 000039e21fc12140
RBP: ffffb551cf997568 R08: ffffd551bf6d2140 R09: 0000000000000002
R10: ffffb551cf997738 R11: 0000000000000000 R12: ffff9b608493e000
R13: 0000000000000000 R14: 0000000000042c20 R15: 0000000000042c20
FS:  00007f6314e8b740(0000) GS:ffff9b6f9fac0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f630f8a5024 CR3: 000000041646a004 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
 <TASK>
 ? spl_kmem_cache_alloc+0x3b/0x100 [spl]
 spl_kmem_cache_alloc+0x3b/0x100 [spl]
 zio_buf_alloc+0x30/0x80 [zfs]
 arc_get_data_buf+0x48/0x60 [zfs]
 arc_buf_alloc_impl+0xc9/0x230 [zfs]
 arc_read+0x106b/0x15c0 [zfs]
 ? dbuf_rele_and_unlock+0x540/0x540 [zfs]
 ? __cond_resched+0x1a/0x50
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 dbuf_read_impl.constprop.0+0x2f2/0x490 [zfs]
 dbuf_read+0x1ba/0x5b0 [zfs]
 dnode_hold_impl+0x113/0xb00 [zfs]
 ? __cond_resched+0x1a/0x50
 ? mutex_lock+0x13/0x50
 ? zfs_znode_hold_enter+0x123/0x170 [zfs]
 dnode_hold+0x1b/0x30 [zfs]
 dmu_bonus_hold+0x38/0x90 [zfs]
 sa_buf_hold+0xe/0x20 [zfs]
 zfs_zget+0x5b/0x280 [zfs]
 ? zfs_match_find.constprop.0+0x76/0x100 [zfs]
 zfs_dirent_lock+0x424/0x570 [zfs]
 zfs_dirlook+0x91/0x2d0 [zfs]
 zfs_lookup+0x25e/0x410 [zfs]
 zpl_lookup+0xcf/0x230 [zfs]
 lookup_open.isra.0+0x2c1/0x6f0
 open_last_lookups+0x198/0x3f0
 ? path_init+0x2c0/0x3f0
 path_openat+0x8d/0x2b0
 ? lru_cache_add_inactive_or_unevictable+0x2e/0xe0
 ? page_add_new_anon_rmap+0x69/0x100
 do_filp_open+0xb2/0x160
 ? __check_object_size+0x1d/0x30
 ? alloc_fd+0x53/0x180
 do_sys_openat2+0x9f/0x160
 __x64_sys_openat+0x55/0x90
 do_syscall_64+0x59/0xc0
 ? irqentry_exit+0x1d/0x30
 ? exc_page_fault+0x89/0x170
 entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x7f631516a4fc
Code: 24 18 31 c0 41 83 e2 40 75 44 89 f0 25 00 00 41 00 3d 00 00 41 00 74 36 44 89 c2 4c 89 ce bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 44 48 8b 4c 24 18 64 48 33 0c 25 28 00 00 00
RSP: 002b:00007ffcc5ba4050 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 00007f6314db6770 RCX: 00007f631516a4fc
RDX: 0000000000090800 RSI: 00007f630f8a5ad0 RDI: 00000000ffffff9c
RBP: 00000000018092f0 R08: 0000000000090800 R09: 00007f630f8a5ad0
R10: 0000000000000000 R11: 0000000000000287 R12: 00007f630f8a3500
R13: 00007f630f8a5ad0 R14: 0000000002304db0 R15: 00007f6314db6770
 </TASK>
Modules linked in: xt_nat veth nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink xfrm_user xfrm_algo br_netfilter bridge stp llc overlay tls zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) isofs ip6t_REJECT nf_reject_ipv6 xt_hl ip6_tables ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nft_counter nf_tables nfnetlink nls_iso8859_1 ipmi_ssif kvm_intel kvm joydev input_leds acpi_ipmi ipmi_si video acpi_tad dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel ipmi_devintf ipmi_msghandler msr efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear hid_generic usbhid hid ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt
 fb_sys_fops crct10dif_pclmul crc32_pclmul cec ghash_clmulni_intel raid1 rc_core aesni_intel nvme ahci crypto_simd xhci_pci i40e cryptd drm libahci nvme_core xhci_pci_renesas
---[ end trace 8be0eb0694452b25 ]---
RIP: 0010:kmem_cache_alloc+0xfd/0x2f0
Code: 8b 50 08 49 8b 00 49 83 78 10 00 48 89 45 c8 0f 84 96 01 00 00 48 85 c0 0f 84 8d 01 00 00 41 8b 4c 24 28 49 8b 3c 24 48 01 c1 <48> 8b 19 48 89 ce 49 33 9c 24 b8 00 00 00 48 8d 4a 01 48 0f ce 48
RSP: 0018:ffffb551e7303770 EFLAGS: 00010282
RAX: e224efda58e958d5 RBX: 0000000000004000 RCX: e224efda58e978d5
RDX: 0000000000363148 RSI: 0000000000042c20 RDI: 000039e21fc12140
RBP: ffffb551e73037b0 R08: ffffd551bf6d2140 R09: 0000000000000001
R10: ffff9b67b825d420 R11: 0000000000000000 R12: ffff9b608493e000
R13: 0000000000000000 R14: 0000000000042c20 R15: 0000000000042c20
FS:  00007fab7b881a40(0000) GS:ffff9b6f9fac0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fab39f27000 CR3: 0000000a1bc12003 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
 <TASK>
 ? spl_kmem_cache_alloc+0x3b/0x100 [spl]
 spl_kmem_cache_alloc+0x3b/0x100 [spl]
 zio_data_buf_alloc+0x30/0x80 [zfs]
 arc_get_data_buf+0x38/0x60 [zfs]
 arc_buf_alloc_impl+0xc9/0x230 [zfs]
 arc_alloc_buf+0x74/0xe0 [zfs]
 dbuf_fix_old_data.constprop.0+0x289/0x2a0 [zfs]
 ? spl_kmem_alloc_impl+0x80/0xd0 [spl]
 dbuf_dirty+0x5fe/0x6e0 [zfs]
 ? dbuf_read+0x366/0x5b0 [zfs]
 dmu_buf_will_dirty_impl+0xbb/0x120 [zfs]
 dmu_buf_will_dirty+0x16/0x20 [zfs]
 dbuf_new_size+0x64/0x1a0 [zfs]
 dnode_set_blksz+0x2a5/0x2d0 [zfs]
 dmu_object_set_blocksize+0x56/0xa0 [zfs]
 zfs_grow_blocksize+0x5a/0xa0 [zfs]
 zfs_write+0x8f7/0xdb0 [zfs]
 ? shmem_getpage_gfp+0xcf/0x860
 ? generic_write_checks+0x65/0xc0
 zpl_iter_write+0xe7/0x130 [zfs]
 new_sync_write+0x114/0x1b0
 vfs_write+0x1d5/0x270
 __x64_sys_pwrite64+0x96/0xc0
 do_syscall_64+0x59/0xc0
 ? handle_mm_fault+0xd8/0x2c0
 ? do_user_addr_fault+0x1e7/0x670
 ? exit_to_user_mode_prepare+0x37/0xb0
 ? irqentry_exit_to_user_mode+0x9/0x20
 ? irqentry_exit+0x1d/0x30
 ? exc_page_fault+0x89/0x170
 entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x7fab7e2429aa
Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb ba 0f 1f 00 f3 0f 1e fa 49 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 12 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5e c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
RSP: 002b:00007ffdee58bc48 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
RAX: ffffffffffffffda RBX: 0000558b75fcfb68 RCX: 00007fab7e2429aa
RDX: 0000000000002000 RSI: 00007fab39f26200 RDI: 00000000000000fa
RBP: 00007ffdee58bca0 R08: 000000000a00000b R09: 0000558b75fccde0
R10: 0000000000004000 R11: 0000000000000246 R12: 0000000000004000
R13: 00007fab39f26200 R14: 00007fab7b881940 R15: 0000000000002000
 </TASK>
Modules linked in: xt_nat veth nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink xfrm_user xfrm_algo br_netfilter bridge stp llc overlay tls zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) isofs ip6t_REJECT nf_reject_ipv6 xt_hl ip6_tables ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nft_counter nf_tables nfnetlink nls_iso8859_1 ipmi_ssif kvm_intel kvm joydev input_leds acpi_ipmi ipmi_si video acpi_tad dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel ipmi_devintf ipmi_msghandler msr efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear hid_generic usbhid hid ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt
 fb_sys_fops
 crct10dif_pclmul crc32_pclmul cec ghash_clmulni_intel raid1 rc_core aesni_intel nvme ahci crypto_simd xhci_pci i40e cryptd drm libahci nvme_core xhci_pci_renesas
---[ end trace 8be0eb0694452b26 ]---
RIP: 0010:kmem_cache_alloc+0xfd/0x2f0
Code: 8b 50 08 49 8b 00 49 83 78 10 00 48 89 45 c8 0f 84 96 01 00 00 48 85 c0 0f 84 8d 01 00 00 41 8b 4c 24 28 49 8b 3c 24 48 01 c1 <48> 8b 19 48 89 ce 49 33 9c 24 b8 00 00 00 48 8d 4a 01 48 0f ce 48
RSP: 0018:ffffb551cf997528 EFLAGS: 00010282
RAX: e224efda58e958d5 RBX: 0000000000004000 RCX: e224efda58e978d5
RDX: 0000000000363148 RSI: 0000000000042c20 RDI: 000039e21fc12140
RBP: ffffb551cf997568 R08: ffffd551bf6d2140 R09: 0000000000000002
R10: ffffb551cf997738 R11: 0000000000000000 R12: ffff9b608493e000
R13: 0000000000000000 R14: 0000000000042c20 R15: 0000000000042c20
FS:  00007fab7b881a40(0000) GS:ffff9b6f9fac0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fab39f27000 CR3: 0000000a1bc12003 CR4: 0000000000770ee0
PKRU: 55555554
general protection fault, probably for non-canonical address 0xe224efda58e978d5: 0000 [#3] SMP NOPTI
CPU: 3 PID: 55587 Comm: python3 Tainted: P      D    O      5.15.0-48-generic #54-Ubuntu
Hardware name: To Be Filled By O.E.M. E3C252D4U-2T/E3C252D4U-2T/OVH, BIOS 1.31.OV02 01/28/2022
RIP: 0010:kmem_cache_alloc+0xfd/0x2f0
Code: 8b 50 08 49 8b 00 49 83 78 10 00 48 89 45 c8 0f 84 96 01 00 00 48 85 c0 0f 84 8d 01 00 00 41 8b 4c 24 28 49 8b 3c 24 48 01 c1 <48> 8b 19 48 89 ce 49 33 9c 24 b8 00 00 00 48 8d 4a 01 48 0f ce 48
RSP: 0018:ffffb551c2993420 EFLAGS: 00010282
RAX: e224efda58e958d5 RBX: 0000000000004000 RCX: e224efda58e978d5
RDX: 0000000000363148 RSI: 0000000000042c20 RDI: 000039e21fc12140
RBP: ffffb551c2993460 R08: ffffd551bf6d2140 R09: 0000000000000002
R10: ffffb551c2993630 R11: 0000000000000000 R12: ffff9b608493e000
R13: 0000000000000000 R14: 0000000000042c20 R15: 0000000000042c20
FS:  00007fc8c09ff640(0000) GS:ffff9b6f9fac0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fab7e1b75a0 CR3: 00000003670d6001 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
 <TASK>
 ? spl_kmem_cache_alloc+0x3b/0x100 [spl]
 spl_kmem_cache_alloc+0x3b/0x100 [spl]
 zio_buf_alloc+0x30/0x80 [zfs]
 arc_get_data_buf+0x48/0x60 [zfs]
 arc_buf_alloc_impl+0xc9/0x230 [zfs]
 arc_read+0x106b/0x15c0 [zfs]
 ? dbuf_rele_and_unlock+0x540/0x540 [zfs]
 ? __cond_resched+0x1a/0x50
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 dbuf_read_impl.constprop.0+0x2f2/0x490 [zfs]
 dbuf_read+0x1ba/0x5b0 [zfs]
 dnode_hold_impl+0x113/0xb00 [zfs]
 ? __cond_resched+0x1a/0x50
 ? mutex_lock+0x13/0x50
 ? zfs_znode_hold_enter+0x123/0x170 [zfs]
 dnode_hold+0x1b/0x30 [zfs]
 dmu_bonus_hold+0x38/0x90 [zfs]
 sa_buf_hold+0xe/0x20 [zfs]
 zfs_zget+0x5b/0x280 [zfs]
 ? zfs_match_find.constprop.0+0x76/0x100 [zfs]
 zfs_dirent_lock+0x424/0x570 [zfs]
 zfs_dirlook+0x91/0x2d0 [zfs]
 zfs_lookup+0x25e/0x410 [zfs]
 zpl_lookup+0xcf/0x230 [zfs]
 __lookup_slow+0x85/0x150
 walk_component+0x145/0x1c0
 link_path_walk.part.0.constprop.0+0x23f/0x3a0
 ? path_init+0x2c0/0x3f0
 path_lookupat+0x3e/0x1c0
 filename_lookup+0xcf/0x1d0
 ? __check_object_size+0x1d/0x30
 ? strncpy_from_user+0x44/0x150
 ? getname_flags.part.0+0x4c/0x1b0
 user_path_at_empty+0x3f/0x60
 vfs_statx+0x7a/0x130
 __do_sys_newfstatat+0x36/0x70
 ? __secure_computing+0xa9/0x120
 ? syscall_trace_enter.constprop.0+0xa7/0x1c0
 __x64_sys_newfstatat+0x1c/0x30
 do_syscall_64+0x59/0xc0
 ? exit_to_user_mode_prepare+0x37/0xb0
 ? syscall_exit_to_user_mode+0x27/0x50
 ? __x64_sys_write+0x19/0x20
 ? do_syscall_64+0x69/0xc0
 entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x7fc8cbf13eee
Code: 48 89 f2 b9 00 01 00 00 48 89 fe bf 9c ff ff ff e9 07 00 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 41 89 ca b8 06 01 00 00 0f 05 <3d> 00 f0 ff ff 77 0b 31 c0 c3 0f 1f 84 00 00 00 00 00 
48 8b 15 09
RSP: 002b:00007fc8c09fbdb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000106
RAX: ffffffffffffffda RBX: 00005588a8277a40 RCX: 00007fc8cbf13eee
RDX: 00007fc8c09fbe50 RSI: 00007fc8baf3dc70 RDI: 00000000ffffff9c
RBP: 00005588a9e1c430 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fc8c09fbe50
R13: 00007fc8cbd9bfb0 R14: 00000000ffffff9c R15: 00005588a8277bf0
 </TASK>
Modules linked in: xt_nat veth nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink xfrm_user xfrm_algo br_netfilter bridge stp llc overlay tls zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) isofs ip6t_REJECT nf_reject_ipv6 xt_hl ip6_tables ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nft_counter nf_tables nfnetlink nls_iso8859_1 ipmi_ssif kvm_intel kvm joydev input_leds acpi_ipmi ipmi_si video acpi_tad dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel ipmi_devintf ipmi_msghandler msr efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear hid_generic usbhid hid ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt
 fb_sys_fops crct10dif_pclmul crc32_pclmul cec ghash_clmulni_intel raid1 rc_core aesni_intel nvme ahci crypto_simd xhci_pci i40e cryptd drm libahci nvme_core xhci_pci_renesas
---[ end trace 8be0eb0694452b27 ]---
RIP: 0010:kmem_cache_alloc+0xfd/0x2f0
Code: 8b 50 08 49 8b 00 49 83 78 10 00 48 89 45 c8 0f 84 96 01 00 00 48 85 c0 0f 84 8d 01 00 00 41 8b 4c 24 28 49 8b 3c 24 48 01 c1 <48> 8b 19 48 89 ce 49 33 9c 24 b8 00 00 00 48 8d 4a 01 48 0f ce 48
RSP: 0018:ffffb551cf997528 EFLAGS: 00010282
RAX: e224efda58e958d5 RBX: 0000000000004000 RCX: e224efda58e978d5
RDX: 0000000000363148 RSI: 0000000000042c20 RDI: 000039e21fc12140
RBP: ffffb551cf997568 R08: ffffd551bf6d2140 R09: 0000000000000002
R10: ffffb551cf997738 R11: 0000000000000000 R12: ffff9b608493e000
R13: 0000000000000000 R14: 0000000000042c20 R15: 0000000000042c20
FS:  00007fc8c09ff640(0000) GS:ffff9b6f9fac0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fab7e1b75a0 CR3: 00000003670d6001 CR4: 0000000000770ee0
PKRU: 55555554

Moreover:

status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
nim-odoo commented 1 year ago

Incident 12

Happens more and more frequently since layout change.

zfs-0.8.3-1ubuntu12.12
zfs-kmod-0.8.3-1ubuntu12.14
INFO: task spl_delay_taskq:936 blocked for more than 120 seconds.
      Tainted: P           O      5.4.0-126-generic #142-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
spl_delay_taskq D    0   936      2 0x80004000
Call Trace:
 __schedule+0x2e3/0x740
 ? enqueue_entity+0x128/0x660
 schedule+0x42/0xb0
 schedule_timeout+0x10e/0x160
 wait_for_completion+0xb1/0x120
 ? wake_up_q+0x70/0x70
 call_usermodehelper_exec+0x148/0x180
 call_usermodehelper+0x93/0xb0
 zfsctl_snapshot_unmount+0xc8/0x190 [zfs]
 snapentry_expire+0x3d/0xc0 [zfs]
 taskq_thread+0x245/0x430 [spl]
 ? __switch_to_asm+0x40/0x70
 ? wake_up_q+0x70/0x70
 kthread+0x104/0x140
 ? task_done+0x90/0x90 [spl]
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40
INFO: task saas_exporter:381942 blocked for more than 121 seconds.
      Tainted: P           O      5.4.0-126-generic #142-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
saas_exporter   D    0 381942 381939 0x00004000
Call Trace:
 __schedule+0x2e3/0x740
 schedule+0x42/0xb0
 taskq_wait_id+0x6e/0xb0 [spl]
 ? __wake_up_pollfree+0x40/0x40
 taskq_cancel_id+0xd8/0x100 [spl]
 zfsctl_snapshot_unmount_cancel+0x1e/0x40 [zfs]
 zfsctl_snapshot_unmount_delay+0x48/0xa0 [zfs]
 zpl_revalidate+0xbd/0x120 [zfs]
 lookup_fast+0x281/0x300
 walk_component+0x48/0x360
 ? link_path_walk.part.0+0x2a2/0x550
 path_lookupat.isra.0+0x80/0x230
 ? atomic_dec+0xd/0x20 [spl]
 filename_lookup+0xae/0x170
 ? __check_object_size+0x13f/0x150
 ? strncpy_from_user+0x4c/0x150
 user_path_at_empty+0x3a/0x50
 vfs_statx+0x7d/0xe0
 __do_sys_newlstat+0x3e/0x80
 __x64_sys_newlstat+0x16/0x20
 do_syscall_64+0x57/0x190
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f0801bb45aa
Code: Bad RIP value.
RSP: 002b:00007ffd1e07bf78 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
RAX: ffffffffffffffda RBX: 00007f07fed6b7c0 RCX: 00007f0801bb45aa
RDX: 00007ffd1e07bff0 RSI: 00007ffd1e07bff0 RDI: 00007f07fd54b110
RBP: 00000000ffffff9c R08: 0000000000000001 R09: 0000000000000038
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000001a71600
R13: 00007ffd1e07bff0 R14: 00007f07fed6b7c0 R15: 0000000001d421f8
INFO: task umount:381977 blocked for more than 121 seconds.
      Tainted: P           O      5.4.0-126-generic #142-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount          D    0 381977 381866 0x00004000
Call Trace:
 __schedule+0x2e3/0x740
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 schedule+0x42/0xb0
 rwsem_down_write_slowpath+0x244/0x4d0
 ? __radix_tree_delete+0x8f/0xf0
 down_write+0x41/0x50
 zfsctl_destroy+0x5a/0xc0 [zfs]
 zfs_preumount+0x2f/0x70 [zfs]
 zpl_kill_sb+0x13/0x20 [zfs]
 deactivate_locked_super+0x3b/0x80
 deactivate_super+0x3e/0x50
 cleanup_mnt+0x109/0x160
 __cleanup_mnt+0x12/0x20
 task_work_run+0x8f/0xb0
 exit_to_usermode_loop+0x131/0x160
 do_syscall_64+0x163/0x190
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7ff72980d19b
Code: Bad RIP value.
RSP: 002b:00007ffdc55795b8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007ff72993f204 RCX: 00007ff72980d19b
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055b3f873ae30
RBP: 000055b3f8720960 R08: 0000000000000000 R09: 00007ffdc5578360
R10: 00007ff72992b379 R11: 0000000000000246 R12: 000055b3f873ae30
R13: 0000000000000000 R14: 000055b3f8720a58 R15: 000055b3f8720b70
INFO: task spl_delay_taskq:381978 blocked for more than 121 seconds.
      Tainted: P           O      5.4.0-126-generic #142-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
spl_delay_taskq D    0 381978      2 0x80004000
Call Trace:
 __schedule+0x2e3/0x740
 schedule+0x42/0xb0
 schedule_timeout+0x10e/0x160
 wait_for_completion+0xb1/0x120
 ? wake_up_q+0x70/0x70
 call_usermodehelper_exec+0x148/0x180
 call_usermodehelper+0x93/0xb0
 zfsctl_snapshot_unmount+0xc8/0x190 [zfs]
 snapentry_expire+0x3d/0xc0 [zfs]
 taskq_thread+0x245/0x430 [spl]
 ? wake_up_q+0x70/0x70
 kthread+0x104/0x140
 ? task_done+0x90/0x90 [spl]
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40
INFO: task spl_delay_taskq:381979 blocked for more than 121 seconds.
      Tainted: P           O      5.4.0-126-generic #142-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
spl_delay_taskq D    0 381979      2 0x80004000
Call Trace:
 __schedule+0x2e3/0x740
 schedule+0x42/0xb0
 rwsem_down_read_slowpath+0x16c/0x4a0
 ? autoremove_wake_function+0x12/0x40
 down_read+0x85/0xa0
 zfsctl_snapshot_unmount+0x72/0x190 [zfs]
 snapentry_expire+0x3d/0xc0 [zfs]
 taskq_thread+0x245/0x430 [spl]
 ? wake_up_q+0x70/0x70
 kthread+0x104/0x140
 ? task_done+0x90/0x90 [spl]
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40
INFO: task spl_delay_taskq:381981 blocked for more than 122 seconds.
      Tainted: P           O      5.4.0-126-generic #142-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
spl_delay_taskq D    0 381981      2 0x80004000
Call Trace:
 __schedule+0x2e3/0x740
 schedule+0x42/0xb0
 schedule_timeout+0x10e/0x160
 wait_for_completion+0xb1/0x120
 ? wake_up_q+0x70/0x70
 call_usermodehelper_exec+0x148/0x180
 call_usermodehelper+0x93/0xb0
 zfsctl_snapshot_unmount+0xc8/0x190 [zfs]
 snapentry_expire+0x3d/0xc0 [zfs]
 taskq_thread+0x245/0x430 [spl]
 ? wake_up_q+0x70/0x70
 kthread+0x104/0x140
 ? task_done+0x90/0x90 [spl]
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40
INFO: task spl_delay_taskq:381984 blocked for more than 122 seconds.
      Tainted: P           O      5.4.0-126-generic #142-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
spl_delay_taskq D    0 381984      2 0x80004000
Call Trace:
 __schedule+0x2e3/0x740
 ? wait_for_completion+0xbc/0x120
 ? wake_up_q+0x70/0x70
 schedule+0x42/0xb0
 rwsem_down_read_slowpath+0x16c/0x4a0
 down_read+0x85/0xa0
 snapentry_expire+0x51/0xc0 [zfs]
 taskq_thread+0x245/0x430 [spl]
 ? wake_up_q+0x70/0x70
 kthread+0x104/0x140
 ? task_done+0x90/0x90 [spl]
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40
INFO: task spl_delay_taskq:381986 blocked for more than 122 seconds.
      Tainted: P           O      5.4.0-126-generic #142-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
spl_delay_taskq D    0 381986      2 0x80004000
Call Trace:
 __schedule+0x2e3/0x740
 schedule+0x42/0xb0
 schedule_timeout+0x10e/0x160
 wait_for_completion+0xb1/0x120
 ? wake_up_q+0x70/0x70
 call_usermodehelper_exec+0x148/0x180
 call_usermodehelper+0x93/0xb0
 zfsctl_snapshot_unmount+0xc8/0x190 [zfs]
 snapentry_expire+0x3d/0xc0 [zfs]
 taskq_thread+0x245/0x430 [spl]
 ? __switch_to_asm+0x40/0x70
 ? wake_up_q+0x70/0x70
 kthread+0x104/0x140
 ? task_done+0x90/0x90 [spl]
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40
INFO: task spl_delay_taskq:381989 blocked for more than 122 seconds.
      Tainted: P           O      5.4.0-126-generic #142-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
spl_delay_taskq D    0 381989      2 0x80004000
Call Trace:
 __schedule+0x2e3/0x740
 schedule+0x42/0xb0
 schedule_timeout+0x10e/0x160
 wait_for_completion+0xb1/0x120
 ? wake_up_q+0x70/0x70
 call_usermodehelper_exec+0x148/0x180
 call_usermodehelper+0x93/0xb0
 zfsctl_snapshot_unmount+0xc8/0x190 [zfs]
 snapentry_expire+0x3d/0xc0 [zfs]
 taskq_thread+0x245/0x430 [spl]
 ? __switch_to_asm+0x40/0x70
 ? wake_up_q+0x70/0x70
 kthread+0x104/0x140
 ? task_done+0x90/0x90 [spl]
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40
INFO: task spl_delay_taskq:381992 blocked for more than 123 seconds.
      Tainted: P           O      5.4.0-126-generic #142-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
spl_delay_taskq D    0 381992      2 0x80004000
Call Trace:
 __schedule+0x2e3/0x740
 schedule+0x42/0xb0
 schedule_timeout+0x10e/0x160
 wait_for_completion+0xb1/0x120
 ? wake_up_q+0x70/0x70
 call_usermodehelper_exec+0x148/0x180
 call_usermodehelper+0x93/0xb0
 zfsctl_snapshot_unmount+0xc8/0x190 [zfs]
 snapentry_expire+0x3d/0xc0 [zfs]
 taskq_thread+0x245/0x430 [spl]
 ? __switch_to_asm+0x40/0x70
 ? wake_up_q+0x70/0x70
 kthread+0x104/0x140
 ? task_done+0x90/0x90 [spl]
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40
ryao commented 1 year ago

Incident 10

This one didn't cause the system to hang, apparently led to a corrupted file.

zfs-2.1.4-0ubuntu0.1
zfs-kmod-2.1.4-0ubuntu0.1
VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 2)
VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
PANIC at arc.c:3844:arc_buf_destroy()
PANIC at arc.c:3844:arc_buf_destroy()
Showing stack for process 1162989
CPU: 3 PID: 1162989 Comm: python3 Tainted: P           O      5.15.0-48-generic #54-Ubuntu
Showing stack for process 11026
Hardware name: To Be Filled By O.E.M. E3C252D4U-2T/E3C252D4U-2T/OVH, BIOS 1.31.OV02 01/28/2022
Call Trace:
 <TASK>
 show_stack+0x52/0x5c
 dump_stack_lvl+0x4a/0x63
 dump_stack+0x10/0x16
 spl_dumpstack+0x29/0x2f [spl]
 spl_panic+0xd1/0xe9 [spl]
 ? zfs_ereport_post+0xd7/0x1a0 [zfs]
 ? arc_untransform+0x74/0x90 [zfs]
 ? spl_kmem_cache_free+0xcd/0x140 [spl]
 ? __cond_resched+0x1a/0x50
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 ? aggsum_add+0x1b8/0x1d0 [zfs]
 arc_buf_destroy+0xed/0xf0 [zfs]
 dbuf_destroy+0x31/0x3d0 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 dbuf_rele_and_unlock+0x98/0x540 [zfs]
 ? __cond_resched+0x1a/0x50
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 ? __cond_resched+0x1a/0x50
 ? down_read+0x13/0xa0
 ? wmsum_add+0xe/0x20 [zfs]
 ? dbuf_read+0x1e6/0x5b0 [zfs]
 dnode_rele_and_unlock+0xd2/0xf0 [zfs]
 dnode_rele+0x39/0x50 [zfs]
 dmu_bonus_hold+0x66/0x90 [zfs]
 sa_buf_hold+0xe/0x20 [zfs]
 zfs_zget+0x5b/0x280 [zfs]
 ? zfs_match_find.constprop.0+0x76/0x100 [zfs]
 zfs_dirent_lock+0x424/0x570 [zfs]
 zfs_dirlook+0x91/0x2d0 [zfs]
 zfs_lookup+0x25e/0x410 [zfs]
 zpl_lookup+0xcf/0x230 [zfs]
 ? get_acl+0x18/0x50
 __lookup_slow+0x85/0x150
 walk_component+0x145/0x1c0
 ? path_init+0x2c0/0x3f0
 path_lookupat+0x6e/0x1c0
 ? avl_find+0x68/0xd0 [zavl]
 filename_lookup+0xcf/0x1d0
 ? __check_object_size+0x1d/0x30
 ? strncpy_from_user+0x44/0x150
 ? getname_flags.part.0+0x4c/0x1b0
 user_path_at_empty+0x3f/0x60
 vfs_statx+0x7a/0x130
 __do_sys_newfstatat+0x36/0x70
 ? __secure_computing+0xa9/0x120
 ? syscall_trace_enter.constprop.0+0xa7/0x1c0
 __x64_sys_newfstatat+0x1c/0x30
 do_syscall_64+0x59/0xc0
 ? exit_to_user_mode_loop+0x10d/0x160
 ? exit_to_user_mode_prepare+0x37/0xb0
 ? syscall_exit_to_user_mode+0x27/0x50
 ? __x64_sys_close+0x11/0x50
 ? do_syscall_64+0x69/0xc0
 ? exit_to_user_mode_prepare+0x37/0xb0
 ? syscall_exit_to_user_mode+0x27/0x50
 ? do_syscall_64+0x69/0xc0
 ? exc_page_fault+0x89/0x170
 entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x7f1937dfeeee
Code: 48 89 f2 b9 00 01 00 00 48 89 fe bf 9c ff ff ff e9 07 00 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 41 89 ca b8 06 01 00 00 0f 05 <3d> 00 f0 ff ff 77 0b 31 c0 c3 0f 1f 84 00 00 00 00 00 48 8b 15 09
RSP: 002b:00007fff14cf79a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000106
RAX: ffffffffffffffda RBX: 000055e201b98a40 RCX: 00007f1937dfeeee
RDX: 00007fff14cf7a40 RSI: 00007f1937015860 RDI: 00000000ffffff9c
RBP: 000055e201e2c340 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff14cf7a40
R13: 00007f1937c87f10 R14: 00000000ffffff9c R15: 000055e201b98bf0
 </TASK>
CPU: 2 PID: 11026 Comm: z_wr_int_2 Tainted: P           O      5.15.0-48-generic #54-Ubuntu
Hardware name: To Be Filled By O.E.M. E3C252D4U-2T/E3C252D4U-2T/OVH, BIOS 1.31.OV02 01/28/2022
Call Trace:
 <TASK>
 show_stack+0x52/0x5c
 dump_stack_lvl+0x4a/0x63
 dump_stack+0x10/0x16
 spl_dumpstack+0x29/0x2f [spl]
 spl_panic+0xd1/0xe9 [spl]
 ? spl_kmem_cache_free+0xcd/0x140 [spl]
 ? kmem_cache_free+0x26c/0x290
 ? percpu_counter_add+0xf/0x20 [spl]
 ? percpu_counter_dec+0x10/0x20 [spl]
 ? spl_kmem_free_impl+0x29/0x40 [spl]
 arc_buf_destroy+0xed/0xf0 [zfs]
 dbuf_issue_final_prefetch_done+0x30/0x40 [zfs]
 arc_read_done+0x247/0x540 [zfs]
 zio_done+0x3aa/0xe60 [zfs]
 zio_execute+0x8f/0xe0 [zfs]
 taskq_thread+0x21e/0x400 [spl]
 ? wake_up_q+0x90/0x90
 ? zio_execute_stack_check.constprop.0+0x20/0x20 [zfs]
 ? param_set_taskq_kick+0xf0/0xf0 [spl]
 kthread+0x127/0x150
 ? set_kthread_struct+0x50/0x50
 ret_from_fork+0x1f/0x30
 </TASK>

This should be fixed by 13f2b8fb92c23090b9f6e701c8471aef6b8e917b in 2.1.6.

That said, many (all?) of these backtraces represent different issues. They probably should be filed as such.

nim-odoo commented 1 year ago

This should be fixed by https://github.com/openzfs/zfs/commit/13f2b8fb92c23090b9f6e701c8471aef6b8e917b in 2.1.6.

Good to know :+1:

That said, many (all?) of these backtraces represent different issues. They probably should be filed as such.

I'm aware of this, at this point I'm just using this issue to collect all the call traces we have with ZFS for further analysis.

nim-odoo commented 1 year ago

Incident 13

Occurrences: 1

zfs-2.1.4-0ubuntu0.1
zfs-kmod-2.1.4-0ubuntu0.1
PANIC at zfs_znode.c:339:zfs_znode_sa_init()
Showing stack for process 2833053
CPU: 7 PID: 2833053 Comm: python3 Tainted: P           O      5.15.0-48-generic #54-Ubuntu
Hardware name: To Be Filled By O.E.M. E3C252D4U-2T/E3C252D4U-2T/OVH, BIOS 1.31.OV02 01/28/2022
Call Trace:
 <TASK>
 show_stack+0x52/0x5c
 dump_stack_lvl+0x4a/0x63
 dump_stack+0x10/0x16
 spl_dumpstack+0x29/0x2f [spl]
 spl_panic+0xd1/0xe9 [spl]
 ? __cond_resched+0x1a/0x50
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 ? dmu_buf_replace_user+0x65/0x80 [zfs]
 ? dmu_buf_set_user+0x13/0x20 [zfs]
 ? dmu_buf_set_user_ie+0x15/0x20 [zfs]
 zfs_znode_sa_init+0xdd/0xf0 [zfs]
 zfs_znode_alloc+0x101/0x5d0 [zfs]
 ? arc_space_consume+0x54/0x100 [zfs]
 ? dbuf_read_bonus+0x102/0x110 [zfs]
 ? RW_WRITE_HELD+0x9/0x30 [zfs]
 ? dmu_buf_unlock_parent+0x65/0xa0 [zfs]
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? dbuf_read_impl.constprop.0+0x352/0x490 [zfs]
 ? dbuf_rele_and_unlock+0x134/0x540 [zfs]
 ? __cond_resched+0x1a/0x50
 ? __cond_resched+0x1a/0x50
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 ? __cond_resched+0x1a/0x50
 ? down_read+0x13/0xa0
 ? wmsum_add+0xe/0x20 [zfs]
 ? dbuf_read+0x1e6/0x5b0 [zfs]
 ? __cond_resched+0x1a/0x50
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 ? dnode_rele_and_unlock+0x69/0xf0 [zfs]
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? RW_WRITE_HELD+0x9/0x30 [zfs]
 ? dmu_object_info_from_dnode+0x92/0xa0 [zfs]
 zfs_zget+0x239/0x280 [zfs]
 zfs_dirent_lock+0x424/0x570 [zfs]
 zfs_dirlook+0x91/0x2d0 [zfs]
 zfs_lookup+0x25e/0x410 [zfs]
 zpl_lookup+0xcf/0x230 [zfs]
 ? get_acl+0x18/0x50
 __lookup_slow+0x85/0x150
 walk_component+0x145/0x1c0
 ? path_init+0x2c0/0x3f0
 path_lookupat+0x6e/0x1c0
 ? rrm_exit+0x4c/0x80 [zfs]
 filename_lookup+0xcf/0x1d0
 ? __check_object_size+0x1d/0x30
 ? strncpy_from_user+0x44/0x150
 ? getname_flags.part.0+0x4c/0x1b0
 user_path_at_empty+0x3f/0x60
 vfs_statx+0x7a/0x130
 __do_sys_newfstatat+0x36/0x70
 ? __secure_computing+0xa9/0x120
 ? syscall_trace_enter.constprop.0+0xa7/0x1c0
 __x64_sys_newfstatat+0x1c/0x30
 do_syscall_64+0x59/0xc0
 entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x7f250e35beee
Code: 48 89 f2 b9 00 01 00 00 48 89 fe bf 9c ff ff ff e9 07 00 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 41 89 ca b8 06 01 00 00 0f 05 <3d> 00 f0 ff ff 77 0b 31 c0 c3 0f 1f 84 00 00 00 00 00 48 8b 15 09
RSP: 002b:00007ffdb8772198 EFLAGS: 00000246 ORIG_RAX: 0000000000000106
RAX: ffffffffffffffda RBX: 0000560e8093da40 RCX: 00007f250e35beee
RDX: 00007ffdb8772230 RSI: 00007f2502d7e690 RDI: 00000000ffffff9c
RBP: 0000560e821453c0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdb8772230
R13: 00007f250e1e8130 R14: 00000000ffffff9c R15: 0000560e8093dbf0
 </TASK>
nim-odoo commented 1 year ago

Incident 14

Occurrence: 3 Did not cause hanging.

zfs-2.1.6-2.1
zfs-kmod-2.1.6-2.1
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: 0000 [#1] SMP NOPTI
CPU: 5 PID: 2908962 Comm: z_wr_int_2 Tainted: P           OE     5.15.0-52-generic #58-Ubuntu
Hardware name: To Be Filled By O.E.M. E3C252D4U-2T/E3C252D4U-2T/OVH, BIOS 1.31.OV02 01/28/2022
RIP: 0010:abd_iterate_func.part.0+0x31/0x1c0 [zfs]
Code: 89 e5 41 57 4d 89 c7 41 56 41 55 41 54 53 48 89 d3 48 83 ec 68 48 89 7d 88 48 89 4d 80 65 48 8b 04 25 28 00 00 00 48 89 45 d0 <8b> 07 48 89 75 98 41 89 c6 41 c1 ee 06 44 89 f0 83 e0 01 89 45 90
RSP: 0018:ffffbd7f0d3f3b18 EFLAGS: 00010296
RAX: a5e9f645e72d7d00 RBX: 0000000000004000 RCX: ffffffffc0dc0000
RDX: 0000000000004000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffbd7f0d3f3ba8 R08: ffffbd7f0d3f3bb8 R09: ffff976e587cc000
R10: ffff976caf9d7388 R11: ffffffffc101dd00 R12: ffff97682d193d70
R13: 0000000000000007 R14: 0000000000000002 R15: ffffbd7f0d3f3bb8
FS:  0000000000000000(0000) GS:ffff97741fb40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000001c9dfa004 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
 <TASK>
 ? 0xffffffffc0dc0000
 ? kmem_cache_alloc+0x1ab/0x2f0
 abd_copy_to_buf_off+0x3e/0x60 [zfs]
 arc_buf_fill+0x214/0xd00 [zfs]
 ? aggsum_add+0x1a3/0x1c0 [zfs]
 arc_buf_alloc_impl.isra.0+0x2f5/0x4b0 [zfs]
 ? spl_kmem_cache_free+0x145/0x200 [spl]
 arc_read_done+0x106/0x530 [zfs]
 zio_done+0x3fe/0x1260 [zfs]
 zio_execute+0x94/0x160 [zfs]
 taskq_thread+0x29c/0x4c0 [spl]
 ? wake_up_q+0x90/0x90
 ? zio_gang_tree_free+0x70/0x70 [zfs]
 ? taskq_thread_spawn+0x60/0x60 [spl]
 kthread+0x127/0x150
 ? set_kthread_struct+0x50/0x50
 ret_from_fork+0x1f/0x30
 </TASK>
Modules linked in: xt_nat veth tls nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink xfrm_user xfrm_algo br_netfilter bridge stp llc overlay isofs ip6t_REJECT nf_reject_ipv6 xt_hl ip6_tables ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nft_counter nf_tables nfnetlink nls_iso8859_1 ipmi_ssif zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) kvm_intel zcommon(POE) znvpair(POE) spl(OE) kvm joydev input_leds acpi_ipmi ipmi_si video acpi_tad sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_devintf ipmi_msghandler msr efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor hid_generic usbhid hid raid6_pq libcrc32c raid0 multipath linear ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect
 crct10dif_pclmul crc32_pclmul sysimgblt fb_sys_fops ghash_clmulni_intel raid1 cec aesni_intel rc_core nvme crypto_simd ahci xhci_pci cryptd i40e drm libahci nvme_core xhci_pci_renesas
CR2: 0000000000000000
---[ end trace 65809e962f72b871 ]---
RIP: 0010:abd_iterate_func.part.0+0x31/0x1c0 [zfs]
wodin commented 1 year ago

@nim-odoo have you considered trying a FreeBSD box as a test? I'd be interested to know if you get similar problems.

This issue is pretty horrifying.

I got a panic with a similar looking stack trace to the first one here, but the address in my case was 0000000000000000 rather than 000000000000000f.

In my case I am also not using send/receive, but unlike you I am not using encryption. (<- @rincebrain)

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 45c289067 P4D 45c289067 PUD 1d78af067 PMD 0 
Oops: 0002 [#1] SMP NOPTI
CPU: 9 PID: 3483572 Comm: zsysd Tainted: P         C OE     5.15.0-52-generic #58-Ubuntu
Hardware name: Micro-Star International Co., Ltd. MS-7C83/MAG B460M BAZOOKA (MS-7C83), BIOS A.00 05/13/2020
RIP: 0010:LZ4_uncompress_unknownOutputSize+0x36/0x310 [zfs]
Code: 48 89 e5 41 57 41 56 4c 8d 34 0e 41 55 4c 8d 2c 17 41 54 53 48 83 ec 28 4c 39 ef 0f 83 65 02 00 00 49 89 ff 49 89 f4 48 89 f7 <41> 0f b6 17 49 8d 47 01 89 d6 c1 ea 04 41 89 d0 83 fa 0f 0f 84 0d
RSP: 0018:ffffbfcc113ff728 EFLAGS: 00010287
RAX: 0000000000000000 RBX: ffff9a67aee1ea0a RCX: ffff9a67aee1ebf8
RDX: ffff9a67aee1ea02 RSI: 0000000000000000 RDI: ffff9a67aee1ea07
RBP: ffffbfcc113ff778 R08: 0000000000000001 R09: 0000000000000001
R10: ffffffffc29ce1b0 R11: ffff9a680a0afad8 R12: ffff9a67aee1ea00
R13: ffff9a67aee1e635 R14: ffff9a67aee1ec00 R15: ffff9a67aee1e609
FS:  00007f9d4c4dc800(0000) GS:ffff9a739ec40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000482e76006 CR4: 00000000007706e0
PKRU: 55555554
Call Trace:
 <TASK>
 lz4_decompress_zfs+0x2f/0x40 [zfs]
 zio_decompress_data_buf+0x8c/0x100 [zfs]
 zio_decompress_data+0x51/0xc0 [zfs]
 zio_decompress+0x62/0xe0 [zfs]
 zio_pop_transforms+0x71/0xa0 [zfs]
 zio_done+0x24c/0xe60 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 ? zio_ready+0x1bf/0x3f0 [zfs]
 zio_nowait+0xa6/0x130 [zfs]
 arc_read+0xb73/0x15c0 [zfs]
 ? dbuf_rele_and_unlock+0x540/0x540 [zfs]
 ? dnode_block_freed+0xdd/0x150 [zfs]
 dbuf_read_impl.constprop.0+0x2f2/0x490 [zfs]
 dbuf_read+0x1ba/0x5b0 [zfs]
 ? dmu_buf_hold_noread+0xc3/0x110 [zfs]
 dmu_buf_hold+0x66/0xa0 [zfs]
 zap_lockdir+0x51/0xb0 [zfs]
 zap_lookup_norm+0x5c/0xd0 [zfs]
 zap_lookup+0x16/0x20 [zfs]
 dsl_dir_hold+0x11b/0x280 [zfs]
 dsl_dataset_hold_flags+0x45/0x230 [zfs]
 ? rrw_enter_read+0x13/0x20 [zfs]
 ? rrw_enter+0x21/0x30 [zfs]
 ? dsl_pool_config_enter+0x1d/0x30 [zfs]
 dmu_objset_hold_flags+0x78/0xd0 [zfs]
 dmu_objset_hold+0x16/0x20 [zfs]
 zfs_ioc_objset_stats+0x33/0x80 [zfs]
 zfsdev_ioctl_common+0x683/0x740 [zfs]
 ? __check_object_size.part.0+0x4a/0x150
 ? _copy_from_user+0x2e/0x70
 zfsdev_ioctl+0x57/0xf0 [zfs]
 __x64_sys_ioctl+0x92/0xd0
 do_syscall_64+0x59/0xc0
 ? exc_page_fault+0x89/0x170
 entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x7f9d4cd6daff
Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00
RSP: 002b:00007ffe9ee38080 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffe9ee38100 RCX: 00007f9d4cd6daff
RDX: 00007ffe9ee38100 RSI: 0000000000005a12 RDI: 0000000000000004
RBP: 00007ffe9ee380f0 R08: 0000000000000000 R09: 0000000002af7a50
R10: 00007f9d4ce6d1b0 R11: 0000000000000246 R12: 0000000001eaf9c0
R13: 00007ffe9ee38100 R14: 000000c0000021a0 R15: 00000000000001da
 </TASK>
Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs xt_nat veth nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_masq nft_chain_nat nf_nat nf_conntrack ashmem_linux(C) binder_linux nf_defrag_ipv6 nf_defrag_ipv4 nft_counter nf_tables vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) nfnetlink bridge stp llc wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel overlay sunrpc binfmt_misc nls_iso8859_1 joydev input_leds snd_sof_pci_intel_cnl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_bus snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine intel_rapl_msr intel_rapl_common intel_tcc_cooling
 x86_pkg_temp_thermal intel_powerclamp coretemp nvidia_uvm(POE) kvm_intel kvm snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd soundcore mxm_wmi wmi_bmof mei_hdcp mei_me mei ee1004 rapl intel_cstate mac_hid acpi_pad acpi_tad sch_fq_codel ipmi_devintf ipmi_msghandler msr parport_pc ppdev lp ramoops parport reed_solomon efi_pstore pstore_blk pstore_zone ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) drm_kms_helper syscopyarea sysfillrect crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sysimgblt fb_sys_fops cec
 aesni_intel rc_core crypto_simd cryptd drm r8169 ahci i2c_i801 xhci_pci i2c_smbus realtek libahci xhci_pci_renesas wmi video pinctrl_sunrisepoint
CR2: 0000000000000000
---[ end trace fd6f805a9fb1c2cb ]---
RIP: 0010:LZ4_uncompress_unknownOutputSize+0x36/0x310 [zfs]
Code: 48 89 e5 41 57 41 56 4c 8d 34 0e 41 55 4c 8d 2c 17 41 54 53 48 83 ec 28 4c 39 ef 0f 83 65 02 00 00 49 89 ff 49 89 f4 48 89 f7 <41> 0f b6 17 49 8d 47 01 89 d6 c1 ea 04 41 89 d0 83 fa 0f 0f 84 0d
RSP: 0018:ffffbfcc113ff728 EFLAGS: 00010287
RAX: 0000000000000000 RBX: ffff9a67aee1ea0a RCX: ffff9a67aee1ebf8
RDX: ffff9a67aee1ea02 RSI: 0000000000000000 RDI: ffff9a67aee1ea07
RBP: ffffbfcc113ff778 R08: 0000000000000001 R09: 0000000000000001
R10: ffffffffc29ce1b0 R11: ffff9a680a0afad8 R12: ffff9a67aee1ea00
R13: ffff9a67aee1e635 R14: ffff9a67aee1ec00 R15: ffff9a67aee1e609
FS:  00007f9d4c4dc800(0000) GS:ffff9a739ec40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000482e76006 CR4: 00000000007706e0
PKRU: 55555554

Fortunately this has only happened to me once so far, but it's put a damper on my enthusiasm towards ZFS.

nim-odoo commented 1 year ago

@wodin At this point we switched to Ubuntu 22.04 (seems your case as well based on the kernel version) with ZFS 2.1.6 packaged from^1. We haven't faced the LZ4_Uncompress error anymore, but we are facing Incident 9^2 on a regular basis.

We haven't tested on FreeBSD.

wodin commented 1 year ago

Thanks for the reply.

Yes, mine happened on Ubuntu 22.04 (filesystems created on 20.04 and later upgraded).