openzfs / zfs

Kernel oops page fault triggered by Docker in arc_prune #16324

Open maxpoulin64 opened 2 months ago

maxpoulin64 commented 2 months ago

System information

Type                  Version/Name
Distribution Name     ArchLinux
Distribution Version  Latest
Kernel Version        6.8.9-zen1-1-zen
Architecture          x86_64
OpenZFS Version       2.2.4-2

I'm holding at 6.8.9 specifically to stay within the officially supported kernel versions.

Describe the problem you're observing

Extracting large container images in Docker causes ZFS to trigger an unhandled page fault, which locks up the filesystem until reboot. Sync never completes, and a normal shutdown doesn't complete either.

Describe how to reproduce the problem

Running this particular container reliably hangs ZFS on my system during extraction, using Docker's ZFS storage driver.

docker run -it --rm -p 8080:8080 --gpus all --name localai quay.io/go-skynet/local-ai:latest-aio-gpu-hipblas

It gets stuck on a line such as this one and never completes. Killing the Docker daemon turns it into a zombie, and IO is completely hosed.

6ddbee975253: Extracting  352.2MB/352.2MB

Include any warning/errors/backtraces from the system logs

[184791.050957] BUG: unable to handle page fault for address: 00000000208db6e0
[184791.050969] #PF: supervisor instruction fetch in kernel mode
[184791.050972] #PF: error_code(0x0010) - not-present page
[184791.050975] PGD 0 P4D 0 
[184791.050981] Oops: 0010 [#1] PREEMPT SMP NOPTI
[184791.050985] CPU: 11 PID: 482 Comm: arc_prune Tainted: P        W  OE      6.8.9-zen1-1-zen #1 b3e4ad3c9dbde87c9fb9d46fb90ca62a28a66a12
[184791.050992] Hardware name: Micro-Star International Co., Ltd. MS-7B09/X399 GAMING PRO CARBON AC (MS-7B09), BIOS 1.B0 08/09/2018
[184791.050995] RIP: 0010:0x208db6e0
[184791.051042] Code: Unable to access opcode bytes at 0x208db6b6.
[184791.051045] RSP: 0018:ffffb417d2293ce0 EFLAGS: 00010246
[184791.051049] RAX: 00000000208db6e0 RBX: ffffb417d2293d94 RCX: 0000000000000000
[184791.051052] RDX: 0000000000000000 RSI: ffffb417d2293d30 RDI: ffff97e1ac586a80
[184791.051056] RBP: 0000000000003ae0 R08: 0000000000006d66 R09: ffff97e4860e2e90
[184791.051059] R10: ffff97e4860e2e80 R11: ffff97e1f96c0000 R12: ffff97e538d00000
[184791.051063] R13: ffff97e48bf9d780 R14: ffff97e4860e2e28 R15: ffff97e1ac586a80
[184791.051066] FS:  0000000000000000(0000) GS:ffff97e46e4c0000(0000) knlGS:0000000000000000
[184791.051070] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[184791.051074] CR2: 00000000208db6e0 CR3: 000000019b3e6000 CR4: 00000000003506f0
[184791.051077] Call Trace:
[184791.051082]  <TASK>
[184791.051085]  ? __die+0x10f/0x120
[184791.051092]  ? page_fault_oops+0x171/0x4e0
[184791.051101]  ? exc_page_fault+0x7f/0x180
[184791.051107]  ? asm_exc_page_fault+0x26/0x30
[184791.051119]  ? zfs_prune+0xb0/0x4e0 [zfs 158ff065068c3ea6e221f98356463834dc655cec]
[184791.051438]  ? zpl_prune_sb+0x36/0x60 [zfs 158ff065068c3ea6e221f98356463834dc655cec]
[184791.051653]  ? arc_prune_task+0x22/0x40 [zfs 158ff065068c3ea6e221f98356463834dc655cec]
[184791.051880]  ? taskq_thread+0x2d4/0x6f0 [spl 44541b25f59ba0491e81482257bd475148318e14]
[184791.051901]  ? srso_return_thunk+0x5/0x5f
[184791.051907]  ? finish_task_switch.isra.0+0x94/0x2f0
[184791.051914]  ? __pfx_default_wake_function+0x10/0x10
[184791.051924]  ? __pfx_taskq_thread+0x10/0x10 [spl 44541b25f59ba0491e81482257bd475148318e14]
[184791.051941]  ? kthread+0xe8/0x120
[184791.051946]  ? __pfx_kthread+0x10/0x10
[184791.051951]  ? ret_from_fork+0x34/0x50
[184791.051955]  ? __pfx_kthread+0x10/0x10
[184791.051960]  ? ret_from_fork_asm+0x1b/0x30
[184791.051969]  </TASK>
[184791.051971] Modules linked in: xt_conntrack nf_conntrack_netlink xfrm_user xfrm_algo ip6table_nat ip6table_filter ip6_tables xt_addrtype br_netfilter overlay rfcomm snd_seq_dummy snd_hrtimer snd_seq wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel bridge stp llc uhid cmac algif_hash algif_skcipher af_alg xt_MASQUERADE bnep xt_nat iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c crc32c_generic iptable_filter dm_crypt cbc encrypted_keys vfat fat intel_rapl_msr intel_rapl_common btusb snd_hda_codec_realtek btrtl crct10dif_pclmul snd_hda_codec_generic btintel crc32_pclmul iwlmvm snd_hda_codec_hdmi btbcm crc32c_intel snd_usb_audio btmtk polyval_clmulni snd_hda_intel snd_usbmidi_lib polyval_generic mac80211 gf128mul libarc4 snd_intel_dspcfg snd_ump ghash_clmulni_intel snd_intel_sdw_acpi bluetooth snd_rawmidi sha512_ssse3 joydev snd_seq_device snd_hda_codec sha256_ssse3 ecdh_generic iwlwifi mousedev sha1_ssse3 mc
[184791.052084]  razerkbd(OE) crc16 aesni_intel snd_hda_core crypto_simd snd_hwdep cryptd snd_pcm igb cfg80211 rapl ptp snd_timer sp5100_tco pps_core gpio_amdpt snd dca wmi_bmof rfkill pcspkr soundcore gpio_generic mxm_wmi i2c_piix4 k10temp mac_hid kvmfr(OE) sg crypto_user loop nfnetlink ip_tables x_tables hid_steam ff_memless hid_logitech_hidpp hid_logitech_dj hid_generic trusted asn1_encoder tee dm_mod usbhid amdgpu vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd video amdxcp i2c_algo_bit drm_ttm_helper ttm kvm_amd drm_exec gpu_sched drm_suballoc_helper kvm nvme drm_buddy nvme_core drm_display_helper xhci_pci irqbypass cec ccp nvme_auth xhci_pci_renesas wmi zfs(POE) spl(OE) vendor_reset(OE) nct6775 nct6775_core hwmon_vid i2c_dev
[184791.052189] CR2: 00000000208db6e0
[184791.052193] ---[ end trace 0000000000000000 ]---
[184791.052196] RIP: 0010:0x208db6e0
[184791.052216] Code: Unable to access opcode bytes at 0x208db6b6.
[184791.052219] RSP: 0018:ffffb417d2293ce0 EFLAGS: 00010246
[184791.052223] RAX: 00000000208db6e0 RBX: ffffb417d2293d94 RCX: 0000000000000000
[184791.052226] RDX: 0000000000000000 RSI: ffffb417d2293d30 RDI: ffff97e1ac586a80
[184791.052229] RBP: 0000000000003ae0 R08: 0000000000006d66 R09: ffff97e4860e2e90
[184791.052232] R10: ffff97e4860e2e80 R11: ffff97e1f96c0000 R12: ffff97e538d00000
[184791.052235] R13: ffff97e48bf9d780 R14: ffff97e4860e2e28 R15: ffff97e1ac586a80
[184791.052238] FS:  0000000000000000(0000) GS:ffff97e46e4c0000(0000) knlGS:0000000000000000
[184791.052241] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[184791.052244] CR2: 00000000208db6e0 CR3: 000000019b3e6000 CR4: 00000000003506f0
[184791.052248] note: arc_prune[482] exited with irqs disabled

The stack trace is always the same. Disk passes scrub with 0 errors after rebooting.
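To check that claim across reports, the function names in each call trace can be compared while ignoring offsets, timestamps, and module build IDs. The following is only a minimal sketch: `trace_symbols` and `shared_frames` are hypothetical helper names, and the two sample traces are shortened excerpts from the logs in this thread.

```python
import re

# Matches call-trace entries such as "? zfs_prune+0xb0/0x4e0 [zfs ...]".
# Timestamps and host prefixes in front of the entry are skipped by search().
FRAME_RE = re.compile(r"(?:\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def trace_symbols(oops_text):
    """Return the function names listed between <TASK> and </TASK>, in order."""
    symbols = []
    in_trace = False
    for line in oops_text.splitlines():
        if "</TASK>" in line:
            break
        if in_trace:
            match = FRAME_RE.search(line)
            if match:
                symbols.append(match.group(1))
        elif "<TASK>" in line:
            in_trace = True
    return symbols


def shared_frames(trace_a, trace_b):
    """Return the symbols of trace_a that also occur in trace_b, keeping order."""
    other = set(trace_symbols(trace_b))
    return [name for name in trace_symbols(trace_a) if name in other]


# Shortened excerpts from two of the reports in this thread.
TRACE_A = """
[184791.051082]  <TASK>
[184791.051085]  ? __die+0x10f/0x120
[184791.051119]  ? zfs_prune+0xb0/0x4e0 [zfs 158ff065068c3ea6e221f98356463834dc655cec]
[184791.051438]  ? zpl_prune_sb+0x36/0x60 [zfs 158ff065068c3ea6e221f98356463834dc655cec]
[184791.051653]  ? arc_prune_task+0x22/0x40 [zfs 158ff065068c3ea6e221f98356463834dc655cec]
[184791.051969]  </TASK>
"""

TRACE_B = """
May 09 12:08:04 pingu kernel:  <TASK>
May 09 12:08:04 pingu kernel:  ? __die_body+0x1b/0x60
May 09 12:08:04 pingu kernel:  ? zfs_prune+0x9c/0x4a0 [zfs]
May 09 12:08:04 pingu kernel:  ? zpl_prune_sb+0x34/0x50 [zfs]
May 09 12:08:04 pingu kernel:  ? arc_prune_task+0x1b/0x30 [zfs]
May 09 12:08:04 pingu kernel:  </TASK>
"""

if __name__ == "__main__":
    print(shared_frames(TRACE_A, TRACE_B))
```

Comparing names only is a deliberate simplification: offsets differ between kernel builds, so they would make otherwise identical paths look different.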

TheUbuntuGuy commented 2 months ago

I've seen this several times on several systems running ZFS 2.2.4 and Linux 6.8. Reverting to Linux 6.6 is stable. We also use Docker with ZFS heavily. Our systems have 24 to 128 threads and run software build jobs in parallel using Docker containers. They see heavy CPU usage and IO over NVMe.

I have a trace which looks similar to yours (arc_prune). At the same time, I also started seeing a NULL pointer dereference in a different stack trace referencing iptables. The two happen together, so I'll include that trace here, but I don't know whether it is actually related.
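The two kinds of trace report different page-fault error codes: the arc_prune oops shows error_code(0x0010), the iptables one error_code(0x0002). As a rough aid for reading them, here is a small sketch that decodes the low bits of an x86 page-fault error code. The function name `decode_pf_error_code` is made up for illustration, and only the five lowest bits are covered.

```python
def decode_pf_error_code(code):
    """Decode the low bits of an x86 page-fault error code into named flags."""
    return {
        # Bit 0: set for a protection violation, clear for a not-present page.
        "present": bool(code & 0x01),
        # Bit 1: set when the faulting access was a write.
        "write": bool(code & 0x02),
        # Bit 2: set when the access came from user mode.
        "user": bool(code & 0x04),
        # Bit 3: set when a reserved bit was found in a page-table entry.
        "reserved_bit": bool(code & 0x08),
        # Bit 4: set when the fault was caused by an instruction fetch.
        "instruction_fetch": bool(code & 0x10),
    }


if __name__ == "__main__":
    for code in (0x0010, 0x0002):
        flags = decode_pf_error_code(code)
        active = [name for name, is_set in flags.items() if is_set]
        print(hex(code), active)
```

Read this way, 0x0010 is a kernel-mode instruction fetch from a not-present page and 0x0002 is a kernel-mode write to a not-present page, which matches the "#PF:" lines the kernel prints next to each code.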

An example of arc_prune:

May 09 12:08:04 pingu kernel: BUG: unable to handle page fault for address: 0000000200000000
May 09 12:08:04 pingu kernel: #PF: supervisor instruction fetch in kernel mode
May 09 12:08:04 pingu kernel: #PF: error_code(0x0010) - not-present page
May 09 12:08:04 pingu kernel: PGD 15e5cce067 P4D 15e5cce067 PUD 0 
May 09 12:08:04 pingu kernel: Oops: 0010 [#1] PREEMPT SMP NOPTI
May 09 12:08:04 pingu kernel: CPU: 102 PID: 1366 Comm: arc_prune Tainted: P           O       6.8.9 #1
May 09 12:08:04 pingu kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C60/TRX40 PRO 10G (MS-7C60), BIOS 1.60 05/13/2020
May 09 12:08:04 pingu kernel: RIP: 0010:0x200000000
May 09 12:08:04 pingu kernel: Code: Unable to access opcode bytes at 0x1ffffffd6.
May 09 12:08:04 pingu kernel: RSP: 0018:ffffb59e438cfd08 EFLAGS: 00010246
May 09 12:08:04 pingu kernel: RAX: 0000000200000000 RBX: ffff9c0fab2f0000 RCX: 0000000000000000
May 09 12:08:04 pingu kernel: RDX: 0000000000000001 RSI: ffffb59e438cfd58 RDI: ffff9bfb360f7180
May 09 12:08:04 pingu kernel: RBP: ffffb59e438cfdbc R08: 000000000000fdd9 R09: ffff9c38fe7ba4c0
May 09 12:08:04 pingu kernel: R10: 000000000000050a R11: 0000000000000066 R12: 000000000000fdd9
May 09 12:08:04 pingu kernel: R13: ffff9bfaf9386000 R14: ffff9c0fab2f00f8 R15: ffff9bfb360f7180
May 09 12:08:04 pingu kernel: FS:  0000000000000000(0000) GS:ffff9c38fe780000(0000) knlGS:0000000000000000
May 09 12:08:04 pingu kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 09 12:08:04 pingu kernel: CR2: 0000000200000000 CR3: 000000027fb1a000 CR4: 0000000000350ef0
May 09 12:08:04 pingu kernel: Call Trace:
May 09 12:08:04 pingu kernel:  <TASK>
May 09 12:08:04 pingu kernel:  ? __die_body+0x1b/0x60
May 09 12:08:04 pingu kernel:  ? page_fault_oops+0x15d/0x470
May 09 12:08:04 pingu kernel:  ? __mod_node_page_state+0x82/0xc0
May 09 12:08:04 pingu kernel:  ? exc_page_fault+0x74/0x170
May 09 12:08:04 pingu kernel:  ? asm_exc_page_fault+0x22/0x30
May 09 12:08:04 pingu kernel:  ? _raw_spin_unlock+0x15/0x30
May 09 12:08:04 pingu kernel:  ? zfs_prune+0x9c/0x4a0 [zfs]
May 09 12:08:04 pingu kernel:  ? autoremove_wake_function+0xe/0x30
May 09 12:08:04 pingu kernel:  ? zpl_prune_sb+0x34/0x50 [zfs]
May 09 12:08:04 pingu kernel:  ? arc_prune_task+0x1b/0x30 [zfs]
May 09 12:08:04 pingu kernel:  ? taskq_thread+0x26e/0x470 [spl]
May 09 12:08:04 pingu kernel:  ? wake_up_state+0x10/0x10
May 09 12:08:04 pingu kernel:  ? task_done+0x90/0x90 [spl]
May 09 12:08:04 pingu kernel:  ? kthread+0xee/0x120
May 09 12:08:04 pingu kernel:  ? kthread_complete_and_exit+0x20/0x20
May 09 12:08:04 pingu kernel:  ? ret_from_fork+0x2d/0x50
May 09 12:08:04 pingu kernel:  ? kthread_complete_and_exit+0x20/0x20
May 09 12:08:04 pingu kernel:  ? ret_from_fork_asm+0x11/0x20
May 09 12:08:04 pingu kernel:  </TASK>
May 09 12:08:04 pingu kernel: Modules linked in: xt_nat macvtap macvlan rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace netfs vhost_net vhost vhost_iotlb tap iptable_nat iptable_filter veth tcp_diag udp_diag inet_diag nf_conntrack_netlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM xt_MASQUERADE ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_chain_nat nf_nat bridge stp llc xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables libcrc32c nfnetlink nvme_fabrics overlay bonding tls cfg80211 sunrpc binfmt_misc amdgpu drm_exec amdxcp drm_buddy gpu_sched intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 snd_hda_codec_hdmi sha1_ssse3 radeon aesni_intel nls_iso8859_1 snd_hda_intel snd_intel_dspcfg crypto_simd snd_intel_sdw_acpi drm_suballoc_helper cryptd drm_ttm_helper snd_hda_codec ttm snd_hda_core drm_display_helper snd_pcsp snd_hwdep cec snd_pcm rc_core rapl snd_timer drm_kms_helper wmi_bmof snd video
May 09 12:08:04 pingu kernel:  soundcore ccp mxm_wmi sp5100_tco k10temp joydev mac_hid nct6775 nct6775_core hwmon_vid drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) hid_generic usbhid hid crc32_pclmul ixgbe igb ahci xfrm_algo i2c_algo_bit mdio libahci dca xhci_pci xhci_pci_renesas i2c_piix4 wmi
May 09 12:08:04 pingu kernel: CR2: 0000000200000000
May 09 12:08:04 pingu kernel: ---[ end trace 0000000000000000 ]---
May 09 12:08:04 pingu kernel: RIP: 0010:0x200000000
May 09 12:08:04 pingu kernel: Code: Unable to access opcode bytes at 0x1ffffffd6.
May 09 12:08:04 pingu kernel: RSP: 0018:ffffb59e438cfd08 EFLAGS: 00010246
May 09 12:08:04 pingu kernel: RAX: 0000000200000000 RBX: ffff9c0fab2f0000 RCX: 0000000000000000
May 09 12:08:04 pingu kernel: RDX: 0000000000000001 RSI: ffffb59e438cfd58 RDI: ffff9bfb360f7180
May 09 12:08:04 pingu kernel: RBP: ffffb59e438cfdbc R08: 000000000000fdd9 R09: ffff9c38fe7ba4c0
May 09 12:08:04 pingu kernel: R10: 000000000000050a R11: 0000000000000066 R12: 000000000000fdd9
May 09 12:08:04 pingu kernel: R13: ffff9bfaf9386000 R14: ffff9c0fab2f00f8 R15: ffff9bfb360f7180
May 09 12:08:04 pingu kernel: FS:  0000000000000000(0000) GS:ffff9c38fe780000(0000) knlGS:0000000000000000
May 09 12:08:04 pingu kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 09 12:08:04 pingu kernel: CR2: 0000000200000000 CR3: 000000027fb1a000 CR4: 0000000000350ef0
May 09 12:08:04 pingu kernel: note: arc_prune[1366] exited with irqs disabled

An example of iptables on the same system, same kernel:

May 09 15:03:05 pingu kernel: BUG: kernel NULL pointer dereference, address: 0000000000000015
May 09 15:03:05 pingu kernel: #PF: supervisor write access in kernel mode
May 09 15:03:05 pingu kernel: #PF: error_code(0x0002) - not-present page
May 09 15:03:05 pingu kernel: PGD 0 P4D 0 
May 09 15:03:05 pingu kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
May 09 15:03:05 pingu kernel: CPU: 65 PID: 232279 Comm: iptables Tainted: P           O       6.8.9 #1
May 09 15:03:05 pingu kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C60/TRX40 PRO 10G (MS-7C60), BIOS 1.60 05/13/2020
May 09 15:03:05 pingu kernel: RIP: 0010:iptable_nat_table_init+0xe6/0x170 [iptable_nat]
May 09 15:03:05 pingu kernel: Code: 48 89 ef 48 89 0c 24 e8 98 e9 d7 ff 48 8b 0c 24 85 c0 41 89 c4 75 30 41 83 c7 01 48 83 c1 28 41 83 ff 04 75 d4 48 8b 44 24 08 <4c> 89 30 4c 89 ef e8 3f ec 3c ee 48 83 c4 10 44 89 e0 5b 5d 41 5c
May 09 15:03:05 pingu kernel: RSP: 0018:ffffb71601a93bc8 EFLAGS: 00010246
May 09 15:03:05 pingu kernel: RAX: 0000000000000015 RBX: ffff99c83a710120 RCX: ffff99bba6d73960
May 09 15:03:05 pingu kernel: RDX: 0000000000000000 RSI: 0000000000000064 RDI: ffffffffc1edf1e0
May 09 15:03:05 pingu kernel: RBP: ffff99bd0a29ec00 R08: ffff99f47efd3000 R09: 0000000000000000
May 09 15:03:05 pingu kernel: R10: ffffb71601a93b20 R11: ffff99b696c19088 R12: 0000000000000000
May 09 15:03:05 pingu kernel: R13: ffff99cc9cd6a400 R14: ffff99bba6d738c0 R15: 0000000000000004
May 09 15:03:05 pingu kernel: FS:  00007f9c8074bb48(0000) GS:ffff99f37de40000(0000) knlGS:0000000000000000
May 09 15:03:05 pingu kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 09 15:03:05 pingu kernel: CR2: 0000000000000015 CR3: 00000017f53e2000 CR4: 0000000000350ef0
May 09 15:03:05 pingu kernel: Call Trace:
May 09 15:03:05 pingu kernel:  <TASK>
May 09 15:03:05 pingu kernel:  ? __die_body+0x1b/0x60
May 09 15:03:05 pingu kernel:  ? page_fault_oops+0x15d/0x470
May 09 15:03:05 pingu kernel:  ? do_user_addr_fault+0x65/0x830
May 09 15:03:05 pingu kernel:  ? __kmalloc_node+0x3d3/0x3e0
May 09 15:03:05 pingu kernel:  ? __nf_hook_entries_try_shrink+0x140/0x140
May 09 15:03:05 pingu kernel:  ? exc_page_fault+0x74/0x170
May 09 15:03:05 pingu kernel:  ? asm_exc_page_fault+0x22/0x30
May 09 15:03:05 pingu kernel:  ? iptable_nat_table_init+0xe6/0x170 [iptable_nat]
May 09 15:03:05 pingu kernel:  ? iptable_nat_table_init+0xc8/0x170 [iptable_nat]
May 09 15:03:05 pingu kernel:  xt_find_table_lock+0x128/0x1b0 [x_tables]
May 09 15:03:05 pingu kernel:  xt_request_find_table_lock+0x1b/0x70 [x_tables]
May 09 15:03:05 pingu kernel:  get_info+0x82/0x300 [ip_tables]
May 09 15:03:05 pingu kernel:  ? mntput_no_expire+0x4a/0x240
May 09 15:03:05 pingu kernel:  ? __local_bh_enable_ip+0x37/0x80
May 09 15:03:05 pingu kernel:  ? do_ip_getsockopt+0x85d/0xca0
May 09 15:03:05 pingu kernel:  do_ipt_get_ctl+0x6c/0x330 [ip_tables]
May 09 15:03:05 pingu kernel:  ? obj_cgroup_charge+0xf0/0x110
May 09 15:03:05 pingu kernel:  ? kmem_cache_alloc+0x122/0x2a0
May 09 15:03:05 pingu kernel:  nf_getsockopt+0x44/0x70
May 09 15:03:05 pingu kernel:  ip_getsockopt+0x82/0xc0
May 09 15:03:05 pingu kernel:  do_sock_getsockopt+0x9b/0x220
May 09 15:03:05 pingu kernel:  __sys_getsockopt+0x72/0xc0
May 09 15:03:05 pingu kernel:  __x64_sys_getsockopt+0x21/0x30
May 09 15:03:05 pingu kernel:  do_syscall_64+0x44/0xd0
May 09 15:03:05 pingu kernel:  entry_SYSCALL_64_after_hwframe+0x4b/0x53
May 09 15:03:05 pingu kernel: RIP: 0033:0x7f9c806ed484
May 09 15:03:05 pingu kernel: Code: 31 c9 b8 33 00 00 00 0f 05 48 89 c7 e8 da 29 fe ff 5a c3 49 89 ca 50 48 63 d2 48 63 f6 48 63 ff 45 31 c9 b8 37 00 00 00 0f 05 <48> 63 f8 e8 b9 29 fe ff 5a c3 64 48 8b 04 25 00 00 00 00 48 83 78
May 09 15:03:05 pingu kernel: RSP: 002b:00007ffcccafdb40 EFLAGS: 00000246 ORIG_RAX: 0000000000000037
May 09 15:03:05 pingu kernel: RAX: ffffffffffffffda RBX: 00007ffcccafdf18 RCX: 00007f9c806ed484
May 09 15:03:05 pingu kernel: RDX: 0000000000000040 RSI: 0000000000000000 RDI: 0000000000000004
May 09 15:03:05 pingu kernel: RBP: 0000000000000004 R08: 00007ffcccafdb6c R09: 0000000000000000
May 09 15:03:05 pingu kernel: R10: 00007ffcccafdb74 R11: 0000000000000246 R12: 00007ffcccafdb74
May 09 15:03:05 pingu kernel: R13: 0000000000000002 R14: 00007f9c8074b940 R15: 0000000000000000
May 09 15:03:05 pingu kernel:  </TASK>

bpwats commented 2 months ago

I have the same problem on the latest Unraid 7.0.0-beta-1 prerelease. kernel: Linux version 6.8.12-Unraid (root@Develop) (gcc (GCC) 13.2.0, GNU ld version 2.42-slack151) #3 SMP PREEMPT_DYNAMIC Tue Jun 18 07:52:57 PDT 2024

Jul 4 21:21:40 Fractal kernel: BUG: unable to handle page fault for address: 0000000200000002
Jul 4 21:21:40 Fractal kernel: #PF: supervisor instruction fetch in kernel mode
Jul 4 21:21:40 Fractal kernel: #PF: error_code(0x0010) - not-present page
Jul 4 21:21:40 Fractal kernel: PGD 24af2f067 P4D 24af2f067 PUD 0
Jul 4 21:21:40 Fractal kernel: Oops: 0010 [#1] PREEMPT SMP NOPTI
Jul 4 21:21:40 Fractal kernel: CPU: 5 PID: 1324 Comm: arc_prune Tainted: P O 6.8.12-Unraid #3
Jul 4 21:21:40 Fractal kernel: Hardware name: ASUS System Product Name/PRIME H510M-E, BIOS 2402 12/18/2023
Jul 4 21:21:40 Fractal kernel: RIP: 0010:0x200000002
Jul 4 21:21:40 Fractal kernel: Code: Unable to access opcode bytes at 0x1ffffffd8.
Jul 4 21:21:40 Fractal kernel: RSP: 0018:ffffc9000098fd30 EFLAGS: 00010246
Jul 4 21:21:40 Fractal kernel: RAX: 0000000200000002 RBX: ffff8884f4070000 RCX: 0000000000000011
Jul 4 21:21:40 Fractal kernel: RDX: ffffffffa0cc54b8 RSI: ffffc9000098fd68 RDI: ffff8881c13ac580
Jul 4 21:21:40 Fractal kernel: RBP: ffffc9000098fdcc R08: 0000000000000000 R09: 00000000001d001c
Jul 4 21:21:40 Fractal kernel: R10: 0000000000000000 R11: ffffc9002186fee8 R12: 000000000000bbda
Jul 4 21:21:40 Fractal kernel: R13: ffff8881c13ac580 R14: ffff8881c84bfc00 R15: ffff88811176a100
Jul 4 21:21:40 Fractal kernel: FS: 0000000000000000(0000) GS:ffff88883e740000(0000) knlGS:0000000000000000
Jul 4 21:21:40 Fractal kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 4 21:21:40 Fractal kernel: CR2: 0000000200000002 CR3: 0000000251da4006 CR4: 00000000003706f0
Jul 4 21:21:40 Fractal kernel: Call Trace:
Jul 4 21:21:40 Fractal kernel: <TASK>
Jul 4 21:21:40 Fractal kernel: ? __die_body+0x1a/0x5c
Jul 4 21:21:40 Fractal kernel: ? page_fault_oops+0x332/0x37f
Jul 4 21:21:40 Fractal kernel: ? put_cpu_partial+0x62/0x8e
Jul 4 21:21:40 Fractal kernel: ? spl_kmem_cache_free+0x3a/0x180 [spl]
Jul 4 21:21:40 Fractal kernel: ? exc_page_fault+0xf9/0x116
Jul 4 21:21:40 Fractal kernel: ? asm_exc_page_fault+0x22/0x30
Jul 4 21:21:40 Fractal kernel: ? zfs_prune+0xec/0x2ec [zfs]
Jul 4 21:21:40 Fractal kernel: ? zpl_prune_sb+0x32/0x50 [zfs]
Jul 4 21:21:40 Fractal kernel: ? arc_prune_task+0x1b/0x2e [zfs]
Jul 4 21:21:40 Fractal kernel: ? taskq_thread+0x2d4/0x3c1 [spl]
Jul 4 21:21:40 Fractal kernel: ? __pfx_default_wake_function+0x10/0x10
Jul 4 21:21:40 Fractal kernel: ? __pfx_taskq_thread+0x10/0x10 [spl]
Jul 4 21:21:40 Fractal kernel: ? kthread+0xf4/0xff
Jul 4 21:21:40 Fractal kernel: ? __pfx_kthread+0x10/0x10
Jul 4 21:21:40 Fractal kernel: ? ret_from_fork+0x21/0x36
Jul 4 21:21:40 Fractal kernel: ? __pfx_kthread+0x10/0x10
Jul 4 21:21:40 Fractal kernel: ? ret_from_fork_asm+0x1b/0x30
Jul 4 21:21:40 Fractal kernel: </TASK>
Jul 4 21:21:40 Fractal kernel: Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip_set nf_tables xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter bridge stp llc nfsd auth_rpcgss oid_registry lockd grace sunrpc bluetooth ecdh_generic ecc md_mod tcp_diag inet_diag ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs macvtap macvlan tap intel_rapl_common x86_pkg_temp_thermal i915 intel_powerclamp coretemp zfs(PO) kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 iosf_mbi sha256_ssse3 drm_buddy sha1_ssse3 ttm aesni_intel crypto_simd i2c_algo_bit cryptd drm_display_helper drm_kms_helper input_leds rapl spl(O) mei_hdcp mei_pxp intel_cstate wmi_bmof drm nvme intel_uncore e1000e hid_apple led_class nvme_core mei_me intel_gtt i2c_i801 agpgart i2c_smbus mei ahci
Jul 4 21:21:40 Fractal kernel: i2c_core libahci thermal fan tpm_crb video tpm_tis tpm_tis_core tpm wmi backlight button acpi_tad acpi_pad
Jul 4 21:21:40 Fractal kernel: CR2: 0000000200000002
Jul 4 21:21:40 Fractal kernel: ---[ end trace 0000000000000000 ]---
Jul 4 21:21:40 Fractal kernel: pstore: backend (efi_pstore) writing error (-5)
Jul 4 21:21:40 Fractal kernel: RIP: 0010:0x200000002
Jul 4 21:21:40 Fractal kernel: Code: Unable to access opcode bytes at 0x1ffffffd8.
Jul 4 21:21:40 Fractal kernel: RSP: 0018:ffffc9000098fd30 EFLAGS: 00010246
Jul 4 21:21:40 Fractal kernel: RAX: 0000000200000002 RBX: ffff8884f4070000 RCX: 0000000000000011
Jul 4 21:21:40 Fractal kernel: RDX: ffffffffa0cc54b8 RSI: ffffc9000098fd68 RDI: ffff8881c13ac580
Jul 4 21:21:40 Fractal kernel: RBP: ffffc9000098fdcc R08: 0000000000000000 R09: 00000000001d001c
Jul 4 21:21:40 Fractal kernel: R10: 0000000000000000 R11: ffffc9002186fee8 R12: 000000000000bbda
Jul 4 21:21:40 Fractal kernel: R13: ffff8881c13ac580 R14: ffff8881c84bfc00 R15: ffff88811176a100
Jul 4 21:21:40 Fractal kernel: FS: 0000000000000000(0000) GS:ffff88883e740000(0000) knlGS:0000000000000000
Jul 4 21:21:40 Fractal kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 4 21:21:40 Fractal kernel: CR2: 0000000200000002 CR3: 0000000251da4006 CR4: 00000000003706f0
Jul 4 21:21:40 Fractal kernel: note: arc_prune[1324] exited with irqs disabled
Jul 4 21:21:50 Fractal kernel: veth95a1b70: renamed from eth0

1JorgeB commented 2 months ago

I'm a mod at the Unraid forums, and we have seen multiple users hit this issue with Docker on ZFS since kernel 6.8 (OpenZFS 2.2.4-1). There was also one report with kernel 6.7 during beta testing. The call traces all look very similar. Here are some examples in case they help.
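When several reports are pasted back to back like the ones below, a short script can pull out the faulting address, task name, and kernel release of each one for a quick side-by-side view. This is only a sketch under stated assumptions: `summarize_reports` is a hypothetical helper, it only recognizes the "unable to handle page fault" form of the BUG line, and the sample input is four lines taken from the logs in this comment.

```python
import re

# Start of a report: the faulting address is printed in hex.
BUG_RE = re.compile(r"BUG: unable to handle page fault for address: ([0-9a-f]+)")
# The "CPU: ..." line carries the task name and the kernel release.
CPU_RE = re.compile(r"Comm: (\S+) Tainted: .*?(\d+\.\d+\.\d+\S*) #\d+")


def summarize_reports(log_text):
    """Return one dict per page-fault report found in a concatenated log."""
    reports = []
    current = None
    for line in log_text.splitlines():
        bug = BUG_RE.search(line)
        if bug:
            # A new BUG line starts a new report.
            current = {"address": int(bug.group(1), 16), "comm": None, "kernel": None}
            reports.append(current)
            continue
        if current is None or current["comm"] is not None:
            continue
        cpu = CPU_RE.search(line)
        if cpu:
            current["comm"], current["kernel"] = cpu.groups()
    return reports


# Four lines copied from the reports in this comment.
SAMPLE_LOG = """
Jul  4 21:21:40 Fractal kernel: BUG: unable to handle page fault for address: 0000000200000002
Jul  4 21:21:40 Fractal kernel: CPU: 5 PID: 1324 Comm: arc_prune Tainted: P           O       6.8.12-Unraid #3
Jul  5 06:58:09 Sirius kernel: BUG: unable to handle page fault for address: 0000000200000000
Jul  5 06:58:09 Sirius kernel: CPU: 3 PID: 1079 Comm: arc_prune Tainted: P     U     O       6.8.12-Unraid #3
"""

if __name__ == "__main__":
    for report in summarize_reports(SAMPLE_LOG):
        print(hex(report["address"]), report["comm"], report["kernel"])
```

Only the first "CPU:" line after each BUG line is used, since the register dump is repeated after the end-of-trace marker.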

Jul  4 21:21:40 Fractal kernel: BUG: unable to handle page fault for address: 0000000200000002
Jul  4 21:21:40 Fractal kernel: #PF: supervisor instruction fetch in kernel mode
Jul  4 21:21:40 Fractal kernel: #PF: error_code(0x0010) - not-present page
Jul  4 21:21:40 Fractal kernel: PGD 24af2f067 P4D 24af2f067 PUD 0 
Jul  4 21:21:40 Fractal kernel: Oops: 0010 [#1] PREEMPT SMP NOPTI
Jul  4 21:21:40 Fractal kernel: CPU: 5 PID: 1324 Comm: arc_prune Tainted: P           O       6.8.12-Unraid #3
Jul  4 21:21:40 Fractal kernel: Hardware name: ASUS System Product Name/PRIME H510M-E, BIOS 2402 12/18/2023
Jul  4 21:21:40 Fractal kernel: RIP: 0010:0x200000002
Jul  4 21:21:40 Fractal kernel: Code: Unable to access opcode bytes at 0x1ffffffd8.
Jul  4 21:21:40 Fractal kernel: RSP: 0018:ffffc9000098fd30 EFLAGS: 00010246
Jul  4 21:21:40 Fractal kernel: RAX: 0000000200000002 RBX: ffff8884f4070000 RCX: 0000000000000011
Jul  4 21:21:40 Fractal kernel: RDX: ffffffffa0cc54b8 RSI: ffffc9000098fd68 RDI: ffff8881c13ac580
Jul  4 21:21:40 Fractal kernel: RBP: ffffc9000098fdcc R08: 0000000000000000 R09: 00000000001d001c
Jul  4 21:21:40 Fractal kernel: R10: 0000000000000000 R11: ffffc9002186fee8 R12: 000000000000bbda
Jul  4 21:21:40 Fractal kernel: R13: ffff8881c13ac580 R14: ffff8881c84bfc00 R15: ffff88811176a100
Jul  4 21:21:40 Fractal kernel: FS:  0000000000000000(0000) GS:ffff88883e740000(0000) knlGS:0000000000000000
Jul  4 21:21:40 Fractal kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul  4 21:21:40 Fractal kernel: CR2: 0000000200000002 CR3: 0000000251da4006 CR4: 00000000003706f0
Jul  4 21:21:40 Fractal kernel: Call Trace:
Jul  4 21:21:40 Fractal kernel: <TASK>
Jul  4 21:21:40 Fractal kernel: ? __die_body+0x1a/0x5c
Jul  4 21:21:40 Fractal kernel: ? page_fault_oops+0x332/0x37f
Jul  4 21:21:40 Fractal kernel: ? put_cpu_partial+0x62/0x8e
Jul  4 21:21:40 Fractal kernel: ? spl_kmem_cache_free+0x3a/0x180 [spl]
Jul  4 21:21:40 Fractal kernel: ? exc_page_fault+0xf9/0x116
Jul  4 21:21:40 Fractal kernel: ? asm_exc_page_fault+0x22/0x30
Jul  4 21:21:40 Fractal kernel: ? zfs_prune+0xec/0x2ec [zfs]
Jul  4 21:21:40 Fractal kernel: ? zpl_prune_sb+0x32/0x50 [zfs]
Jul  4 21:21:40 Fractal kernel: ? arc_prune_task+0x1b/0x2e [zfs]
Jul  4 21:21:40 Fractal kernel: ? taskq_thread+0x2d4/0x3c1 [spl]
Jul  4 21:21:40 Fractal kernel: ? __pfx_default_wake_function+0x10/0x10
Jul  4 21:21:40 Fractal kernel: ? __pfx_taskq_thread+0x10/0x10 [spl]
Jul  4 21:21:40 Fractal kernel: ? kthread+0xf4/0xff
Jul  4 21:21:40 Fractal kernel: ? __pfx_kthread+0x10/0x10
Jul  4 21:21:40 Fractal kernel: ? ret_from_fork+0x21/0x36
Jul  4 21:21:40 Fractal kernel: ? __pfx_kthread+0x10/0x10
Jul  4 21:21:40 Fractal kernel: ? ret_from_fork_asm+0x1b/0x30
Jul  4 21:21:40 Fractal kernel: </TASK>
Jul  4 21:21:40 Fractal kernel: Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip_set nf_tables xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter bridge stp llc nfsd auth_rpcgss oid_registry lockd grace sunrpc bluetooth ecdh_generic ecc md_mod tcp_diag inet_diag ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs macvtap macvlan tap intel_rapl_common x86_pkg_temp_thermal i915 intel_powerclamp coretemp zfs(PO) kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 iosf_mbi sha256_ssse3 drm_buddy sha1_ssse3 ttm aesni_intel crypto_simd i2c_algo_bit cryptd drm_display_helper drm_kms_helper input_leds rapl spl(O) mei_hdcp mei_pxp intel_cstate wmi_bmof drm nvme intel_uncore e1000e hid_apple led_class nvme_core mei_me intel_gtt i2c_i801 agpgart i2c_smbus mei ahci
Jul  4 21:21:40 Fractal kernel: i2c_core libahci thermal fan tpm_crb video tpm_tis tpm_tis_core tpm wmi backlight button acpi_tad acpi_pad
Jul  4 21:21:40 Fractal kernel: CR2: 0000000200000002
Jul  4 21:21:40 Fractal kernel: ---[ end trace 0000000000000000 ]---
Jul  4 21:21:40 Fractal kernel: pstore: backend (efi_pstore) writing error (-5)
Jul  4 21:21:40 Fractal kernel: RIP: 0010:0x200000002
Jul  4 21:21:40 Fractal kernel: Code: Unable to access opcode bytes at 0x1ffffffd8.
Jul  4 21:21:40 Fractal kernel: RSP: 0018:ffffc9000098fd30 EFLAGS: 00010246
Jul  4 21:21:40 Fractal kernel: RAX: 0000000200000002 RBX: ffff8884f4070000 RCX: 0000000000000011
Jul  4 21:21:40 Fractal kernel: RDX: ffffffffa0cc54b8 RSI: ffffc9000098fd68 RDI: ffff8881c13ac580
Jul  4 21:21:40 Fractal kernel: RBP: ffffc9000098fdcc R08: 0000000000000000 R09: 00000000001d001c
Jul  4 21:21:40 Fractal kernel: R10: 0000000000000000 R11: ffffc9002186fee8 R12: 000000000000bbda
Jul  4 21:21:40 Fractal kernel: R13: ffff8881c13ac580 R14: ffff8881c84bfc00 R15: ffff88811176a100
Jul  4 21:21:40 Fractal kernel: FS:  0000000000000000(0000) GS:ffff88883e740000(0000) knlGS:0000000000000000
Jul  4 21:21:40 Fractal kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul  4 21:21:40 Fractal kernel: CR2: 0000000200000002 CR3: 0000000251da4006 CR4: 00000000003706f0
Jul  4 21:21:40 Fractal kernel: note: arc_prune[1324] exited with irqs disabled

Jul  5 06:58:09 Sirius kernel: BUG: unable to handle page fault for address: 0000000200000000
Jul  5 06:58:09 Sirius kernel: #PF: supervisor instruction fetch in kernel mode
Jul  5 06:58:09 Sirius kernel: #PF: error_code(0x0010) - not-present page
Jul  5 06:58:09 Sirius kernel: PGD 8000000170a48067 P4D 8000000170a48067 PUD 0 
Jul  5 06:58:09 Sirius kernel: Oops: 0010 [#1] PREEMPT SMP PTI
Jul  5 06:58:09 Sirius kernel: CPU: 3 PID: 1079 Comm: arc_prune Tainted: P     U     O       6.8.12-Unraid #3
Jul  5 06:58:09 Sirius kernel: Hardware name: Gigabyte Technology Co., Ltd. C246N-WU2/C246N-WU2-CF, BIOS F2 11/09/2021
Jul  5 06:58:09 Sirius kernel: RIP: 0010:0x200000000
Jul  5 06:58:09 Sirius kernel: Code: Unable to access opcode bytes at 0x1ffffffd6.
Jul  5 06:58:09 Sirius kernel: RSP: 0018:ffffc900005c7d30 EFLAGS: 00010246
Jul  5 06:58:09 Sirius kernel: RAX: 0000000200000000 RBX: ffff8883cb33a000 RCX: 0000000000000011
Jul  5 06:58:09 Sirius kernel: RDX: ffffffffa0fe34b8 RSI: ffffc900005c7d68 RDI: ffff8882ad5d7380
Jul  5 06:58:09 Sirius kernel: RBP: ffffc900005c7dcc R08: 0000000000000000 R09: 0000000000000000
Jul  5 06:58:09 Sirius kernel: R10: 0000000000017c78 R11: 000000000000f40e R12: 000000000000cb06
Jul  5 06:58:09 Sirius kernel: R13: ffff8882ad5d7380 R14: ffff8881363196c0 R15: ffff8881060c8000
Jul  5 06:58:09 Sirius kernel: FS:  0000000000000000(0000) GS:ffff88884e580000(0000) knlGS:0000000000000000
Jul  5 06:58:09 Sirius kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul  5 06:58:09 Sirius kernel: CR2: 0000000200000000 CR3: 0000000170508004 CR4: 00000000003706f0
Jul  5 06:58:09 Sirius kernel: Call Trace:
Jul  5 06:58:09 Sirius kernel: <TASK>
Jul  5 06:58:09 Sirius kernel: ? __die_body+0x1a/0x5c
Jul  5 06:58:09 Sirius kernel: ? page_fault_oops+0x332/0x37f
Jul  5 06:58:09 Sirius kernel: ? call_rcu+0x530/0x5e6
Jul  5 06:58:09 Sirius kernel: ? exc_page_fault+0xf9/0x116
Jul  5 06:58:09 Sirius kernel: ? asm_exc_page_fault+0x22/0x30
Jul  5 06:58:09 Sirius kernel: ? zfs_prune+0xec/0x2ec [zfs]
Jul  5 06:58:09 Sirius kernel: ? __schedule+0x69c/0x6e8
Jul  5 06:58:09 Sirius kernel: ? zpl_prune_sb+0x32/0x50 [zfs]
Jul  5 06:58:09 Sirius kernel: ? arc_prune_task+0x1b/0x2e [zfs]
Jul  5 06:58:09 Sirius kernel: ? taskq_thread+0x2d4/0x3c1 [spl]
Jul  5 06:58:09 Sirius kernel: ? __pfx_default_wake_function+0x10/0x10
Jul  5 06:58:09 Sirius kernel: ? __pfx_taskq_thread+0x10/0x10 [spl]
Jul  5 06:58:09 Sirius kernel: ? kthread+0xf4/0xff
Jul  5 06:58:09 Sirius kernel: ? __pfx_kthread+0x10/0x10
Jul  5 06:58:09 Sirius kernel: ? ret_from_fork+0x21/0x36
Jul  5 06:58:09 Sirius kernel: ? __pfx_kthread+0x10/0x10
Jul  5 06:58:09 Sirius kernel: ? ret_from_fork_asm+0x1b/0x30
Jul  5 06:58:09 Sirius kernel: </TASK>
Jul  5 06:58:09 Sirius kernel: Modules linked in: bluetooth ecdh_generic ecc xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter md_mod tcp_diag inet_diag kvmgt mdev i915 drm_buddy ttm drm_display_helper drm_kms_helper drm intel_gtt agpgart ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs af_packet 8021q garp mrp bridge stp llc bonding tls e1000e igb i2c_algo_bit intel_rapl_common iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp zfs(PO) kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd mei_pxp mei_hdcp rapl spl(O) gigabyte_wmi wmi_bmof i2c_i801 mei_me intel_cstate intel_uncore i2c_smbus nvme mei joydev i2c_core ahci input_leds led_class nvme_core libahci intel_pch_thermal fan thermal video tpm_crb wmi backlight tpm_tis tpm_tis_core tpm button acpi_pad [last unloaded: e1000e]
Jul  5 06:58:09 Sirius kernel: CR2: 0000000200000000
Jul  5 06:58:09 Sirius kernel: ---[ end trace 0000000000000000 ]---
Jul  5 06:58:09 Sirius kernel: RIP: 0010:0x200000000
Jul  5 06:58:09 Sirius kernel: Code: Unable to access opcode bytes at 0x1ffffffd6.
Jul  5 06:58:09 Sirius kernel: RSP: 0018:ffffc900005c7d30 EFLAGS: 00010246
Jul  5 06:58:09 Sirius kernel: RAX: 0000000200000000 RBX: ffff8883cb33a000 RCX: 0000000000000011
Jul  5 06:58:09 Sirius kernel: RDX: ffffffffa0fe34b8 RSI: ffffc900005c7d68 RDI: ffff8882ad5d7380
Jul  5 06:58:09 Sirius kernel: RBP: ffffc900005c7dcc R08: 0000000000000000 R09: 0000000000000000
Jul  5 06:58:09 Sirius kernel: R10: 0000000000017c78 R11: 000000000000f40e R12: 000000000000cb06
Jul  5 06:58:09 Sirius kernel: R13: ffff8882ad5d7380 R14: ffff8881363196c0 R15: ffff8881060c8000
Jul  5 06:58:09 Sirius kernel: FS:  0000000000000000(0000) GS:ffff88884e580000(0000) knlGS:0000000000000000
Jul  5 06:58:09 Sirius kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul  5 06:58:09 Sirius kernel: CR2: 0000000200000000 CR3: 0000000170508004 CR4: 00000000003706f0
Jul  5 06:58:09 Sirius kernel: note: arc_prune[1079] exited with irqs disabled
Jul  1 09:09:32 SofaKing kernel: BUG: unable to handle page fault for address: 0000000200000000
Jul  1 09:09:32 SofaKing kernel: #PF: supervisor instruction fetch in kernel mode
Jul  1 09:09:32 SofaKing kernel: #PF: error_code(0x0010) - not-present page
Jul  1 09:09:32 SofaKing kernel: PGD 0 P4D 0 
Jul  1 09:09:32 SofaKing kernel: Oops: 0010 [#1] PREEMPT SMP NOPTI
Jul  1 09:09:32 SofaKing kernel: CPU: 8 PID: 1547 Comm: arc_prune Tainted: P           O       6.8.12-Unraid #3
Jul  1 09:09:32 SofaKing kernel: Hardware name: System manufacturer System Product Name/TUF GAMING X570-PLUS (WI-FI), BIOS 4602 02/23/2023
Jul  1 09:09:32 SofaKing kernel: RIP: 0010:0x200000000
Jul  1 09:09:32 SofaKing kernel: Code: Unable to access opcode bytes at 0x1ffffffd6.
Jul  1 09:09:32 SofaKing kernel: RSP: 0018:ffffc90001267d30 EFLAGS: 00010246
Jul  1 09:09:32 SofaKing kernel: RAX: 0000000200000000 RBX: ffff8884254b0000 RCX: 0000000000000011
Jul  1 09:09:32 SofaKing kernel: RDX: ffffffffa43294b8 RSI: ffffc90001267d68 RDI: ffff88866784b780
Jul  1 09:09:32 SofaKing kernel: RBP: ffffc90001267dcc R08: 0000000000000000 R09: 00000000001d001c
Jul  1 09:09:32 SofaKing kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000c1d1
Jul  1 09:09:32 SofaKing kernel: R13: ffff88866784b780 R14: ffff88810160a0c0 R15: ffff88811025e180
Jul  1 09:09:32 SofaKing kernel: FS:  0000000000000000(0000) GS:ffff888feea00000(0000) knlGS:0000000000000000
Jul  1 09:09:32 SofaKing kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul  1 09:09:32 SofaKing kernel: CR2: 0000000200000000 CR3: 0000000210406000 CR4: 0000000000750ef0
Jul  1 09:09:32 SofaKing kernel: PKRU: 55555554
Jul  1 09:09:32 SofaKing kernel: Call Trace:
Jul  1 09:09:32 SofaKing kernel: <TASK>
Jul  1 09:09:32 SofaKing kernel: ? __die_body+0x1a/0x5c
Jul  1 09:09:32 SofaKing kernel: ? page_fault_oops+0x332/0x37f
Jul  1 09:09:32 SofaKing kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Jul  1 09:09:32 SofaKing kernel: ? spl_kmem_cache_free+0x3a/0x180 [spl]
Jul  1 09:09:32 SofaKing kernel: ? exc_page_fault+0xf9/0x116
Jul  1 09:09:32 SofaKing kernel: ? asm_exc_page_fault+0x22/0x30
Jul  1 09:09:32 SofaKing kernel: ? zfs_prune+0xef/0x2ec [zfs]
Jul  1 09:09:32 SofaKing kernel: ? zpl_prune_sb+0x32/0x50 [zfs]
Jul  1 09:09:32 SofaKing kernel: ? arc_prune_task+0x1e/0x2e [zfs]
Jul  1 09:09:32 SofaKing kernel: ? taskq_thread+0x2d7/0x3c1 [spl]
Jul  1 09:09:32 SofaKing kernel: ? __pfx_default_wake_function+0x10/0x10
Jul  1 09:09:32 SofaKing kernel: ? __pfx_taskq_thread+0x10/0x10 [spl]
Jul  1 09:09:32 SofaKing kernel: ? kthread+0xf7/0xff
Jul  1 09:09:32 SofaKing kernel: ? __pfx_kthread+0x10/0x10
Jul  1 09:09:32 SofaKing kernel: ? ret_from_fork+0x24/0x36
Jul  1 09:09:32 SofaKing kernel: ? __pfx_kthread+0x10/0x10
Jul  1 09:09:32 SofaKing kernel: ? ret_from_fork_asm+0x1b/0x30
Jul  1 09:09:32 SofaKing kernel: </TASK>
Jul  1 09:09:32 SofaKing kernel: Modules linked in: xt_nat veth nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter xt_CHECKSUM xt_conntrack ipt_REJECT nf_reject_ipv4 ip6table_mangle iptable_mangle vhost_net vhost vhost_iotlb nvidia_uvm(PO) nfsd auth_rpcgss oid_registry lockd grace sunrpc md_mod xt_tcpudp xt_mark tun nf_tables nfnetlink ip6table_nat tcp_diag inet_diag nct6775 nct6775_core hwmon_vid iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs 8021q garp mrp macvtap macvlan tap bridge stp llc atlantic r8169 realtek edac_mce_amd edac_core intel_rapl_common iosf_mbi nvidia_drm(PO) nvidia_modeset(PO) kvm_amd zfs(PO) nvidia(PO) video kvm drm_kms_helper spl(O) crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel btusb sha512_ssse3 drm sha256_ssse3 sha1_ssse3 btrtl aesni_intel btbcm
Jul  1 09:09:32 SofaKing kernel: crypto_simd btintel cp210x cryptd wmi_bmof rapl joydev bluetooth input_leds acpi_cpufreq i2c_piix4 backlight k10temp usbserial nvme ccp ahci i2c_core ecdh_generic libahci ecc nvme_core led_class wmi tpm_crb tpm_tis tpm_tis_core tpm button [last unloaded: atlantic]
Jul  1 09:09:32 SofaKing kernel: CR2: 0000000200000000
Jul  1 09:09:32 SofaKing kernel: ---[ end trace 0000000000000000 ]---
Jul  1 09:09:32 SofaKing kernel: RIP: 0010:0x200000000
Jul  1 09:09:32 SofaKing kernel: Code: Unable to access opcode bytes at 0x1ffffffd6.
Jul  1 09:09:32 SofaKing kernel: RSP: 0018:ffffc90001267d30 EFLAGS: 00010246
Jul  1 09:09:32 SofaKing kernel: RAX: 0000000200000000 RBX: ffff8884254b0000 RCX: 0000000000000011
Jul  1 09:09:32 SofaKing kernel: RDX: ffffffffa43294b8 RSI: ffffc90001267d68 RDI: ffff88866784b780
Jul  1 09:09:32 SofaKing kernel: RBP: ffffc90001267dcc R08: 0000000000000000 R09: 00000000001d001c
Jul  1 09:09:32 SofaKing kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000c1d1
Jul  1 09:09:32 SofaKing kernel: R13: ffff88866784b780 R14: ffff88810160a0c0 R15: ffff88811025e180
Jul  1 09:09:32 SofaKing kernel: FS:  0000000000000000(0000) GS:ffff888feea00000(0000) knlGS:0000000000000000
Jul  1 09:09:32 SofaKing kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul  1 09:09:32 SofaKing kernel: CR2: 0000000200000000 CR3: 0000000210406000 CR4: 0000000000750ef0
Jul  1 09:09:32 SofaKing kernel: PKRU: 55555554
Jul  1 09:09:32 SofaKing kernel: note: arc_prune[1547] exited with irqs disabled
Jun 30 03:00:45 unRAID kernel: BUG: unable to handle page fault for address: 0000000200000002
Jun 30 03:00:45 unRAID kernel: #PF: supervisor instruction fetch in kernel mode
Jun 30 03:00:45 unRAID kernel: #PF: error_code(0x0010) - not-present page
Jun 30 03:00:45 unRAID kernel: PGD c1420c067 P4D c1420c067 PUD 0 
Jun 30 03:00:45 unRAID kernel: Oops: 0010 [#1] PREEMPT SMP NOPTI
Jun 30 03:00:45 unRAID kernel: CPU: 3 PID: 9677 Comm: arc_prune Tainted: P           O       6.8.12-Unraid #3
Jun 30 03:00:45 unRAID kernel: Hardware name: ASRock Z690 PG Riptide/Z690 PG Riptide, BIOS 18.04 06/07/2024
Jun 30 03:00:45 unRAID kernel: RIP: 0010:0x200000002
Jun 30 03:00:45 unRAID kernel: Code: Unable to access opcode bytes at 0x1ffffffd8.
Jun 30 03:00:45 unRAID kernel: RSP: 0018:ffffc9000086bd30 EFLAGS: 00010246
Jun 30 03:00:45 unRAID kernel: RAX: 0000000200000002 RBX: ffff8885cf7f4000 RCX: 0000000000000011
Jun 30 03:00:45 unRAID kernel: RDX: ffffffffa0fd84b8 RSI: ffffc9000086bd68 RDI: ffff888294f3f780
Jun 30 03:00:45 unRAID kernel: RBP: ffffc9000086bdcc R08: 0000000000000000 R09: ffff88810b63f548
Jun 30 03:00:45 unRAID kernel: R10: 000000000000065a R11: 0000000000000699 R12: 000000000002040e
Jun 30 03:00:45 unRAID kernel: R13: ffff888294f3f780 R14: ffff88813cdfbe00 R15: ffff88813beba080
Jun 30 03:00:45 unRAID kernel: FS:  0000000000000000(0000) GS:ffff88904f2c0000(0000) knlGS:0000000000000000
Jun 30 03:00:45 unRAID kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 30 03:00:45 unRAID kernel: CR2: 0000000200000002 CR3: 00000002c443a000 CR4: 0000000000752ef0
Jun 30 03:00:45 unRAID kernel: PKRU: 55555554
Jun 30 03:00:45 unRAID kernel: Call Trace:
Jun 30 03:00:45 unRAID kernel: <TASK>
Jun 30 03:00:45 unRAID kernel: ? __die_body+0x1a/0x5c
Jun 30 03:00:45 unRAID kernel: ? page_fault_oops+0x332/0x37f
Jun 30 03:00:45 unRAID kernel: ? call_rcu+0x530/0x5e6
Jun 30 03:00:45 unRAID kernel: ? exc_page_fault+0xf9/0x116
Jun 30 03:00:45 unRAID kernel: ? asm_exc_page_fault+0x22/0x30
Jun 30 03:00:45 unRAID kernel: ? zfs_prune+0xec/0x2ec [zfs]
Jun 30 03:00:45 unRAID kernel: ? autoremove_wake_function+0xe/0x33
Jun 30 03:00:45 unRAID kernel: ? zpl_prune_sb+0x32/0x50 [zfs]
Jun 30 03:00:45 unRAID kernel: ? arc_prune_task+0x1b/0x2e [zfs]
Jun 30 03:00:45 unRAID kernel: ? taskq_thread+0x2d4/0x3c1 [spl]
Jun 30 03:00:45 unRAID kernel: ? __pfx_default_wake_function+0x10/0x10
Jun 30 03:00:45 unRAID kernel: ? __pfx_taskq_thread+0x10/0x10 [spl]
Jun 30 03:00:45 unRAID kernel: ? kthread+0xf4/0xff
Jun 30 03:00:45 unRAID kernel: ? __pfx_kthread+0x10/0x10
Jun 30 03:00:45 unRAID kernel: ? ret_from_fork+0x21/0x36
Jun 30 03:00:45 unRAID kernel: ? __pfx_kthread+0x10/0x10
Jun 30 03:00:45 unRAID kernel: ? ret_from_fork_asm+0x1b/0x30
Jun 30 03:00:45 unRAID kernel: </TASK>
Jun 30 03:00:45 unRAID kernel: Modules linked in: veth xt_CHECKSUM ipt_REJECT nf_reject_ipv4 xt_nat xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter bridge dm_crypt dm_mod nfsd auth_rpcgss oid_registry lockd grace sunrpc md_mod zfs(PO) spl(O) tcp_diag inet_diag nct6775 nct6775_core hwmon_vid ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs macvtap macvlan tap 8021q garp mrp stp llc mlx4_en xe drm_gpuvm drm_exec gpu_sched drm_ttm_helper drm_suballoc_helper x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel i915 kvm drm_buddy crct10dif_pclmul crc32_pclmul ttm crc32c_intel ghash_clmulni_intel sha512_ssse3 sha256_ssse3 i2c_algo_bit sha1_ssse3 drm_display_helper aesni_intel crypto_simd cryptd drm_kms_helper mei_hdcp mei_pxp rapl drm processor_thermal_device_pci i2c_i801 mei_me processor_thermal_device intel_cstate
Jun 30 03:00:45 unRAID kernel: processor_thermal_wt_hint wmi_bmof mlx4_core intel_gtt intel_uncore processor_thermal_rfim nvme processor_thermal_rapl intel_rapl_common processor_thermal_wt_req i2c_smbus ahci agpgart mei input_leds processor_thermal_power_floor nvme_core libahci joydev processor_thermal_mbox led_class i2c_core int340x_thermal_zone iosf_mbi tpm_crb tpm_tis video tpm_tis_core wmi tpm int3400_thermal backlight acpi_thermal_rel acpi_pad acpi_tad button
Jun 30 03:00:45 unRAID kernel: CR2: 0000000200000002
Jun 30 03:00:45 unRAID kernel: ---[ end trace 0000000000000000 ]---
Jun 30 03:00:45 unRAID kernel: RIP: 0010:0x200000002
Jun 30 03:00:45 unRAID kernel: Code: Unable to access opcode bytes at 0x1ffffffd8.
Jun 30 03:00:45 unRAID kernel: RSP: 0018:ffffc9000086bd30 EFLAGS: 00010246
Jun 30 03:00:45 unRAID kernel: RAX: 0000000200000002 RBX: ffff8885cf7f4000 RCX: 0000000000000011
Jun 30 03:00:45 unRAID kernel: RDX: ffffffffa0fd84b8 RSI: ffffc9000086bd68 RDI: ffff888294f3f780
Jun 30 03:00:45 unRAID kernel: RBP: ffffc9000086bdcc R08: 0000000000000000 R09: ffff88810b63f548
Jun 30 03:00:45 unRAID kernel: R10: 000000000000065a R11: 0000000000000699 R12: 000000000002040e
Jun 30 03:00:45 unRAID kernel: R13: ffff888294f3f780 R14: ffff88813cdfbe00 R15: ffff88813beba080
Jun 30 03:00:45 unRAID kernel: FS:  0000000000000000(0000) GS:ffff88904f2c0000(0000) knlGS:0000000000000000
Jun 30 03:00:45 unRAID kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 30 03:00:45 unRAID kernel: CR2: 0000000200000002 CR3: 00000002c443a000 CR4: 0000000000752ef0
Jun 30 03:00:45 unRAID kernel: PKRU: 55555554
Jun 30 03:00:45 unRAID kernel: note: arc_prune[9677] exited with irqs disabled

webdock-io commented 1 month ago

We've been hitting what looks like this issue ever since we launched our new infrastructure, all of it running the latest Ubuntu kernel and ZFS. Once we had migrated several hundred container workloads, we started experiencing crashes. We've been unlucky with crash dump collection, but we've ascertained that the crashes we are seeing are very similar to this one; please see the LXCFS issue linked above.

Symptoms as we see them:

  1. Triggered by read-heavy workloads. The more reads you have, the more likely this is to happen.
  2. Raising the ARC size to the maximum we are able to has greatly mitigated these crashes. Reading from ARC in RAM seems to prevent this from happening to some extent. (We also spread read-heavy workloads across different hosts as much as we could, which also seems to have helped.)
  3. The issue seems to affect LXCFS, likely due to memory getting corrupted.
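
A minimal sketch of the ARC mitigation in item 2, assuming the limit is raised through the zfs_arc_max module parameter (the comment does not say how it was done). The function names and any size passed in are illustrative only. The sketch prints the command rather than running it, since writing the parameter requires root.

```shell
# Convert a size in GiB to bytes, the unit zfs_arc_max expects.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print the command that would raise the ARC ceiling at runtime.
# Printing instead of executing keeps this sketch safe to run as a non-root user.
arc_max_command() {
    # Standard location of the module parameter on Linux.
    param_file=/sys/module/zfs/parameters/zfs_arc_max
    printf 'echo %s > %s\n' "$(gib_to_bytes "$1")" "$param_file"
}
```

Calling arc_max_command with a size in GiB prints the line to execute as root. To keep the setting across reboots, the usual place is an "options zfs zfs_arc_max=<bytes>" line in a file under /etc/modprobe.d/.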

ZFS versions where we've seen this: zfs-2.2.2-0ubuntu9 to zfs-2.2.4-1 / zfs-kmod-2.2.4-1 (Zabbly).

Kernel versions where we've seen this: 6.8.0-31-generic #31-Ubuntu to 6.9.10-zabbly+ #ubuntu24.04.

These workloads were all stable for years on older kernels.

This is a real issue, and I would not be surprised to learn that a lot of ZFS users out there are being affected by it now. It took us a long time to track down the source of our crashes, and I expect others may be in the same situation. I believe this issue warrants immediate attention, especially since upgrading to the latest mainline-ish kernel and ZFS does not seem to resolve it.

mihalicyn commented 1 month ago

Given that this crash happens in shrinker-related code, my wild guess is that the issue can be provoked by something like echo 3 > /proc/sys/vm/drop_caches (don't try this on production systems!).

maxpoulin64 commented 1 month ago

> Taking into account that this crash happens in a shrinkers-related code I can make a wild guess that this issue should be provokable by something like echo 3 > /proc/sys/vm/drop_caches (don't try on production systems!).

I tried that and it did not crash my system or trigger the issue. AFAIK the ZFS ARC is separate.

My guess is that it's hitting some sort of race condition under heavy IO, when it has to evict a lot from the ARC.
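
One way to check the eviction-pressure guess is to record how full the ARC is shortly before a hang. The sketch below reads the size and c_max fields from /proc/spl/kstat/zfs/arcstats. The helper names are hypothetical, and the sketch only observes the ARC; it does not confirm or rule out a race.

```shell
# Print the value of one field from an arcstats-formatted file.
# Usage: arcstat_field NAME [FILE]
# FILE defaults to the kstat path OpenZFS exposes on Linux.
arcstat_field() {
    awk -v name="$1" '$1 == name { print $3 }' "${2:-/proc/spl/kstat/zfs/arcstats}"
}

# Print ARC fill as a whole-number percentage of its current ceiling (size / c_max).
arc_fill_percent() {
    file="${1:-/proc/spl/kstat/zfs/arcstats}"
    size=$(arcstat_field size "$file")
    max=$(arcstat_field c_max "$file")
    echo $(( size * 100 / max ))
}
```

Logging arc_fill_percent every few seconds while the image extracts would show whether the hang coincides with the ARC sitting at its limit.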

danieldietsch commented 1 month ago

I have a similar issue with kernel 6.8.12, Docker 27.0.3, and ZFS 2.2.4 (zfs-2.2.4-r0-gentoo) on Gentoo. My call trace does not include arc_prune_task, only zfs_prune, although the faulting thread is still arc_prune.

[Thu Jul 25 10:11:50 2024] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[Thu Jul 25 10:11:50 2024] BUG: unable to handle page fault for address: ffff888209535180
[Thu Jul 25 10:11:50 2024] #PF: supervisor instruction fetch in kernel mode
[Thu Jul 25 10:11:50 2024] #PF: error_code(0x0011) - permissions violation
[Thu Jul 25 10:11:50 2024] PGD 3001067 P4D 3001067 PUD 81f5f2067 PMD 40b61e063 PTE 8000000209535063
[Thu Jul 25 10:11:50 2024] Oops: 0011 [#1] SMP PTI
[Thu Jul 25 10:11:50 2024] CPU: 2 PID: 3297 Comm: arc_prune Tainted: P           O       6.8.12-gentoo #1
[Thu Jul 25 10:11:50 2024] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./Z77X-D3H, BIOS F18i 01/06/2014
[Thu Jul 25 10:11:50 2024] RIP: 0010:0xffff888209535180
[Thu Jul 25 10:11:50 2024] Code: 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 <00> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 85 e8 4b 8d 82 88
[Thu Jul 25 10:11:50 2024] RSP: 0018:ffffc900004ebd20 EFLAGS: 00010246
[Thu Jul 25 10:11:50 2024] RAX: ffff8881031a5c40 RBX: ffff8883be0ce000 RCX: 00000000031a5c31
[Thu Jul 25 10:11:50 2024] RDX: 0000000000000000 RSI: ffffc900004ebd68 RDI: ffff88828d4be880
[Thu Jul 25 10:11:50 2024] RBP: ffff88828d4be880 R08: 0000000000000000 R09: 0000000000000000
[Thu Jul 25 10:11:50 2024] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000206
[Thu Jul 25 10:11:50 2024] R13: ffffc900004ebdcc R14: ffff8883be0ce0f8 R15: ffff888108505200
[Thu Jul 25 10:11:50 2024] FS:  0000000000000000(0000) GS:ffff8887ff300000(0000) knlGS:0000000000000000
[Thu Jul 25 10:11:50 2024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Thu Jul 25 10:11:50 2024] CR2: ffff888209535180 CR3: 000000000262e001 CR4: 00000000001706f0
[Thu Jul 25 10:11:50 2024] Call Trace:
[Thu Jul 25 10:11:50 2024]  <TASK>
[Thu Jul 25 10:11:50 2024]  ? __die+0x1a/0x60
[Thu Jul 25 10:11:50 2024]  ? page_fault_oops+0x158/0x430
[Thu Jul 25 10:11:50 2024]  ? search_extable+0x22/0x30
[Thu Jul 25 10:11:50 2024]  ? search_module_extables+0x9/0x30
[Thu Jul 25 10:11:50 2024]  ? fixup_exception+0x1d/0x240
[Thu Jul 25 10:11:50 2024]  ? exc_page_fault+0x28a/0x580
[Thu Jul 25 10:11:50 2024]  ? asm_exc_page_fault+0x22/0x30
[Thu Jul 25 10:11:50 2024]  ? zfs_prune+0x9b/0x3f0 [zfs]
[Thu Jul 25 10:11:50 2024]  ? __switch_to_asm+0x3a/0x60
[Thu Jul 25 10:11:50 2024]  ? __switch_to_asm+0x34/0x60
[Thu Jul 25 10:11:50 2024]  ? zpl_prune_sb+0x2f/0x1780 [zfs]
[Thu Jul 25 10:11:50 2024]  ? arc_getbuf_func+0x26/0x340 [zfs]
[Thu Jul 25 10:11:50 2024]  ? taskq_dispatch+0x48f/0x680 [spl]
[Thu Jul 25 10:11:50 2024]  ? wake_up_state+0x10/0x10
[Thu Jul 25 10:11:50 2024]  ? taskq_dispatch+0x240/0x680 [spl]
[Thu Jul 25 10:11:50 2024]  ? kthread+0xc4/0xf0
[Thu Jul 25 10:11:50 2024]  ? kthread_complete_and_exit+0x20/0x20
[Thu Jul 25 10:11:50 2024]  ? ret_from_fork+0x28/0x40
[Thu Jul 25 10:11:50 2024]  ? kthread_complete_and_exit+0x20/0x20
[Thu Jul 25 10:11:50 2024]  ? ret_from_fork_asm+0x11/0x20
[Thu Jul 25 10:11:50 2024]  </TASK>
[Thu Jul 25 10:11:50 2024] Modules linked in: wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 br_netfilter em28xx_rc si2157 si2168 bridge stp llc xt_MASQUERADE xt_addrtype zfs(PO) spl(O) xt_LOG nf_log_syslog ip6t_REJECT nf_reject_ipv6 em28xx_alsa ip6table_filter ip6_tables drxk em28xx_dvb snd_hda_codec_hdmi snd_hda_codec_via snd_hda_codec_generic led_class snd_hda_intel adm1021 snd_intel_dspcfg snd_hda_codec em28xx x86_pkg_temp_thermal snd_hda_core i915 cdc_acm tveeprom snd_pcm atl1c mpt3sas snd_timer i2c_algo_bit raid_class drm_buddy scsi_transport_sas drm_display_helper fan ttm video evdev wmi
[Thu Jul 25 10:11:50 2024] CR2: ffff888209535180
[Thu Jul 25 10:11:50 2024] ---[ end trace 0000000000000000 ]---
[Thu Jul 25 10:11:50 2024] RIP: 0010:0xffff888209535180
[Thu Jul 25 10:11:50 2024] Code: 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 <00> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 85 e8 4b 8d 82 88
[Thu Jul 25 10:11:50 2024] RSP: 0018:ffffc900004ebd20 EFLAGS: 00010246
[Thu Jul 25 10:11:50 2024] RAX: ffff8881031a5c40 RBX: ffff8883be0ce000 RCX: 00000000031a5c31
[Thu Jul 25 10:11:50 2024] RDX: 0000000000000000 RSI: ffffc900004ebd68 RDI: ffff88828d4be880
[Thu Jul 25 10:11:50 2024] RBP: ffff88828d4be880 R08: 0000000000000000 R09: 0000000000000000
[Thu Jul 25 10:11:50 2024] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000206
[Thu Jul 25 10:11:50 2024] R13: ffffc900004ebdcc R14: ffff8883be0ce0f8 R15: ffff888108505200
[Thu Jul 25 10:11:50 2024] FS:  0000000000000000(0000) GS:ffff8887ff300000(0000) knlGS:0000000000000000
[Thu Jul 25 10:11:50 2024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Thu Jul 25 10:11:50 2024] CR2: ffff888209535180 CR3: 000000000262e001 CR4: 00000000001706f0
[Thu Jul 25 10:11:50 2024] note: arc_prune[3297] exited with irqs disabled
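
To compare which ZFS code paths appear in each report, the module frames can be pulled out of a saved log. A rough sketch, assuming the log is plain dmesg or journal text like the traces in this thread; zfs_frames is a hypothetical helper name.

```shell
# Read a kernel log on stdin and print the unique symbols from the
# [zfs] and [spl] modules, with offsets stripped, in first-seen order.
zfs_frames() {
    grep -oE '[A-Za-z_][A-Za-z0-9_.]*\+0x[0-9a-f]+/0x[0-9a-f]+ \[(zfs|spl)\]' |
        sed -E 's/\+0x[0-9a-f]+\/0x[0-9a-f]+//' |
        awk '!seen[$0]++'
}
```

For example, journalctl -k | zfs_frames lists the frames from the current boot, which makes it easier to see whether two reports share the zfs_prune path.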

andrebrait commented 1 month ago

@1JorgeB any chance switching storage drivers might be a reliable workaround until this gets resolved, versus using a btrfs image?

Xynonners commented 1 month ago

Same issue here on 6.8.9: it crashes after a few hours of an AI training workload.

Will revert to 6.6.

EDIT: 6.6 is stable (I am not using Docker, just standard filesystem reads and writes).

satmandu commented 1 month ago

FYI OpenZFS has supported docker's overlay2 storage driver since 2.2. See https://github.com/moby/moby/issues/46337#issuecomment-1899224961

I gave up on the docker zfs storage driver some time ago, as it was pretty buggy.

If you are starting docker with systemd you can modify the startup line to be:

ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock -s overlay2 

With overlay2 I have no issues running docker run -it --rm -p 8080:8080 --gpus all --name localai quay.io/go-skynet/local-ai:latest-aio-gpu-hipblas on a kernel newer than 6.9, with zfs built from https://github.com/openzfs/zfs/pull/16359 . In fairness, I'm also applying some patches from other PRs on top of that so that I can be on kernel 6.11.0-rc1...
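
The same ExecStart change can be applied as a systemd drop-in instead of editing the packaged unit. A minimal sketch, assuming the dockerd path and flags quoted above match your distribution's unit; the function name and the overlay2.conf file name are made up for illustration.

```shell
# Write a systemd drop-in that starts dockerd with the overlay2 storage driver.
# Usage: write_overlay2_dropin DIR
# DIR is normally /etc/systemd/system/docker.service.d (requires root).
write_overlay2_dropin() {
    mkdir -p "$1" || return 1
    cat > "$1/overlay2.conf" <<'EOF'
[Service]
# An empty ExecStart= clears the unit's original command before replacing it.
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock -s overlay2
EOF
}
```

After writing the file as root, run systemctl daemon-reload and systemctl restart docker; docker info --format '{{.Driver}}' should then report overlay2. Images and containers created under the zfs driver are not visible after the switch and would need to be pulled again.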

satmandu commented 1 month ago

But maybe also https://github.com/openzfs/zfs/pull/16401 and https://github.com/openzfs/zfs/pull/16404 might help track down the issue?

putnam commented 1 week ago

Similar situation here, this time with 6.10.4 and zfs 2.2.5-1 on Debian.

This occurred during a docker compose pull while some image layers were being extracted in tandem. I am also using the zfs driver.

I was forced to power cycle after this, which incidentally upgraded my kernel to 6.10.6, and the same pull succeeded fine afterward. But the ARC conditions would also be entirely different after a fresh boot so I doubt the kernel upgrade mattered. Just some info.

FWIW, my kernel is in lockdown mode due to secure boot.

Aug 29 04:25:08 server kernel: PGD 0 P4D 0
Aug 29 04:25:08 server kernel: Oops: Oops: 0010 [#1] PREEMPT SMP NOPTI
Aug 29 04:25:08 server kernel: CPU: 8 PID: 1168 Comm: arc_prune Tainted: P           O       6.10.4-amd64 #1  Debian 6.10.4-1
Aug 29 04:25:08 server kernel: Hardware name: Supermicro Super Server/H12SSL-CT, BIOS 2.8 02/27/2024
Aug 29 04:25:08 server kernel: RIP: 0010:0x200000000
Aug 29 04:25:08 server kernel: Code: Unable to access opcode bytes at 0x1ffffffd6.
Aug 29 04:25:08 server kernel: RSP: 0018:ffffbdfb4be4fce8 EFLAGS: 00010246
Aug 29 04:25:08 server kernel: RAX: 0000000200000000 RBX: ffffbdfb4be4fd9c RCX: 0000000000000000
Aug 29 04:25:08 server kernel: RDX: 0000000000000000 RSI: ffffbdfb4be4fd38 RDI: ffff961b87569400
Aug 29 04:25:08 server kernel: RBP: 0000000000023f7e R08: ffff96279b820000 R09: ffff9618fdb7da28
Aug 29 04:25:08 server kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffff961b87569400
Aug 29 04:25:08 server kernel: R13: ffff96190575ff70 R14: ffff9618fdb7da90 R15: ffff96279b8200f8
Aug 29 04:25:08 server kernel: FS:  0000000000000000(0000) GS:ffff96378dc00000(0000) knlGS:0000000000000000
Aug 29 04:25:08 server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 29 04:25:08 server kernel: CR2: 0000000200000000 CR3: 0000000f6e890000 CR4: 0000000000350ef0
Aug 29 04:25:08 server kernel: Call Trace:
Aug 29 04:25:08 server kernel:  <TASK>
Aug 29 04:25:08 server kernel:  ? __die+0x23/0x70
Aug 29 04:25:08 server kernel:  ? page_fault_oops+0x173/0x5a0
Aug 29 04:25:08 server kernel:  ? spl_kmem_cache_free+0x130/0x1e0 [spl]
Aug 29 04:25:08 server kernel:  ? exc_page_fault+0x7e/0x180
Aug 29 04:25:08 server kernel:  ? asm_exc_page_fault+0x26/0x30
Aug 29 04:25:08 server kernel:  ? zfs_prune+0xba/0x4e0 [zfs]
Aug 29 04:25:08 server kernel:  ? finish_task_switch.isra.0+0x97/0x2c0
Aug 29 04:25:08 server kernel:  ? srso_return_thunk+0x5/0x5f
Aug 29 04:25:08 server kernel:  ? zpl_prune_sb+0x38/0x60 [zfs]
Aug 29 04:25:08 server kernel:  ? arc_prune_task+0x22/0x40 [zfs]
Aug 29 04:25:08 server kernel:  ? taskq_thread+0x2ba/0x500 [spl]
Aug 29 04:25:08 server kernel:  ? __pfx_default_wake_function+0x10/0x10
Aug 29 04:25:08 server kernel:  ? __pfx_taskq_thread+0x10/0x10 [spl]
Aug 29 04:25:08 server kernel:  ? kthread+0xd2/0x100
Aug 29 04:25:08 server kernel:  ? __pfx_kthread+0x10/0x10
Aug 29 04:25:08 server kernel:  ? ret_from_fork+0x34/0x50
Aug 29 04:25:08 server kernel:  ? __pfx_kthread+0x10/0x10
Aug 29 04:25:08 server kernel:  ? ret_from_fork_asm+0x1a/0x30
Aug 29 04:25:08 server kernel:  </TASK>
Aug 29 04:25:08 server kernel: Modules linked in: cpuid udp_diag tcp_diag inet_diag wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 curve25519_x86_64 libcurve25519_generic libchacha ip>
Aug 29 04:25:08 server kernel:  watchdog k10temp ipmi_msghandler joydev evdev sg nvme_fabrics drm efi_pstore configfs nfnetlink ip_tables x_tables autofs4 zfs(PO) spl(O) efivarfs raid10 raid0 raid>
Aug 29 04:25:08 server kernel: CR2: 0000000200000000
Aug 29 04:25:08 server kernel: ---[ end trace 0000000000000000 ]---
Aug 29 04:25:08 server kernel: RIP: 0010:0x200000000
Aug 29 04:25:08 server kernel: Code: Unable to access opcode bytes at 0x1ffffffd6.
Aug 29 04:25:08 server kernel: RSP: 0018:ffffbdfb4be4fce8 EFLAGS: 00010246
Aug 29 04:25:08 server kernel: RAX: 0000000200000000 RBX: ffffbdfb4be4fd9c RCX: 0000000000000000
Aug 29 04:25:08 server kernel: RDX: 0000000000000000 RSI: ffffbdfb4be4fd38 RDI: ffff961b87569400
Aug 29 04:25:08 server kernel: RBP: 0000000000023f7e R08: ffff96279b820000 R09: ffff9618fdb7da28
Aug 29 04:25:08 server kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffff961b87569400
Aug 29 04:25:08 server kernel: R13: ffff96190575ff70 R14: ffff9618fdb7da90 R15: ffff96279b8200f8
Aug 29 04:25:08 server kernel: FS:  0000000000000000(0000) GS:ffff96378dc00000(0000) knlGS:0000000000000000
Aug 29 04:25:08 server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 29 04:25:08 server kernel: CR2: 0000000200000000 CR3: 0000000f6e890000 CR4: 0000000000350ef0
Aug 29 04:25:08 server kernel: note: arc_prune[1168] exited with irqs disabled

maxpoulin64 commented 2 days ago

Still reproducing in ZFS 2.2.6 and kernel 6.10.8-zen1-1-zen.