openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

zcommon causes a kernel crash RHEL9.4 #16540

Open khain0 opened 3 days ago

khain0 commented 3 days ago

System information

Type | Version/Name
Distribution Name | Red Hat Enterprise Linux
Distribution Version | 9.4
Kernel Version | 5.14.0-427.22.1.el9_4.x86_64
Architecture | x86_64
OpenZFS Version | zfs-dkms 2.1.15-3

Output of the command to find the OpenZFS version:
zfs-2.1.15-3
zfs-kmod-2.1.15-3

Describe the problem you're observing

The kernel crashes with a general protection fault in kfpu_end, in the zcommon module.

Describe how to reproduce the problem

No specific steps are known. Wait for the kernel to crash.

Include any warning/errors/backtraces from the system logs

Kernel crash log

[6972591.685926] general protection fault, maybe for address 0xff43024862a6b000: 0000 [#1] PREEMPT SMP NOPTI
[6972591.687227] CPU: 27 PID: 2524776 Comm: z_wr_iss Kdump: loaded Tainted: P           OE  X  -------  ---  5.14.0-427.22.1.el9_4.x86_64 #1
[6972591.689706] Hardware name: Dell Inc. PowerEdge XE8640/0TVHHH, BIOS 2.0.3 05/15/2024
[6972591.690719] RIP: 0010:kfpu_end+0x34/0xa0 [zcommon]
[6972591.691725] Code: 00 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 65 8b 05 4a c2 85 3f 48 98 48 8b 0c c2 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 19 fb 65 ff 0d 29 c2 85 3f 75 05 0f 1f 44 00 00 48 8b 44 24
[6972591.693723] RSP: 0018:ff4781d383f5f9a0 EFLAGS: 00010046
[6972591.694708] RAX: 00000000ffffffff RBX: ff4781d383f5faa0 RCX: ff43024862a6b000
[6972591.695684] RDX: 00000000ffffffff RSI: ff430229a0368000 RDI: ff4781d383f5fac0
[6972591.696645] RBP: 0000000000001000 R08: ff4781d383f5faa0 R09: ff430229a03687e6
[6972591.697593] R10: ff430229a0368000 R11: 00000000000007e6 R12: ff430229a0368000
[6972591.698529] R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000000008
[6972591.699452] FS:  0000000000000000(0000) GS:ff430282ff340000(0000) knlGS:0000000000000000
[6972591.700366] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[6972591.701272] CR2: 0000000012808000 CR3: 000000fdb80c4003 CR4: 0000000000771ee0
[6972591.702169] PKRU: 55555554
[6972591.703046] Call Trace:
[6972591.703901]  <TASK>
[6972591.704741]  ? show_trace_log_lvl+0x1c4/0x2df
[6972591.705580]  ? show_trace_log_lvl+0x1c4/0x2df
[6972591.706406]  ? abd_fletcher_4_iter+0x64/0xc0 [zcommon]
[6972591.707217]  ? __die_body.cold+0x8/0xd
[6972591.708005]  ? die_addr+0x39/0x60
[6972591.708801]  ? exc_general_protection+0x1aa/0x400
[6972591.709549]  ? asm_exc_general_protection+0x22/0x30
[6972591.710291]  ? kfpu_end+0x34/0xa0 [zcommon]
[6972591.711026]  abd_fletcher_4_iter+0x64/0xc0 [zcommon]
[6972591.711742]  abd_iterate_func.part.0+0xbd/0x1c0 [zfs]
[6972591.712531]  ? __pfx_abd_fletcher_4_iter+0x10/0x10 [zcommon]
[6972591.713213]  abd_fletcher_4_native+0x7c/0xc0 [zfs]
[6972591.713995]  ? find_busiest_group+0x11d/0x240
[6972591.714645]  zio_checksum_compute+0xc7/0x3f0 [zfs]
[6972591.715374]  ? __kmem_cache_alloc_node+0x1c7/0x2d0
[6972591.715994]  ? spl_kmem_alloc+0xb2/0x100 [spl]
[6972591.716604]  ? spl_kmem_alloc+0xb2/0x100 [spl]
[6972591.717190]  ? __kmalloc_node+0x4e/0x140
[6972591.717755]  ? spl_kmem_alloc+0xb2/0x100 [spl]
[6972591.718309]  ? zio_write_compress+0x768/0x9c0 [zfs]
[6972591.718932]  zio_checksum_generate+0x4c/0x70 [zfs]
[6972591.719533]  zio_execute+0x80/0x120 [zfs]
[6972591.720115]  taskq_thread+0x2cc/0x500 [spl]
[6972591.720612]  ? __pfx_default_wake_function+0x10/0x10
[6972591.721091]  ? __pfx_zio_execute+0x10/0x10 [zfs]
[6972591.721630]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[6972591.722080]  kthread+0xdd/0x100
[6972591.722509]  ? __pfx_kthread+0x10/0x10
[6972591.722922]  ret_from_fork+0x29/0x50
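For anyone reading along: the byte marked with <..> in the Code: line is where RIP was pointing, so the faulting instruction can be decoded by hand. Below is a minimal Python sketch that does this for the log above. It only understands the 0F C7 opcode group with plain register-indirect addressing and no prefixes, which is an assumption that happens to fit this dump; the function name decode_faulting_insn is made up for illustration.

```python
# Code: line copied from the oops above; <..> marks the byte at RIP.
CODE_LINE = (
    "00 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 65 8b 05 "
    "4a c2 85 3f 48 98 48 8b 0c c2 0f 1f 44 00 00 b8 ff ff ff ff "
    "89 c2 <0f> c7 19 fb 65 ff 0d 29 c2 85 3f 75 05 0f 1f 44 00 00 "
    "48 8b 44 24"
)

# In the x86 0F C7 opcode group, the ModRM.reg field selects the instruction.
# Only the memory-operand forms relevant here are listed.
GROUP_0F_C7 = {1: "cmpxchg8b", 3: "xrstors", 4: "xsavec", 5: "xsaves"}
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_faulting_insn(code_line):
    """Return (mnemonic, base_register) for the instruction at the <..> marker.

    Hypothetical helper: handles only 0F C7 /r with mod=00 and no SIB byte,
    no RIP-relative addressing, and no prefixes.
    """
    tokens = code_line.split()
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    insn = [int(t.strip("<>"), 16) for t in tokens[start:start + 3]]
    if insn[:2] != [0x0F, 0xC7]:
        raise ValueError("not a 0F C7 group instruction")
    modrm = insn[2]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0 or rm in (4, 5):
        raise ValueError("addressing form not handled by this sketch")
    return GROUP_0F_C7.get(reg, "unknown"), REGS64[rm]


print(decode_faulting_insn(CODE_LINE))
```

The bytes 0f c7 19 decode to xrstors with the memory operand addressed by RCX. RCX in the register dump is ff43024862a6b000, the same address reported in the general protection fault line, so the fault is raised while kfpu_end restores the saved extended processor state from that buffer.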

khain0 commented 1 day ago

The same issue occurred again today:

[103992.441067] CPU: 79 PID: 2564 Comm: z_rd_int_1 Kdump: loaded Tainted: P           OE  X  -------  ---  5.14.0-427.22.1.el9_4.x86_64 #1
[103992.443048] Hardware name: Dell Inc. PowerEdge XE8640/0TVHHH, BIOS 2.0.3 05/15/2024
[103992.444053] RIP: 0010:kfpu_end+0x34/0xa0 [zcommon]
[103992.445062] Code: 00 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 65 8b 05 4a c2 74 3f 48 98 48 8b 0c c2 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 19 fb 65 ff 0d 29 c2 74 3f 75 05 0f 1f 44 00 00 48 8b 44 24
[103992.447128] RSP: 0000:ff4809c2396478a0 EFLAGS: 00010046
[103992.448170] RAX: 00000000ffffffff RBX: ff4809c2396479a0 RCX: ff16a4239c857000
[103992.449228] RDX: 00000000ffffffff RSI: ff16a4799fde0000 RDI: ff4809c2396479c0
[103992.450284] RBP: 0000000000020000 R08: ff4809c2396479a0 R09: 0000000000000000
[103992.451163] R10: 0000000000000000 R11: ff16a465cee3f578 R12: ff16a4799fde0000
[103992.451978] R13: 0000000000020000 R14: 0000000000000000 R15: 0000000000000008
[103992.452796] FS:  0000000000000000(0000) GS:ff16a4a17f9c0000(0000) knlGS:0000000000000000
[103992.453627] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[103992.454454] CR2: 00007fb1fbc6a560 CR3: 000000afba548004 CR4: 0000000000771ee0
[103992.455229] PKRU: 55555554
[103992.456234] Call Trace:
[103992.457231]  <TASK>
[103992.457985]  ? show_trace_log_lvl+0x1c4/0x2df
[103992.458926]  ? show_trace_log_lvl+0x1c4/0x2df
[103992.459914]  ? abd_fletcher_4_iter+0x64/0xc0 [zcommon]
[103992.460886]  ? __die_body.cold+0x8/0xd
[103992.461829]  ? die_addr+0x39/0x60
[103992.462749]  ? exc_general_protection+0x1aa/0x400
[103992.463614]  ? asm_exc_general_protection+0x22/0x30
[103992.464441]  ? kfpu_end+0x34/0xa0 [zcommon]
[103992.465247]  abd_fletcher_4_iter+0x64/0xc0 [zcommon]
[103992.466032]  abd_iterate_func.part.0+0xbd/0x1c0 [zfs]
[103992.466907]  ? __pfx_abd_fletcher_4_iter+0x10/0x10 [zcommon]
[103992.467666]  abd_fletcher_4_native+0x7c/0xc0 [zfs]
[103992.468521]  ? update_sg_lb_stats+0x7e/0x450
[103992.469119]  ? blk_mq_start_request+0x34/0x120
[103992.469713]  ? nvme_prep_rq.part.0+0xab/0x110 [nvme]
[103992.470298]  ? nvme_queue_rqs+0x1e7/0x290 [nvme]
[103992.470959]  zio_checksum_error_impl+0xf9/0x640 [zfs]
[103992.471667]  ? __pfx_abd_fletcher_4_native+0x10/0x10 [zfs]
[103992.472362]  ? __blk_flush_plug+0xf1/0x150
[103992.473015]  ? remove_entity_load_avg+0x2e/0x70
[103992.473617]  ? migrate_task_rq_fair+0x14c/0x1d0
[103992.474228]  ? sched_clock+0xc/0x30
[103992.474743]  ? __smp_call_single_queue+0x93/0x120
[103992.475425]  ? ttwu_queue_wakelist+0xf2/0x110
[103992.475978]  ? try_to_wake_up+0x3e2/0x5d0
[103992.476622]  zio_checksum_error+0x64/0xc0 [zfs]
[103992.477363]  vdev_raidz_io_done+0x1b6/0x550 [zfs]
[103992.478090]  zio_vdev_io_done+0x7c/0x220 [zfs]
[103992.478811]  zio_execute+0x80/0x120 [zfs]
[103992.479534]  taskq_thread+0x2cc/0x500 [spl]
[103992.480143]  ? __pfx_default_wake_function+0x10/0x10
[103992.480731]  ? __pfx_zio_execute+0x10/0x10 [zfs]
[103992.481404]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[103992.481976]  kthread+0xdd/0x100
[103992.482520]  ? __pfx_kthread+0x10/0x10
[103992.483028]  ret_from_fork+0x29/0x50
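When comparing the two traces, it helps to drop the frames prefixed with "?". The kernel prints that marker for addresses it found on the stack but could not confirm as part of the actual call chain. Below is a rough Python sketch that keeps only the unmarked frames. The sample is a subset of lines copied from the trace above, and the helper name reliable_frames is hypothetical.

```python
import re

# A few lines copied from the second trace above.
TRACE = """\
[103992.464441]  ? kfpu_end+0x34/0xa0 [zcommon]
[103992.465247]  abd_fletcher_4_iter+0x64/0xc0 [zcommon]
[103992.466032]  abd_iterate_func.part.0+0xbd/0x1c0 [zfs]
[103992.466907]  ? __pfx_abd_fletcher_4_iter+0x10/0x10 [zcommon]
[103992.467666]  abd_fletcher_4_native+0x7c/0xc0 [zfs]
[103992.468521]  ? update_sg_lb_stats+0x7e/0x450
[103992.470959]  zio_checksum_error_impl+0xf9/0x640 [zfs]
[103992.476622]  zio_checksum_error+0x64/0xc0 [zfs]
[103992.477363]  vdev_raidz_io_done+0x1b6/0x550 [zfs]
"""

# [timestamp], optional "? " marker, then symbol+0xoffset/0xsize.
FRAME = re.compile(
    r"^\[\s*[\d.]+\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(trace):
    """Return symbol names of frames that are not marked with '?'."""
    frames = []
    for line in trace.splitlines():
        match = FRAME.match(line)
        if match and match.group(1) is None:
            frames.append(match.group(2))
    return frames


for name in reliable_frames(TRACE):
    print(name)
```

Filtered this way, the first trace reaches abd_fletcher_4_native from zio_checksum_compute on the write path (z_wr_iss), and the second reaches it from zio_checksum_error on the read path (z_rd_int_1). Both end at the same RIP, kfpu_end+0x34, after abd_fletcher_4_iter.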

rincebrain commented 8 hours ago

I think this is #14989. The workaround is in 2.2.x but has not been backported into any 2.1.x release so far (it's in 2.1.16-staging, but I don't know whether 2.1.16 will ever be released).

You could try cherry-picking from f288fdb4bd521f263277bcdc76cdec12a169a1e5 if you can't upgrade to 2.2.x, but upgrading to 2.2.x would probably be the simpler solution.
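To check a fleet of hosts against this advice, the version strings from the report can be compared with the 2.2 series. Below is a small Python sketch with hypothetical helper names. It assumes that any 2.2.x release carries the workaround, which is a simplification of the comment above: the first 2.2.x release to include it is not stated here. A 2.1.x build with the commit cherry-picked would still be flagged, because the version string alone cannot show that.

```python
import re


def parse_zfs_version(version_string):
    """Extract (major, minor, patch) from strings like 'zfs-kmod-2.1.15-3'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"no version found in {version_string!r}")
    return tuple(int(part) for part in match.groups())


def lacks_workaround(version_string, fixed_series=(2, 2)):
    """True if the release series is older than fixed_series.

    Assumption: every release in the 2.2 series or later has the workaround.
    """
    return parse_zfs_version(version_string)[:2] < fixed_series


for reported in ("zfs-2.1.15-3", "zfs-kmod-2.1.15-3"):
    print(reported, lacks_workaround(reported))
```

Both strings from the original report fall in the 2.1 series, so both are flagged.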