
PANIC at zio.c:315:zio_data_buf_alloc() #16527

Closed · micsuka closed this issue 1 month ago

micsuka commented 1 month ago

System information

Type                 | Version/Name
---------------------|----------------
Distribution Name    | Debian Linux
Distribution Version | 11
Kernel Version       | 5.10.0-28
Architecture         | amd64
OpenZFS Version      | 2.0.3-9+deb11u1

Describe the problem you're observing

About the setup: we run several MariaDB databases on ZFS on Debian 11; the servers hold the same data through replication. Each server has a zpool of mirrored SSDs, and the dataset is compressed. I've attached the parameters of the datasets/pools.
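(The attached dumps come from the usual property queries, roughly as follows; the pool/dataset names here are placeholders, not our real ones:)

zpool status -v tank        # mirror layout and device health
zpool get all tank          # pool-wide properties
zfs get all tank/mysql      # dataset properties (compression, recordsize, ...)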

I decided to move the database onto an encrypted dataset, so a few days ago I ran zfs send ... | zfs receive -o keyformat=raw -o keylocation=prompt ... on 4 servers.
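For completeness, the migration was a plain send into a freshly encrypted dataset on each server. A minimal sketch of the pipeline, with placeholder names and a file-based key in place of the prompt I actually used:

# Placeholder names: pool "tank", source "tank/mysql", target "tank/mysql_enc".
# A raw key must be exactly 32 bytes.
dd if=/dev/urandom of=/root/ds.key bs=32 count=1
zfs snapshot tank/mysql@migrate
zfs send tank/mysql@migrate | zfs receive \
    -o encryption=aes-256-gcm \
    -o keyformat=raw \
    -o keylocation=file:///root/ds.key \
    tank/mysql_enc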

After the datasets had been encrypted, this panic occurred on 2 of the 4 servers, after around 3 days.

Here is the kernel message on server 1:

Sep 09 19:22:56 malta0 kernel: VERIFY3(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT) failed (36028797018963967 < 32768)
Sep 09 19:22:56 malta0 kernel: PANIC at zio.c:315:zio_data_buf_alloc()
Sep 09 19:22:56 malta0 kernel: Showing stack for process 3392903
Sep 09 19:22:56 malta0 kernel: CPU: 23 PID: 3392903 Comm: mariadbd Tainted: P          IOE     5.10.0-28-amd64 #1 Debian 5.10.209-2
Sep 09 19:22:56 malta0 kernel: Hardware name: Thomas-Krenn.AG X10DRi/X10DRi, BIOS 1.0b 09/17/2014
Sep 09 19:22:56 malta0 kernel: Call Trace:
Sep 09 19:22:56 malta0 kernel:  dump_stack+0x6b/0x83
Sep 09 19:22:56 malta0 kernel:  spl_panic+0xd4/0xfc [spl]
Sep 09 19:22:56 malta0 kernel:  ? spl_kmem_cache_alloc+0x74/0x7d0 [spl]
Sep 09 19:22:56 malta0 kernel:  ? kmem_cache_alloc+0xed/0x1f0
Sep 09 19:22:56 malta0 kernel:  ? spl_kmem_cache_alloc+0x97/0x7d0 [spl]
Sep 09 19:22:56 malta0 kernel:  ? aggsum_add+0x175/0x190 [zfs]
Sep 09 19:22:56 malta0 kernel:  ? mutex_lock+0xe/0x30
Sep 09 19:22:56 malta0 kernel:  ? aggsum_add+0x175/0x190 [zfs]
Sep 09 19:22:56 malta0 kernel:  zio_data_buf_alloc+0x55/0x60 [zfs]
Sep 09 19:22:56 malta0 kernel:  abd_alloc_linear+0x8e/0xd0 [zfs]
Sep 09 19:22:56 malta0 kernel:  arc_hdr_alloc_abd+0xe3/0x1f0 [zfs]
Sep 09 19:22:56 malta0 kernel:  arc_hdr_alloc+0x104/0x170 [zfs]
Sep 09 19:22:56 malta0 kernel:  arc_alloc_buf+0x46/0x150 [zfs]
Sep 09 19:22:56 malta0 kernel:  dbuf_hold_copy.constprop.0+0x31/0xa0 [zfs]
Sep 09 19:22:56 malta0 kernel:  dbuf_hold_impl+0x480/0x670 [zfs]
Sep 09 19:22:56 malta0 kernel:  dbuf_hold_level+0x2b/0x60 [zfs]
Sep 09 19:22:56 malta0 kernel:  dmu_tx_check_ioerr+0x35/0xd0 [zfs]
Sep 09 19:22:56 malta0 kernel:  dmu_tx_count_write+0x68/0x1a0 [zfs]
Sep 09 19:22:56 malta0 kernel:  dmu_tx_hold_write_by_dnode+0x35/0x50 [zfs]
Sep 09 19:22:56 malta0 kernel:  zfs_write+0x3f1/0xc80 [zfs]
Sep 09 19:22:56 malta0 kernel:  zpl_iter_write+0x103/0x170 [zfs]
Sep 09 19:22:56 malta0 kernel:  new_sync_write+0x11c/0x1b0
Sep 09 19:22:56 malta0 kernel:  vfs_write+0x1ce/0x260
Sep 09 19:22:56 malta0 kernel:  ksys_write+0x5f/0xe0
Sep 09 19:22:56 malta0 kernel:  do_syscall_64+0x33/0x80
Sep 09 19:22:56 malta0 kernel:  entry_SYSCALL_64_after_hwframe+0x62/0xc7
Sep 09 19:22:56 malta0 kernel: RIP: 0033:0x7fbba8223fef
Sep 09 19:22:56 malta0 kernel: Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 29 fd ff ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77>
Sep 09 19:22:56 malta0 kernel: RSP: 002b:00007fbba44d16b0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
Sep 09 19:22:56 malta0 kernel: RAX: ffffffffffffffda RBX: 00000000000000a9 RCX: 00007fbba8223fef
Sep 09 19:22:56 malta0 kernel: RDX: 00000000000000a9 RSI: 00005581f7309338 RDI: 0000000000000026
Sep 09 19:22:56 malta0 kernel: RBP: 00007fbba44d1730 R08: 0000000000000000 R09: 0000000000000000
Sep 09 19:22:56 malta0 kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000000a9
Sep 09 19:22:56 malta0 kernel: R13: 00005581f7309338 R14: 00005581f7309338 R15: 0000000000000026

And here is the kernel log on server 2:

Sep 10 16:18:48 hetza1 kernel: VERIFY3(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT) failed (36028797018963967 < 32768)
Sep 10 16:18:48 hetza1 kernel: PANIC at zio.c:315:zio_data_buf_alloc()
Sep 10 16:18:48 hetza1 kernel: Showing stack for process 629911
Sep 10 16:18:48 hetza1 kernel: CPU: 13 PID: 629911 Comm: mariadbd Tainted: P           OE     5.10.0-28-amd64 #1 Debian 5.10.209-2
Sep 10 16:18:48 hetza1 kernel: Hardware name: ASUSTeK COMPUTER INC. KRPA-U16 Series/KRPA-U16 Series, BIOS 4102 11/17/2021
Sep 10 16:18:48 hetza1 kernel: Call Trace:
Sep 10 16:18:48 hetza1 kernel:  dump_stack+0x6b/0x83
Sep 10 16:18:48 hetza1 kernel:  spl_panic+0xd4/0xfc [spl]
Sep 10 16:18:48 hetza1 kernel:  ? spl_kmem_cache_alloc+0x74/0x7d0 [spl]
Sep 10 16:18:48 hetza1 kernel:  ? kmem_cache_alloc+0xed/0x1f0
Sep 10 16:18:48 hetza1 kernel:  ? spl_kmem_cache_alloc+0x97/0x7d0 [spl]
Sep 10 16:18:48 hetza1 kernel:  ? aggsum_add+0x175/0x190 [zfs]
Sep 10 16:18:48 hetza1 kernel:  ? mutex_lock+0xe/0x30
Sep 10 16:18:48 hetza1 kernel:  ? aggsum_add+0x175/0x190 [zfs]
Sep 10 16:18:48 hetza1 kernel:  zio_data_buf_alloc+0x55/0x60 [zfs]
Sep 10 16:18:48 hetza1 kernel:  abd_alloc_linear+0x8e/0xd0 [zfs]
Sep 10 16:18:48 hetza1 kernel:  arc_hdr_alloc_abd+0xe3/0x1f0 [zfs]
Sep 10 16:18:48 hetza1 kernel:  arc_hdr_alloc+0x104/0x170 [zfs]
Sep 10 16:18:48 hetza1 kernel:  arc_alloc_buf+0x46/0x150 [zfs]
Sep 10 16:18:48 hetza1 kernel:  dbuf_hold_copy.constprop.0+0x31/0xa0 [zfs]
Sep 10 16:18:48 hetza1 kernel:  dbuf_hold_impl+0x480/0x670 [zfs]
Sep 10 16:18:48 hetza1 kernel:  dbuf_hold_level+0x2b/0x60 [zfs]
Sep 10 16:18:48 hetza1 kernel:  dmu_tx_check_ioerr+0x35/0xd0 [zfs]
Sep 10 16:18:48 hetza1 kernel:  dmu_tx_count_write+0xed/0x1a0 [zfs]
Sep 10 16:18:48 hetza1 kernel:  dmu_tx_hold_write_by_dnode+0x35/0x50 [zfs]
Sep 10 16:18:48 hetza1 kernel:  zfs_write+0x3f1/0xc80 [zfs]
Sep 10 16:18:48 hetza1 kernel:  ? aa_sk_perm+0x3e/0x1b0
Sep 10 16:18:48 hetza1 kernel:  zpl_iter_write+0x103/0x170 [zfs]
Sep 10 16:18:48 hetza1 kernel:  new_sync_write+0x11c/0x1b0
Sep 10 16:18:48 hetza1 kernel:  vfs_write+0x1ce/0x260
Sep 10 16:18:48 hetza1 kernel:  ksys_write+0x5f/0xe0
Sep 10 16:18:48 hetza1 kernel:  do_syscall_64+0x33/0x80
Sep 10 16:18:48 hetza1 kernel:  entry_SYSCALL_64_after_hwframe+0x62/0xc7
Sep 10 16:18:48 hetza1 kernel: RIP: 0033:0x7f0c8c13bfef
Sep 10 16:18:48 hetza1 kernel: Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 29 fd ff ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77>
Sep 10 16:18:48 hetza1 kernel: RSP: 002b:00007f0c883f4c50 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
Sep 10 16:18:48 hetza1 kernel: RAX: ffffffffffffffda RBX: 000000000000002a RCX: 00007f0c8c13bfef
Sep 10 16:18:48 hetza1 kernel: RDX: 000000000000002a RSI: 00007f039401d978 RDI: 000000000000028c
Sep 10 16:18:48 hetza1 kernel: RBP: 00007f0c883f4cd0 R08: 0000000000000000 R09: 0000000000000234
Sep 10 16:18:48 hetza1 kernel: R10: 000000000000002a R11: 0000000000000293 R12: 0000000000000001
Sep 10 16:18:48 hetza1 kernel: R13: 000000000000002a R14: 00007f039401d978 R15: 000000000000028c

Attachments: zfs1info.txt, zfs2info.txt

I rule out a hardware problem: there was no trace of any error in the logs, and as I said, these systems have been rock solid for years. The servers have ECC RAM and the CPUs support hardware AES.
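If I read zio.c in 2.0 correctly, the numbers in the failed assertion themselves point at a zero-length allocation: zio_data_buf_alloc() computes the cache index as c = (size - 1) >> SPA_MINBLOCKSHIFT, and the bound is SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT = (1 << 24) >> 9 = 32768. With size == 0 the unsigned subtraction wraps around, and (2^64 - 1) >> 9 is exactly the 36028797018963967 seen in both panics. In other words, the ARC asked for a 0-byte data buffer, which smells like a corrupted block size rather than bad RAM. The arithmetic is easy to check (assuming python3 is available):

python3 -c 'print((2**64 - 1) >> 9, (1 << 24) >> 9)'
# prints: 36028797018963967 32768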

Describe how to reproduce the problem

I'm confident that this problem is related to ZFS native encryption.

Include any warning/errors/backtraces from the system logs

rincebrain commented 1 month ago

I recommend running a version released after 2021 and seeing if your problem is resolved.

(Specifically, https://github.com/openzfs/zfs/commit/4036b8d027fb7fe1a629b08a0d23cac975ab2eb9 might be useful, but there are a lot of bugs in native encryption, some of which have been fixed in the 3.5 years since 2.0.3 was released. If you don't want to upgrade, you should probably file bugs against Debian, not upstream.)
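(On Debian 11 the path of least resistance is bullseye-backports; a sketch, assuming backports isn't enabled yet — note that ZFS lives in contrib:)

echo 'deb http://deb.debian.org/debian bullseye-backports main contrib' \
    >> /etc/apt/sources.list.d/backports.list
apt update
apt install -t bullseye-backports zfs-dkms zfsutils-linux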

micsuka commented 1 month ago

Thank you. I've updated ZFS to 2.1.11-1~bpo11+1 on one server for now and re-enabled encryption. It is handling the same load; let's see how it behaves over the next few weeks.

micsuka commented 1 month ago

So, I now have zfs-2.1.11-1~bpo11+1 on all of our servers... and it seems to be stable.