openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZTS: zvol_misc_trim VERIFY(zh->zh_claim_txg == 0) on F38 #14872

Closed tonyhutter closed 1 month ago

tonyhutter commented 1 year ago

System information

Type                  Version/Name
Distribution Name     Fedora
Distribution Version  38
Kernel Version
Architecture
OpenZFS Version       master

Describe the problem you're observing

Narrow down the cause of the following CI failure:

http://build.zfsonlinux.org/builders/Fedora%2038%20x86_64%20%28TEST%29/builds/198/steps/shell_4/logs/console

Describe how to reproduce the problem

Run the zvol_misc_trim test on the F38 buildbot; it fails with:

VERIFY(zh->zh_claim_txg == 0) failed
PANIC at zil.c:904:zil_create()

Include any warnings/errors/backtraces from the system logs

tonyhutter commented 1 year ago

Perhaps unrelated, but while trying to reproduce this issue on an F38 VM, I hit this other assertion while running the zvol_misc tests:

May 17 15:25:17 fedora38 kernel: EXT4-fs (zd0): unmounting filesystem 9c00a3ca-e8ba-4988-89af-11186365a7e6.
May 17 15:25:17 fedora38 kernel: VERIFY(dr->dr_dbuf == db) failed
May 17 15:25:17 fedora38 kernel: PANIC at dbuf.c:1101:dbuf_verify()
May 17 15:25:17 fedora38 kernel: Showing stack for process 1969
May 17 15:25:17 fedora38 kernel: CPU: 0 PID: 1969 Comm: zvol Tainted: P           OE      6.2.15-300.fc38.x86_64 #1
May 17 15:25:17 fedora38 kernel: Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-3.module+el8.7.0+16689+53d59bc2 04/01/2014
May 17 15:25:17 fedora38 kernel: Call Trace:
May 17 15:25:17 fedora38 kernel:  <TASK>
May 17 15:25:17 fedora38 kernel:  dump_stack_lvl+0x43/0x60
May 17 15:25:17 fedora38 kernel:  spl_panic+0xfc/0x120 [spl]
May 17 15:25:17 fedora38 kernel:  ? cityhash4+0x78/0xa0 [zfs]
May 17 15:25:17 fedora38 kernel:  spl_assert+0x17/0x20 [zfs]
May 17 15:25:17 fedora38 kernel:  dbuf_verify+0x59b/0x950 [zfs]
May 17 15:25:17 fedora38 kernel:  dbuf_hold_impl+0x17b/0xbe0 [zfs]
May 17 15:25:17 fedora38 kernel:  ? dbuf_read+0x2ce/0x740 [zfs]
May 17 15:25:17 fedora38 kernel:  dbuf_hold_level+0x2b/0x60 [zfs]
May 17 15:25:17 fedora38 kernel:  dmu_buf_hold_array_by_dnode+0x16b/0x6a0 [zfs]
May 17 15:25:17 fedora38 kernel:  dmu_read_impl+0xb5/0x1a0 [zfs]
May 17 15:25:17 fedora38 kernel:  zvol_get_data+0xc3/0x2c0 [zfs]
May 17 15:25:17 fedora38 kernel:  zil_lwb_commit+0x551/0x720 [zfs]
May 17 15:25:17 fedora38 kernel:  zil_process_commit_list+0x1d5/0x530 [zfs]
May 17 15:25:17 fedora38 kernel:  zil_commit_writer+0x100/0x1b0 [zfs]
May 17 15:25:17 fedora38 kernel:  zil_commit_impl+0x5d/0xa0 [zfs]
May 17 15:25:17 fedora38 kernel:  zvol_write+0x3cc/0x510 [zfs]
May 17 15:25:17 fedora38 kernel:  zvol_write_task+0xe/0x20 [zfs]
May 17 15:25:17 fedora38 kernel:  taskq_thread+0x1f7/0x410 [spl]
May 17 15:25:17 fedora38 kernel:  ? __pfx_default_wake_function+0x10/0x10
May 17 15:25:17 fedora38 kernel:  ? __pfx_zvol_write_task+0x10/0x10 [zfs]
May 17 15:25:17 fedora38 kernel:  ? __pfx_taskq_thread+0x10/0x10 [spl]
May 17 15:25:17 fedora38 kernel:  kthread+0xe6/0x110
May 17 15:25:17 fedora38 kernel:  ? __pfx_kthread+0x10/0x10
May 17 15:25:17 fedora38 kernel:  ret_from_fork+0x29/0x50
May 17 15:25:17 fedora38 kernel:  </TASK>

May 17 15:25:22 fedora38 kernel: list_del corruption. next->prev should be ffff9f273d018800, but was a5a5a5a5a5a5a5a5. (next=ffff9f273d047c00)
tonyhutter commented 1 year ago

Another failure example from the F38 VM:

[  434.721308] ZTS run /home/hutter/zfs/tests/zfs-tests/tests/functional/zvol/zvol_misc/zvol_misc_002_pos
[  434.798468] EXT4-fs (zd0): mounting ext2 file system using the ext4 subsystem
[  434.800255] EXT4-fs (zd0): mounted filesystem f609dc71-4e13-4c70-8b0e-33841d1bdf33 without journal. Quota mode: none.
[  435.366527] Kernel panic - not syncing: buffer modified while frozen!
[  435.366907] CPU: 7 PID: 1922 Comm: zvol Tainted: P           OE      6.2.15-300.fc38.x86_64 #1
[  435.367382] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-3.module+el8.7.0+16689+53d59bc2 04/01/2014
[  435.367880] Call Trace:
[  435.368019]  <TASK>
[  435.368142]  dump_stack_lvl+0x43/0x60
[  435.368349]  panic+0x360/0x370
[  435.368522]  arc_cksum_verify+0x216/0x220 [zfs]
[  435.368931]  arc_buf_thaw+0x2b/0xc0 [zfs]
[  435.369274]  dmu_buf_will_dirty_impl+0x1cc/0x350 [zfs]
[  435.369674]  dmu_write_uio_dnode+0x96/0x1c0 [zfs]
[  435.370054]  zvol_write+0x284/0x510 [zfs]
[  435.370429]  zvol_write_task+0xe/0x20 [zfs]
[  435.370782]  taskq_thread+0x1f7/0x410 [spl]
[  435.371020]  ? __pfx_default_wake_function+0x10/0x10
[  435.371294]  ? __pfx_zvol_write_task+0x10/0x10 [zfs]
[  435.371670]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[  435.371934]  kthread+0xe6/0x110
[  435.372113]  ? __pfx_kthread+0x10/0x10
[  435.372319]  ret_from_fork+0x29/0x50
[  435.372519]  </TASK>
[  435.372702] Kernel Offset: 0x3a000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  435.373279] ---[ end Kernel panic - not syncing: buffer modified while frozen! ]---
tonyhutter commented 1 year ago

I put out a fix in #14879, but it doesn't look like it fixes the original assert:

[15237.402108] VERIFY(zh->zh_claim_txg == 0) failed
[15237.405866] PANIC at zil.c:904:zil_create()
[15237.409462] Showing stack for process 2616803
[15237.413054] CPU: 1 PID: 2616803 Comm: zvol Tainted: P           OE      6.2.15-300.fc38.x86_64 #1
[15237.419861] Hardware name: Amazon EC2 m5d.large/, BIOS 1.0 10/16/2017
[15237.424039] Call Trace:
[15237.426794]  <TASK>
[15237.429438]  dump_stack_lvl+0x43/0x60
[15237.432644]  spl_panic+0xfc/0x120 [spl]
[15237.435959]  ? preempt_count_add+0x6a/0xa0
[15237.439333]  ? _raw_spin_lock+0x13/0x40
[15237.442618]  spl_assert+0x17/0x20 [zfs]
[15237.446196]  zil_create+0x428/0x590 [zfs]
[15237.449768]  ? dbuf_hold_impl+0x1c9/0xbe0 [zfs]
[15237.453485]  zil_process_commit_list+0x343/0x530 [zfs]
[15237.457459]  ? _raw_spin_lock+0x13/0x40
[15237.460736]  zil_commit_writer+0x100/0x1b0 [zfs]
[15237.464554]  zil_commit_impl+0x5d/0xa0 [zfs]
[15237.468248]  zvol_write+0x39b/0x510 [zfs]
[15237.471823]  ? __schedule+0x3d4/0x13e0
[15237.475070]  zvol_write_task+0xe/0x20 [zfs]
[15237.478704]  taskq_thread+0x1fa/0x410 [spl]
[15237.482125]  ? __pfx_default_wake_function+0x10/0x10
[15237.485800]  ? __pfx_zvol_write_task+0x10/0x10 [zfs]
[15237.489725]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[15237.493305]  kthread+0xe9/0x110
[15237.496330]  ? __pfx_kthread+0x10/0x10
[15237.499550]  ret_from_fork+0x2c/0x50
[15237.502720]  </TASK>
robn commented 2 months ago

I just reproduced this on a Debian 12.6 VM, OpenZFS master @ fd51786, 6 CPUs, 64G RAM.

$ uname -a
Linux shop 6.1.0-22-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.94-1 (2024-06-21) x86_64 GNU/Linux

$ zfs version
zfs-2.2.99-537_gfd51786f8
zfs-kmod-2.2.99-537_gfd51786f8

I've never seen it before, and I thought it was because of some ZIL changes I'm testing, but after rolling back to master it's still easily reproducible.

Of interest is that it's kernel 6.1. This VM was updated from Debian 12.5 just last night, and I can't reproduce it on a similarly sized quiz VM with a minimal 6.1.94 build from a week ago.

I'll try to narrow it down further.

robn commented 2 months ago

Alright, I think I figured it out. But I'm exhausted, long day. I'll tell you all about it tomorrow, hopefully in the form of a patch :)

robn commented 2 months ago

OK, further investigation today, and it took a twist in a direction I didn't quite expect. I still don't know all of the pieces, but here are some things I do know.

So when this happens, the pool is exported and the zvol has an empty block on its ZIL. When the pool is reimported, the ZIL block is claimed and its zh_claim_txg is set. But it's just an empty block with nothing to replay, so ZIL_REPLAY_NEEDED is not set. A call to zil_replay() would therefore bounce through zil_destroy(): zl_destroy_txg is set, the dataset is dirtied, and on the next sync zil_sync() clears the header. This is the only place that clears zh_claim_txg; the fact that the assert trips in zil_create() ultimately shows that neither zil_replay() nor zil_destroy() was called.

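To make that ordering concrete, here is a small userland model of the flow just described. This is a paraphrase under my own assumptions, not OpenZFS code; only the field and function names mirror the real ones.

#include <assert.h>
#include <stdio.h>

/* Toy stand-in for the on-disk ZIL header field discussed above. */
struct zil_header { unsigned long zh_claim_txg; };

static void zil_claim(struct zil_header *zh, unsigned long txg) {
    /* Import-time claim: sets zh_claim_txg even for an empty ZIL block. */
    zh->zh_claim_txg = txg;
}

static void zil_destroy_then_sync(struct zil_header *zh) {
    /* zil_destroy() dirties the dataset; zil_sync() then clears the
     * header. This models the only path that clears zh_claim_txg. */
    zh->zh_claim_txg = 0;
}

static void zil_create(struct zil_header *zh) {
    /* Models the VERIFY in zil.c that panics when the claim was never
     * cleared before a new ZIL chain is created. */
    assert(zh->zh_claim_txg == 0);
}

int main(void) {
    struct zil_header zh = { 0 };
    zil_claim(&zh, 42);
    /* Bug path: zil_replay()/zil_destroy() never run for the zvol, so a
     * zil_create(&zh) right here would trip the assert. Correct path: */
    zil_destroy_then_sync(&zh);
    zil_create(&zh);
    printf("claim cleared before create: ok\n");
    return (0);
}
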
zvol_os_create_minor() is where the call to zil_replay() happens for zvols. However, we're not getting down to that part of the function. I'm not yet sure why, but it seems it has to be because the device node already exists. We know this must be the case because the node is only created at the very end, when zvol_os_add_disk() is called, and that is only called if error == 0.

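A minimal sketch of that gating, again a model built from the description above rather than the real function body:

#include <stdbool.h>
#include <stdio.h>

static bool device_node_exists = false;

/* Paraphrase: zil_replay()/zil_destroy() and zvol_os_add_disk() both sit
 * at the end of zvol_os_create_minor(), gated on error == 0. */
static int zvol_os_create_minor_model(bool fail_midway) {
    int error = fail_midway ? -1 : 0;
    if (error == 0) {
        /* zil_replay() or zil_destroy() would run here, eventually
         * clearing zh_claim_txg... */
        device_node_exists = true;  /* ...then zvol_os_add_disk() */
    }
    return (error);  /* note: the real callers discard this */
}

int main(void) {
    (void) zvol_os_create_minor_model(true);
    /* A visible device node therefore cannot come from a failed create;
     * if one exists anyway, it must be left over from an earlier one. */
    printf("node after failed create: %s\n",
        device_node_exists ? "yes" : "no");
    return (0);
}
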
The device node definitely exists, though; there's no other way to get onto the write path in zvol_request_impl() except through the device node. So I'm not sure where it's coming from, except perhaps left over from the previous test. Maybe it's even the same failure that leads to a ZIL block being left over? More to study there.

All seven calls to zvol_os_create_minor() throw the error away. As a test I wrapped them all in VERIFY0(), and the zvol_misc test set tripped it almost immediately. So I guess we’re doing this a lot? I have no idea really; I’ve barely been inside the zvol code.

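For illustration, the diagnostic wrapping might look like this. It is only a sketch: VERIFY0 is modeled with assert() (the real SPL macro panics), and the stub stands in for the real function.

#include <assert.h>

/* Stand-in for the SPL macro, which panics when the expression != 0. */
#define VERIFY0(expr)   assert((expr) == 0)

static int zvol_os_create_minor_stub(const char *name) {
    (void) name;
    return (0);  /* stub: the real call can and does fail */
}

int main(void) {
    /* Before: (void) zvol_os_create_minor_stub("pool/vol"); */
    /* After: any failure now fails loudly instead of being discarded. */
    VERIFY0(zvol_os_create_minor_stub("pool/vol"));
    return (0);
}
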
So, what to do? I don't really want to tackle all of this, but at the same time I need to do something - it's blocking some ZIL changes that I'm trying to validate. In the absence of a complete fix, I'm thinking about having zil_claim() not set zh_claim_txg when the claim txg is in the past and there's nothing to replay. That's very targeted at this case though; not sure if it would actually be a bad thing to do in some real-world cases.

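A sketch of that idea follows. This is my reading of the proposal, not a merged change; the has_replay_records flag is a hypothetical stand-in for the real replay check.

#include <assert.h>
#include <stdbool.h>

struct zil_header { unsigned long zh_claim_txg; };

/* Proposal sketch: skip recording the claim when the block is empty and
 * there is nothing to replay, so a later zil_create() doesn't trip. */
static void zil_claim_sketch(struct zil_header *zh, unsigned long claim_txg,
    bool has_replay_records) {
    if (!has_replay_records)
        return;  /* leave the header unclaimed */
    zh->zh_claim_txg = claim_txg;
}

int main(void) {
    struct zil_header zh = { 0 };
    zil_claim_sketch(&zh, 42, false);  /* empty ZIL block at import */
    assert(zh.zh_claim_txg == 0);      /* zil_create()'s VERIFY holds */
    return (0);
}
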
I'll try again in a day or two; really need to get on with something different tomorrow.

amotin commented 2 months ago

I am not intimately familiar with Linux zvols, but on a quick look I see only one place that can fail between zvol_alloc() and zil_replay()/zil_destroy(): dataset_kstats_create(). And in that case I don't see the zvol being destroyed via zvol_os_free() or anything else, while its idx is freed via ida_simple_remove(). Am I missing something, or is there some innocent leak here, or is it related to the issue? At the very least it sounds wrong to me to free the idx before freeing the related disk.

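A hedged model of the error path being described (the names follow the comment above, but the bodies are stubs, not kernel code):

#include <stdio.h>

static int  ida_alloc_stub(void)   { static int next; return next++; }
static void ida_free_stub(int idx) { (void) idx; /* idx reusable again */ }

static int zvol_os_create_minor_model(void) {
    int idx = ida_alloc_stub();
    int error = -1;  /* pretend dataset_kstats_create() failed */
    if (error != 0) {
        /* The idx goes back to the allocator here, but the half-built
         * zvol/disk that still encodes it is never torn down via
         * zvol_os_free() on this path. */
        ida_free_stub(idx);
        return (error);
    }
    return (0);
}

int main(void) {
    (void) zvol_os_create_minor_model();
    printf("idx freed while the related disk may still exist\n");
    return (0);
}
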
robn commented 1 month ago

See #16364 for what's actually going on. As far as this issue is concerned, it's a symptom of something else.

At the very least it sounds wrong to me to free the idx before freeing the related disk.

Good observation. At this point it's just another thing that's wrong with the zvol code :disappointed: