lundman opened 1 month ago
```
4.003894 ffff878f3b5a8040 0000000 Blocked  nt!KiSwapContext+0x76 nt!KiSwapThread+0xab5 nt!KiCommitThreadWait+0x137 nt!KeWaitForSingleObject+0x256 OpenZFS!spl_mutex_enter+0x11f OpenZFS!spl_cv_wait+0xf8 OpenZFS!txg_wait_synced_impl+0x290 OpenZFS!txg_wait_synced+0x29 OpenZFS!dmu_tx_wait+0x2ac OpenZFS!dmu_tx_assign+0x17e OpenZFS!zvol_os_write_zv+0x3c9 OpenZFS!wzvol_WkRtn+0x54e OpenZFS!wzvol_GeneralWkRtn+0x7b nt!IopProcessWorkItem+0x13c nt!ExpWorkerThread+0x155 nt!PspSystemThreadStartup+0x57 nt!KiStartSystemThread+0x34
4.0029fc ffff878f36f75040 0000938 Blocked  nt!KiSwapContext+0x76 nt!KiSwapThread+0xab5 nt!KiCommitThreadWait+0x137 nt!KeRemovePriQueue+0x1ff nt!ExpWorkerThread+0xd3 nt!PspSystemThreadStartup+0x57 nt!KiStartSystemThread+0x34
4.003490 ffff878f39963040 0000000 Blocked  nt!KiSwapContext+0x76 nt!KiSwapThread+0xab5 nt!KiCommitThreadWait+0x137 nt!KeWaitForSingleObject+0x256 OpenZFS!spl_mutex_enter+0x11f OpenZFS!spl_cv_wait+0xf8 OpenZFS!txg_wait_synced_impl+0x290 OpenZFS!txg_wait_synced+0x29 OpenZFS!dmu_tx_wait+0x2ac OpenZFS!dmu_tx_assign+0x17e OpenZFS!zvol_os_write_zv+0x3c9 OpenZFS!wzvol_WkRtn+0x54e OpenZFS!wzvol_GeneralWkRtn+0x7b nt!IopProcessWorkItem+0x13c
4.003260 ffff878f3b02c040 0000003 READY    nt!KiSwapContext+0x76 nt!KeYieldExecution+0xf11 nt!KeDelayExecutionThread+0x147 OpenZFS!kpreempt+0x2f OpenZFS!kmem_magazine_destroy+0x1ce OpenZFS!kmem_depot_ws_reap+0x1ca OpenZFS!kmem_cache_reap+0x122 OpenZFS!taskq_thread+0x51a
4.000cf4 ffff878f3f6d6080 0000000 RUNNING  OpenZFS!zfs_refcount_destroy_many+0x90 OpenZFS!zfs_refcount_destroy+0x17 OpenZFS!abd_fini_struct+0x68 OpenZFS!abd_free+0xdd OpenZFS!zio_pop_transforms+0x76 OpenZFS!zio_done+0x1392 OpenZFS!zio_execute+0x30f OpenZFS!zio_execute+0x30f OpenZFS!vdev_mirror_io_start+0x4d3 OpenZFS!zio_vdev_io_start+0x12b OpenZFS!zio_execute+0x30f OpenZFS!zio_execute+0x30f OpenZFS!taskq_thread+0x51a
4.002f34 ffff878f43491080 0000000 READY    nt!KiSwapContext+0x76 nt!KiDispatchInterrupt+0x67e nt!KiDpcInterrupt+0x350 nt!KeReleaseSpinLock+0x2d OpenZFS!spl_mutex_exit+0xfd OpenZFS!__zfs_dbgmsg+0xba OpenZFS!__dprintf+0x29e OpenZFS!zio_deadman+0x13c OpenZFS!zio_wait+0x69e OpenZFS!dsl_pool_sync+0x2f9 OpenZFS!spa_sync_iterate_to_convergence+0x109 OpenZFS!spa_sync+0x66d OpenZFS!txg_sync_thread+0x45b nt!PspSystemThreadStartup+0x57
4.000d3c ffff878f3a48d080 0000000 Blocked  nt!KiSwapContext+0x76 nt!KiSwapThread+0xab5 nt!KiCommitThreadWait+0x137 nt!KeWaitForSingleObject+0x256 OpenZFS!spl_mutex_enter+0x11f OpenZFS!spl_cv_wait+0xf8 OpenZFS!txg_wait_synced_impl+0x290 OpenZFS!txg_wait_synced+0x29 OpenZFS!dmu_tx_wait+0x2ac OpenZFS!dmu_tx_assign+0x17e OpenZFS!zvol_os_write_zv+0x3c9 OpenZFS!wzvol_WkRtn+0x54e OpenZFS!wzvol_GeneralWkRtn+0x7b nt!IopProcessWorkItem+0x13c
4.002908 ffff878f3e6f62c0 0000000 READY    nt!KiSwapContext+0x76 nt!KiSwapThread+0xab5 nt!KiCommitThreadWait+0x137 nt!KeWaitForSingleObject+0x256 OpenZFS!spl_mutex_enter+0x11f OpenZFS!spl_cv_wait+0xf8 OpenZFS!txg_wait_synced_impl+0x290 OpenZFS!txg_wait_synced+0x29 OpenZFS!dmu_tx_wait+0x2ac OpenZFS!dmu_tx_assign+0x17e OpenZFS!zvol_os_write_zv+0x3c9 OpenZFS!wzvol_WkRtn+0x54e OpenZFS!wzvol_GeneralWkRtn+0x7b nt!IopProcessWorkItem+0x13c
4.00325c ffff878f3a126040 0000000 STANDBY  nt!KiSwapContext+0x76 nt!KiSwapThread+0xab5 nt!KiCommitThreadWait+0x137 nt!KeWaitForSingleObject+0x256 OpenZFS!spl_mutex_enter+0x11f OpenZFS!spl_cv_wait+0xf8 OpenZFS!txg_wait_synced_impl+0x290 OpenZFS!txg_wait_synced+0x29 OpenZFS!dmu_tx_wait+0x2ac OpenZFS!dmu_tx_assign+0x17e OpenZFS!zvol_os_write_zv+0x3c9 OpenZFS!wzvol_WkRtn+0x54e OpenZFS!wzvol_GeneralWkRtn+0x7b nt!IopProcessWorkItem+0x13c nt!ExpWorkerThread+0x155 nt!PspSystemThreadStartup+0x57 nt!KiStartSystemThread+0x34
4.002c90 ffff878f38f06040 0000000 Blocked  nt!KiSwapContext+0x76 nt!KiSwapThread+0xab5 nt!KiCommitThreadWait+0x137 nt!KeWaitForSingleObject+0x256 OpenZFS!spl_mutex_enter+0x11f OpenZFS!spl_cv_wait+0xf8 OpenZFS!txg_wait_synced_impl+0x290 OpenZFS!txg_wait_synced+0x29 OpenZFS!dmu_tx_wait+0x2ac OpenZFS!dmu_tx_assign+0x17e OpenZFS!zvol_os_write_zv+0x3c9 OpenZFS!wzvol_WkRtn+0x54e OpenZFS!wzvol_GeneralWkRtn+0x7b nt!IopProcessWorkItem+0x13c
4.001bd4 ffff878f3b0c6580 0000000 Blocked  nt!KiSwapContext+0x76 nt!KiSwapThread+0xab5 nt!KiCommitThreadWait+0x137 nt!KeWaitForSingleObject+0x256 OpenZFS!spl_mutex_enter+0x11f OpenZFS!spl_cv_wait+0xf8 OpenZFS!txg_wait_synced_impl+0x290 OpenZFS!txg_wait_synced+0x29 OpenZFS!dmu_tx_wait+0x2ac OpenZFS!dmu_tx_assign+0x17e OpenZFS!zvol_os_write_zv+0x3c9 OpenZFS!wzvol_WkRtn+0x54e OpenZFS!wzvol_GeneralWkRtn+0x7b nt!IopProcessWorkItem+0x13c nt!ExpWorkerThread+0x155 nt!PspSystemThreadStartup+0x57 nt!KiStartSystemThread+0x34
4.0013ec ffff878f3b0b45c0 0000000 Blocked  nt!KiSwapContext+0x76 nt!KiSwapThread+0xab5 nt!KiCommitThreadWait+0x137 nt!KeWaitForSingleObject+0x256 OpenZFS!spl_mutex_enter+0x11f OpenZFS!spl_cv_wait+0xf8 OpenZFS!txg_wait_synced_impl+0x290 OpenZFS!txg_wait_synced+0x29 OpenZFS!dmu_tx_wait+0x2ac OpenZFS!dmu_tx_assign+0x17e OpenZFS!zvol_os_write_zv+0x3c9 OpenZFS!wzvol_WkRtn+0x54e OpenZFS!wzvol_GeneralWkRtn+0x7b nt!IopProcessWorkItem+0x13c
4.001704 ffff878f42a13080 0000000 Blocked  nt!KiSwapContext+0x76 nt!KiSwapThread+0xab5 nt!KiCommitThreadWait+0x137 nt!KeWaitForSingleObject+0x256 OpenZFS!spl_mutex_enter+0x11f OpenZFS!spl_cv_wait+0xf8 OpenZFS!txg_wait_synced_impl+0x290 OpenZFS!txg_wait_synced+0x29 OpenZFS!dmu_tx_wait+0x2ac OpenZFS!dmu_tx_assign+0x17e OpenZFS!zvol_os_write_zv+0x3c9 OpenZFS!wzvol_WkRtn+0x54e OpenZFS!wzvol_GeneralWkRtn+0x7b nt!IopProcessWorkItem+0x13c nt!ExpWorkerThread+0x155 nt!PspSystemThreadStartup+0x57 nt!KiStartSystemThread+0x34
4.0036e8 ffff878f3fbd6080 0000000 RUNNING  nt!write_char+0x46 nt!write_string+0x48 nt!output_l+0x5cc nt!snprintf+0x69 OpenZFS!__dprintf+0x1ee OpenZFS!txg_wait_synced_impl+0x200 OpenZFS!txg_wait_synced+0x29 OpenZFS!dmu_tx_wait+0x2ac OpenZFS!dmu_tx_assign+0x17e OpenZFS!zvol_os_write_zv+0x3c9 OpenZFS!wzvol_WkRtn+0x54e OpenZFS!wzvol_GeneralWkRtn+0x7b nt!IopProcessWorkItem+0x13c
```
Applying https://github.com/openzfs/zfs/pull/16216 and running through the test above, the ASSERT was not triggered. My breakpoint on the `zio_wait restarting hung I/O for pool 'tank'` path did fire, so I would have expected a panic after that point. So the change does appear to help with this problem, even if it isn't obvious to me why that is.
Woo commenting never fails!
```
VERIFY3( rc->rc_count == number ) failed (5120 == 0)
Break instruction exception - code 80000003 (first chance)
OpenZFS!zfs_refcount_destroy_many+0x90:
```
So we are in fact tracking two problems in https://github.com/openzfs/zfs/issues/15802
```
1: kd> dt rc
Local var @ 0xffff978c74d95748 Type refcount*
0xffff878f`85da9ba0
   +0x000 rc_count         : 0
   +0x008 rc_mtx           : kmutex
   +0x038 rc_tree          : avl_tree
   +0x058 rc_removed       : list
   +0x078 rc_removed_count : 0
   +0x07c rc_tracked       : 0 ( B_FALSE )
```
Same thing: `rc_count` is actually 0, even though the ASSERT print says otherwise.
buffer:
```
FFFF878F3A7C3080: zio.c:2236:zio_deadman(): zio_wait restarting hung I/O for pool 'BOOM'
FFFF878F3E55C080: dprintf: dnode.c:2470:dnode_diduse_space(): ds=BOOM/vol obj=1 dn=FFFF878F49F435E0 dnp=FFFF878F49F24200 used=586098688 delta=2048
FFFF878F3E55C080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_4096)(steps: 0) p->vs_start, end == 0, 4096
FFFF878F3E55C080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_heap)(steps: 0) p->vs_start, end == 0, 4096
FFFF878F3E55C080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_4096)(steps: 0) p->vs_start, end == 0, 4096
FFFF878F3E55C080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_heap)(steps: 0) p->vs_start, end == 0, 4096
FFFF878F3BDFB080: dprintf: dsl_dataset.c:291:dsl_dataset_block_kill(): freeing ds=81 DVA[0]=<0:421e7a00:c00> [L0 zvol object] fletcher4 zstd unencrypted LE contiguous unique single size=4000L/c00P birth=81L/81P fill=1 cksum=0000013e082cd2e0:00022b37cff68f1a:023e5ed0cc6a6648:b6f79395539380b9
FFFF878F3BDFB080: dprintf: dsl_dataset.c:145:dsl_dataset_block_born(): ds=FFFF878F49BF1000 DVA[0]=<0:ac8e2c00:1400> [L0 zvol object] fletcher4 zstd unencrypted LE contiguous unique single size=4000L/1400P birth=105L/105P fill=1 cksum=0000025d95e0c23b:000641ab5cf96f7c:0a80ce5969bbb04a:37e02aab71c09ebd
FFFF878F3A7C3080: zio.c:2236:zio_deadman(): zio_wait restarting hung I/O for pool 'BOOM'
FFFF878F3A7C3080: zio.c:2236:zio_deadman(): zio_wait restarting hung I/O for pool 'BOOM'
FFFF878F3A7C3080: zio.c:2236:zio_deadman(): zio_wait restarting hung I/O for pool 'BOOM'
FFFF878F3A7C3080: zio.c:2236:zio_deadman(): zio_wait restarting hung I/O for pool 'BOOM'
FFFF878F3E55C080: dprintf: dnode.c:2470:dnode_diduse_space(): ds=BOOM/vol obj=1 dn=FFFF878F49F435E0 dnp=FFFF878F49F24200 used=586100736 delta=-1024
FFFF878F3E55C080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_4096)(steps: 0) p->vs_start, end == 0, 4096
FFFF878F3E55C080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_heap)(steps: 0) p->vs_start, end == 0, 4096
FFFF878F3BDFB080: dprintf: dsl_dataset.c:291:dsl_dataset_block_kill(): freeing ds=81 DVA[0]=<0:421e9400:1000> [L0 zvol object] fletcher4 zstd unencrypted LE contiguous unique single size=4000L/1000P birth=81L/81P fill=1 cksum=000001d898ceee0b:000405240751c14a:056995fd2da433e0:6e23f7873626b125
FFFF878F3BDFB080: dprintf: dsl_dataset.c:145:dsl_dataset_block_born(): ds=FFFF878F49BF1000 DVA[0]=<0:ac8e4000:c00> [L0 zvol object] fletcher4 zstd unencrypted LE contiguous unique single size=4000L/c00P birth=105L/105P fill=1 cksum=000001536f6b6065:000221f16aedc85e:0224ef65e5714350:9b9edd3b7e772116
FFFF878F3A7C3080: zio.c:2236:zio_deadman(): zio_wait restarting hung I/O for pool 'BOOM'
FFFF878F3E55C080: dprintf: dnode.c:2470:dnode_diduse_space(): ds=BOOM/vol obj=1 dn=FFFF878F49F435E0 dnp=FFFF878F49F24200 used=586099712 delta=512
FFFF878F3E55C080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_4096)(steps: 0) p->vs_start, end == 0, 4096
FFFF878F3E55C080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_heap)(steps: 0) p->vs_start, end == 0, 4096
FFFF878F3E55C080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_4096)(steps: 0) p->vs_start, end == 0, 4096
FFFF878F3E55C080: SPL: vmem_freelist_insert_sort_by_time: at marker (bucket_heap)(steps: 0) p->vs_start, end == 0, 4096
FFFF878F3A7C3080: zio.c:2197:zio_deadman_impl(): slow zio[11]: zio=FFFF878F636B5E00x timestamp=10539098126100 delta=42800 queued=0 io=10539098127900 path=/dev/physicaldrive0 last=10539077025100 type=2 priority=3 flags=0x184080 stage=0x200000 pipeline=0x2e00000 pipeline-trace=0x200001 objset=81 object=1 level=0 blkid=72629 offset=2899201024 size=5120 error=0
VERIFY3( rc->rc_count == number ) failed (5120 == 0)
```
size and rc_count both being 5120 is curious.
I haven't looked at what that I/O is, but it seems reasonable to me that if some ZIO hasn't completed, its buffer is likely not freed. "failmode=continue" is a direct request for trouble.
Yeah, I assume the I/O was scheduled but not yet completed - the deadman triggers and the I/O is restarted. It would seem this isn't done as easily as the code assumes.
I would prefer it didn't panic, of course. It can be triggered on a system without tuning; it is just harder. If we don't want to support failmode=continue, then we should probably remove it.
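For context, `failmode` is a pool property with three values (`wait`, `continue`, `panic`); a quick sketch of inspecting and changing it, assuming a pool named `tank`:

```shell
# Show the current failmode setting for the pool
zpool get failmode tank

# 'continue' returns EIO to new write requests after catastrophic
# failure; 'wait' (the default) blocks I/O until the device recovers
zpool set failmode=wait tank
```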
To trigger the bug referenced in https://github.com/openzfsonwindows/openzfs/issues/364#issuecomment-2101862948, and probably also https://github.com/openzfs/zfs/issues/15802.
You will see the message:
I can confirm that https://github.com/openzfs/zfs/blob/master/module/zfs/zio.c#L2205-L2208 is reached a few times, until eventually we hit https://github.com/openzfsonwindows/openzfs/blob/development/module/zfs/refcount.c#L98
With the stack:
However, note that we die in the VERIFY because `rc_count` was 5120, yet when I inspect `rc_count` afterwards it is now 0. Certainly feels like a race. I additionally had a breakpoint at https://github.com/openzfsonwindows/openzfs/blob/development/module/zfs/zio.c#L2420, but it did not get triggered.