guenther-alka opened 3 months ago
hmm wonder what it could be - I'll have to try to replicate it
Such a blocking situation occurs more often with ZFS on OSX or Windows than on a filer with a ZFS raid that tolerates a disk failure without problems. ZFS on USB sticks, or single-disk USB backup pools where you forget the zpool export before removing the device, is especially critical.
Yeah we do have that on macOS, but as you say, you can reboot - and I recently did work on zpool clear to reopen devices that were renumbered.
But stopping a reboot isn't quite what we want - unless it is!
I do lock the driver busy when a pool is imported, so you can't unload it. Perhaps it should unlock it when suspended as well.
I created a vhdx file with a pool inside, then detached the vhdx and the pool gets suspended. Then asked the debugger to dump all the processes... 2 hours later I am still waiting for it to finish with the process list. It's not stuck, but sure is running like molasses.
OK, took 12 hours, but it did complete, so just really slow. Checked CPU when resuming the VM and it was mostly idle, so not on CPU, but maybe locking something. Then issued a reboot, which was just doing the spinning circle thing forever. Went back into the debugger to dump the processes again - it should only show stuck processes this time, since the rest should have been terminated in the pre-reboot phase. I suspect a reboot will eventually work, like if you wait 12 hours, but I ain't testing that theory. Meanwhile, tomorrow I'll check the stacks collected.
Right, so a stack looking like this is what remains:

```
[ffff878f405df180 ServiceHub.Ide]
3a3c.003498 ffff878f39b27080 0002ab5 Blocked nt!KiSwapContext+0x76
nt!KiSwapThread+0xab5
nt!KiCommitThreadWait+0x137
nt!KeWaitForMultipleObjects+0x306
OpenZFS!spl_cv_wait+0xea
OpenZFS!txg_wait_synced_impl+0x290
OpenZFS!txg_wait_synced+0x29
OpenZFS!dmu_tx_wait+0x1e5
OpenZFS!dmu_tx_assign+0x17e
OpenZFS!zfs_inactive+0x13a
OpenZFS!vnode_put+0x141
OpenZFS!fastio_query_open+0x2a3
+0xfffff8043191ab86
```
for processes like vmtoolsd.exe, explorer.exe, OpenWith.exe. The worst appears to be the one shown, ServiceHub.IdentityHost.exe.
It is correct for us to call `zfs_inactive()` from `vnode_put()` and sync from there. But that is Unix design, so perhaps we will need to compromise on Windows.
That sync is for atime - so one way to avoid it is setting `atime=off`.
Regarding atime=off:
atime is a ZFS filesystem property, not a zpool property, so you cannot disable it at zpool creation. How can one disable atime for a pool's filesystem itself permanently?
Even a pool unmount && zfs set atime off pool && mount pool reverts to atime on, and zfs mount with -o noatime does not seem to work and is not a permanent option.
When you create a pool "POOL" it also creates the root dataset "POOL", so you use capital `-O` to set options for it, like `zpool create -O atime=off POOL diskX...`. But atime can be changed at any time, so just `zfs set atime=off POOL`.
I see the following behaviour:
```
PS C:\Users\me> zpool create -o ashift=12 -O atime=off -f usb physicaldrive3
Expanded path to '\?\physicaldrive3'
working on dev '#1048576#6431965184#\?\physicaldrive3'
setting path here '/dev/physicaldrive3'
setting physpath here '#1048576#6431965184#\?\physicaldrive3'
PS C:\Users\me> zfs get atime usb
NAME  PROPERTY  VALUE  SOURCE
usb   atime     on     temporary
PS C:\Users\me> zfs set atime=off usb
cannot mount 'usb': Unknown error
property may be set but unable to remount filesystem
PS C:\Users\me> zfs get atime usb
NAME  PROPERTY  VALUE  SOURCE
usb   atime     on     temporary
```
The "temporary" is unexpected, but that it doesn't remount is as expected (no need in Windows, should remove the message).
@lundman writes:

> but that it doesn't remount is as expected (no need in Windows,

Do you mean that the NT kernel architecture allows flipping (stopping or starting) a FS's access-time updates on the fly?

> should remove the message).

If so, wouldn't it be better to remove the remount call (the cause) instead of removing the error message (the consequence) that appears after the pointless attempt and its subsequent failure?

> Do you mean that the NT kernel architecture allows flipping (stopping or starting) a FS's access-time updates on the fly?

Yeah, Windows doesn't have mounts like Unix; it's all internal.

> If so, wouldn't it be better to remove the remount call (the cause) instead of removing the error message (the consequence) that appears after the pointless attempt and its subsequent failure?

Sure, I meant merely "to fix", whatever that entails.
8e21206d573e0e216afcc58027494624ffcab678
When you remove a disk from a basic pool without a prior export, this results in an I/O error and the pool being suspended, with no option to reset it. I expect this as normal ZFS behaviour. Normally (on Solaris/illumos) I reboot the OS, with the option to remount and use the pool again. On Windows the reboot seems to hang, at least for the few minutes that I waited. A hanging ZFS pool should not block a reboot.
System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Windows 11 23H2 |
| Distribution Version | |
| Kernel Version | |
| Architecture | |
| OpenZFS Version | zfs version 2.2.3rc2 |
An ideal solution would be a "force export", but at least a reboot should be possible: https://github.com/openzfs/zfs/pull/11082