Open thieleon opened 11 months ago
Do you have any LVM structures on that subvol? UDEV on the host might see them and activate the volume group, blocking the /dev/zd* device and through that all zfs operations on the dataset...
No it's just a proxmox containers file system (debian) without any LVM.
Does lsof have any open file handles to report?
I can't check that now, as I have already restarted the system. I will check for that next time. But as far as I can tell, there is no open handle.
The problem is back, but lsof does not output anything when I run
lsof /rpool/data/subvol-0815-disk-0/
Any other ideas?
BTW, ZFS is now updated to:
$ zfs --version
zfs-2.2.2-pve1
zfs-kmod-2.2.2-pve1
There might be something in the modern voodoo (which isolates the container from the rest of the system) that was not cleaned up properly when the container was torn down. Was a container running from that subvol while the commands failed? Or was (past tense) one running at any point between boot and the time you tried to execute these commands? Have you tried starting and shutting down the related container, to see if that maybe clears the error?
I know, it's a bit like poking around in a dense fog with a short stick...
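One way to poke at the isolation layer: a dataset can stay mounted inside another process's mount namespace even when no file on it is open, and a path-based lsof on the host will not show that. The sketch below scans every readable /proc/PID/mountinfo for a ZFS mount whose source is the given dataset and prints the matching PIDs. The function name find_mount_holders and the optional PROC_ROOT argument are made up for this example, and whether a leftover namespace mount is really the cause here is only a guess.

```shell
#!/bin/sh
# find_mount_holders DATASET [PROC_ROOT]
# Hypothetical helper: prints the PID of every process whose mount table
# (mountinfo) still lists DATASET as the source of a zfs mount.
# PROC_ROOT defaults to /proc; it is a parameter only so the function can be
# tried against a saved copy of the files.
find_mount_holders() {
    fmh_dataset=$1
    fmh_root=${2:-/proc}
    for fmh_file in "$fmh_root"/[0-9]*/mountinfo; do
        # Skip unreadable entries and the unexpanded glob if nothing matched.
        [ -r "$fmh_file" ] || continue
        # In mountinfo, the fields after the lone "-" separator are
        # filesystem type, mount source, and superblock options.
        if awk -v ds="$fmh_dataset" '
            {
                for (i = 7; i <= NF; i++) {
                    if ($i == "-") {
                        if ($(i + 1) == "zfs" && $(i + 2) == ds) found = 1
                        break
                    }
                }
            }
            END { exit (found ? 0 : 1) }' "$fmh_file"; then
            fmh_dir=${fmh_file%/mountinfo}
            echo "${fmh_dir##*/}"
        fi
    done
}
```

Run as root so that all processes are readable, for example: find_mount_holders rpool/data/subvol-0815-disk-0. Any PID it prints can then be inspected with ps -o pid,args -p PID to see which process is still holding the mount.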
System information
Describe the problem you're observing
I'm running a Proxmox cluster with a couple of nodes that use ZFS for storage. From time to time, when operating on the containers via the Proxmox UI, jobs get stuck. After some time (hours/days/..) I cancel the job, but it seems to keep running in the background. From then on, any operation on that subvol fails with "zfs: error: cannot rollback/remove/.. 'rpool/data/subvol-ABC-disk-0': dataset busy".
I can resolve this issue by rebooting. This is not a solution for me, as we run multiple customer services on these nodes and rebooting a Proxmox node causes considerable downtime.
Describe how to reproduce the problem
Pretty hard to reproduce, as this seems to occur randomly from time to time. Operations like removing a snapshot or rolling back are done pretty often, but just sometimes they run forever.
There are many people in the Proxmox community with the same problem, but the only suggestion there is to "reboot".
Include any warning/errors/backtraces from the system logs
With
ps auxf | grep POOL_ID
I can see that there are two jobs running, both in the blocked state "D". I tried killing them, but this does not work. Also,
lsof | grep POOL_ID
is not showing anything. The output of fuser -mv POOL_PATH:
dmesg shows these errors, maybe they help:
Any help is highly appreciated, I have run out of ideas.
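For the jobs stuck in state "D" described above, a small sketch like the following lists only the processes in uninterruptible sleep, together with the kernel function they are waiting in (the WCHAN column). The function names filter_blocked and list_blocked are invented for this example; the ps options assume the procps version of ps that ships with Debian and Proxmox.

```shell
#!/bin/sh
# filter_blocked: reads "ps" output on stdin, drops the header line and keeps
# only rows whose second column (STAT) starts with D (uninterruptible sleep).
filter_blocked() {
    awk 'NR > 1 && $2 ~ /^D/'
}

# list_blocked: prints PID, state, wait channel and command line of every
# process on the system that is currently in state D.
list_blocked() {
    ps -eo pid,stat,wchan:32,args | filter_blocked
}
```

Calling list_blocked when the problem occurs shows what the stuck jobs are waiting on. As root, cat /proc/PID/stack for one of the listed PIDs may additionally show the kernel stack of that process, if the kernel exposes it. Processes in state D do not act on signals until the kernel call they are blocked in returns, which would fit the observation that killing them has no effect.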