lvmteam / lvm2

Mirror of upstream LVM2 repository
https://gitlab.com/lvmteam/lvm2
GNU General Public License v2.0

System hang while removing root filesystem snapshot #159

Open gpenin opened 1 day ago

gpenin commented 1 day ago

Hi,

For many years, we have been using a backup system that creates an alternative boot environment using LVM snapshots. We have a main system volume group, called VG1, and an alternative volume group, called VG2. VG1 contains the rootfs, var, home, etc. logical volumes.

Weekly, we recreate the VG2 logical volumes from the content of the VG1 logical volumes: we take a snapshot of each VG1 logical volume and do a raw copy of it (using a "cat /dev/VG1/snap_LVNAME > /dev/VG2/LVNAME" command).
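
For clarity, here is a minimal sketch of one iteration of that weekly cycle (the LV name and snapshot size are illustrative, not our actual values):

```sh
# Illustrative only -- LV name and snapshot size are examples.
LV=rootfs
lvcreate --snapshot --size 2G --name "snap_${LV}" "/dev/VG1/${LV}"  # point-in-time thick snapshot
cat "/dev/VG1/snap_${LV}" > "/dev/VG2/${LV}"                        # raw copy into the alternative VG
lvremove -y "/dev/VG1/snap_${LV}"                                   # remove the snapshot afterwards
```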

Last week, we encountered a system hang that we are not able to explain (nor reproduce).

As of now, here are the elements we have:

[screenshots attached: console output and CPU metrics]

Any help would be greatly appreciated.

Regards,

zkabelac commented 1 day ago

Hi

Your report is missing some info needed for better analysis. When you say 'snapshot', do you mean the 'old/thick' or the 'new/thinly provisioned' kind?

For the old 'snapshot': if it was the last snapshot of an LV, the table line for the device needs to be reloaded, so lvm2 does a full 'flush' & 'fsfreeze' of your rootfs, and this may eventually cause some hard-to-explain/analyze trouble. Theoretically all should work, however the model was designed more than 20 years ago ;) and since that era many things have changed inside the kernel and userspace...
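
Roughly, removing the last snapshot of an origin makes lvm2 perform something like this device-mapper sequence (a simplified sketch only, not the literal commands lvm2 runs; the new table line is elided):

```sh
# Simplified illustration -- lvm2 does this via libdevmapper, not a shell.
dmsetup suspend VG1-rootfs                          # flush in-flight I/O, freeze the mounted fs
dmsetup reload VG1-rootfs --table "<linear table>"  # switch the origin back to a plain mapping
dmsetup resume VG1-rootfs                           # thaw; anything stuck before this point hangs the box
```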

If your system has a lot of memory, a lot of dirty pages, and a slow storage device, it might take 'a long moment' until all the in-flight operations land on your drive. However, you are not giving many logs from your system, so it's hard to understand the time sequence of your operations and everything that could have happened.

But anyway, long story short: since we don't have any more data, in your case I'd primarily suspect:

https://github.com/lvmteam/lvm2/commit/a3eb6ba425773224076c41aabc3c490a6a016ee6

which is part of the 2.03.17 release (and I'll need to add a comment about this commit to the WHATS_NEW file there, as it was likely forgotten to be mentioned).

Theoretically, some 'libaio' part could have been paged out, and that will freeze the command if it happens on a frozen rootfs. That would be the easiest explanation, although it's kind of hard to trigger; we have already seen one such case in the past, and that's why it has been fixed...

So please use a more recent version of lvm2...
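
You can check what you run now, and whether your lvm binary is linked against libaio at all (paths are the common defaults; adjust for your distro):

```sh
lvm version                             # installed LVM2, library and driver versions
ldd "$(command -v lvm)" | grep -i aio   # a libaio.so hit means the binary maps that library at runtime
```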

gpenin commented 21 hours ago

Hi @zkabelac,

Thanks a lot for your help. Unfortunately, we do not have many more logs, as the system was totally frozen and we had to force-reboot it in order to resolve an associated production incident (some clients did not switch to another healthy system as they should have, but that is another problem).

This is a pretty old system backup script, and it uses "old/thick" snapshots.

The scenario you describe is one we had in mind, because we quickly saw that not "mlocking" some libraries could lead us to similar freezes.

However, metrics didn't show any specific memory contention at the time the system froze (1 GB of usable memory out of 2 GB total). But we could see that:

[screenshot attached: memory metrics]

So the system was, for sure, somewhere in our script between the end of the copy and the beginning of the snapshot removal.

We'll ask OS support for an LVM2 version upgrade, or at least a cherry-pick of the mentioned patch from the 2.03.17 release.

May I ask whether switching "use_mlockall" to 1 in lvm.conf could be a safe workaround in the meantime?

Regards,

zkabelac commented 8 hours ago

First, yes: mlockall should 'mask' the issue, at the price of considerably more CPU time spent actually locking all that memory, plus the extra RAM being kept locked.
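
For reference, the setting you mention lives in the activation section of /etc/lvm/lvm.conf; a fragment showing only that option:

```
activation {
    # Lock the whole process address space with mlockall(2) instead of
    # lvm2's default of locking only selected critical memory ranges.
    use_mlockall = 1
}
```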

Also note: the problem with the locking is not just when your system is low on RAM and needs to swap. It also occurs whenever the system decides to page in some more pages from your binary that were not needed until that moment, or decides to free more space for the page cache to handle file copying better. These issues are really hard to chase and debug.

What I also tend to recommend is significantly lowering the amount of allowed 'dirty pages' on your system. This may actually improve some workloads (and maybe slow down some others... it all depends; you need to benchmark). But it will certainly make the tasks that wait for disk flushing and suspend far more predictable in time...
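
As a sketch of such tuning (values are illustrative starting points only; benchmark them for your workload):

```sh
# Cap the dirty page cache by absolute size instead of a percentage of RAM.
sysctl -w vm.dirty_background_bytes=67108864   # start background writeback at 64 MiB
sysctl -w vm.dirty_bytes=268435456             # throttle writers once 256 MiB is dirty
# Persist across reboots by putting the same keys into /etc/sysctl.d/.
```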