lvmteam / lvm2

Mirror of upstream LVM2 repository
https://gitlab.com/lvmteam/lvm2
GNU General Public License v2.0

System hang while removing root filesystem snapshot #159

Open gpenin opened 1 day ago

gpenin commented 1 day ago

Hi,

For many years, we have been using a backup system that creates an alternative boot environment using LVM snapshots. We have a main system volume group, called VG1, and an alternative volume group, called VG2. VG1 contains the rootfs, var, home, etc. logical volumes.

Weekly, we recreate the VG2 logical volumes from the content of the VG1 logical volumes: we take a snapshot of each VG1 logical volume and do a raw copy of it (using a "cat /dev/VG1/snap_LVNAME > /dev/VG2/LVNAME" command).
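
For clarity, here is a minimal sketch of one iteration of that weekly cycle (the LV name and snapshot size are illustrative, not our actual values):

```sh
# Illustrative only -- LV name and snapshot size are examples.
LV=rootfs
lvcreate --snapshot --size 2G --name "snap_${LV}" "/dev/VG1/${LV}"  # point-in-time thick snapshot
cat "/dev/VG1/snap_${LV}" > "/dev/VG2/${LV}"                        # raw copy into the alternative VG
lvremove -y "/dev/VG1/snap_${LV}"                                   # remove the snapshot afterwards
```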

Last week, we encountered a system hang that we are not able to explain (nor reproduce).

As of now, here are the elements we have:

[screenshots attached: console output and CPU metrics]

Any help would be greatly appreciated.

Regards,

zkabelac commented 1 day ago

Hi

Your report is missing some info needed for better analysis. When you say 'snapshot', do you mean the 'old/thick' or the 'new/thinly provisioned' kind?

For the old 'snapshot': if it was the last snapshot of an LV, the table line for the device needs to be reloaded, so lvm2 does a full 'flush' & 'fsfreeze' of your rootfs, and this may eventually cause some hard-to-explain/analyze trouble. Theoretically all should work, however the model was designed more than 20 years ago ;) and since that era many things have changed inside the kernel and userspace...
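
Roughly, removing the last snapshot of an origin makes lvm2 perform something like this device-mapper sequence (a simplified sketch only, not the literal commands lvm2 runs; the new table line is elided):

```sh
# Simplified illustration -- lvm2 does this via libdevmapper, not a shell.
dmsetup suspend VG1-rootfs                          # flush in-flight I/O, freeze the mounted fs
dmsetup reload VG1-rootfs --table "<linear table>"  # switch the origin back to a plain mapping
dmsetup resume VG1-rootfs                           # thaw; anything stuck before this point hangs the box
```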

If your system has a lot of memory, a lot of dirty pages, and a slow storage device, it might take 'a long moment' until all the in-flight operations land on your drive. However, you are not giving many logs from your system, so it's hard to understand the time sequence of your operations and everything that could have happened.

But anyway, long story short: since we don't have any more data, in your case I'd primarily suspect:

https://github.com/lvmteam/lvm2/commit/a3eb6ba425773224076c41aabc3c490a6a016ee6

which is part of the 2.03.17 release (and I'll need to add a comment about this commit to the WHATS_NEW file there, as it was likely forgotten to be mentioned).

Theoretically, some 'libaio' part could have been paged out, and that will freeze the command if it happens on a frozen rootfs. That would be the easiest explanation, although it's kind of hard to trigger; we have already seen one such case in the past, and that's why it has been fixed...

So please use a more recent version of lvm2...
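
You can check what you run now, and whether your lvm binary is linked against libaio at all (paths are the common defaults; adjust for your distro):

```sh
lvm version                             # installed LVM2, library and driver versions
ldd "$(command -v lvm)" | grep -i aio   # a libaio.so hit means the binary maps that library at runtime
```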

gpenin commented 21 hours ago

Hi @zkabelac,

Thanks a lot for your help. Unfortunately, we do not have many more logs, as the system was totally frozen and we had to force-reboot it in order to resolve an associated production incident (some clients did not switch to another healthy system as they should have, but that is another problem).

This is a pretty old system backup script, and it uses "old/thick" snapshots.

The scenario you describe is one we had in mind, because we quickly saw that not "mlocking" some libraries could lead us to similar freezes.

However, metrics didn't show any specific memory contention at the time the system froze (1 GB of usable memory out of 2 GB total). But we could see that:

[screenshot attached: memory metrics]

So the system was, for sure, somewhere in our script between the end of the copy and the beginning of the snapshot removal.

We'll ask OS support for an LVM2 version upgrade, or at least a cherry-pick of the mentioned patch from the 2.03.17 release.

May I ask whether switching "use_mlockall" to 1 in lvm.conf could be a safe workaround in the meantime?

Regards,

zkabelac commented 8 hours ago

First, yes: mlockall should 'mask' the issue, at the price of considerably more CPU time spent actually locking all that memory, plus the extra RAM being kept locked.
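
For reference, the setting you mention lives in the activation section of /etc/lvm/lvm.conf; a fragment showing only that option:

```
activation {
    # Lock the whole process address space with mlockall(2) instead of
    # lvm2's default of locking only selected critical memory ranges.
    use_mlockall = 1
}
```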

Also note: the problem with the locking is not just when your system is low on RAM and needs to swap. It also occurs whenever the system decides to page in some more pages from your binary that were not needed until that moment, or decides to free more space for the page cache to handle file copying better. These issues are really hard to chase and debug.

What I also tend to recommend is significantly lowering the amount of allowed 'dirty pages' on your system. This may actually improve some workloads (and maybe slow down some others... it all depends; you need to benchmark). But it will certainly make the tasks that wait for disk flushing and suspend far more predictable in time...
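
As a sketch of such tuning (values are illustrative starting points only; benchmark them for your workload):

```sh
# Cap the dirty page cache by absolute size instead of a percentage of RAM.
sysctl -w vm.dirty_background_bytes=67108864   # start background writeback at 64 MiB
sysctl -w vm.dirty_bytes=268435456             # throttle writers once 256 MiB is dirty
# Persist across reboots by putting the same keys into /etc/sysctl.d/.
```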