Open flohoff opened 1 year ago
Addition: logging from journald (running journalctl -f in a second console) showed this, after which the machine hung:
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmsvc] [29258] RpcIn: received 16 bytes, content:"vmbackup.start 1"
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmbackup] [29258] *** VmBackupStart
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmsvc] [29258] VMTools_ConfigGetBoolean: Returning default value for '[vmbackup] quiesceApps'=TRUE (Not found err=3).
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmsvc] [29258] VMTools_ConfigGetBoolean: Returning default value for '[vmbackup] quiesceFS'=TRUE (Not found err=3).
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmsvc] [29258] VMTools_ConfigGetBoolean: Returning default value for '[vmbackup] allowHWProvider'=TRUE (Not found err=3).
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmsvc] [29258] VMTools_ConfigGetBoolean: Returning default value for '[vmbackup] execScripts'=TRUE (Not found err=3).
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmsvc] [29258] VMTools_ConfigGetString: Returning default value for '[vmbackup] scriptArg'=(null).
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmsvc] [29258] VMTools_ConfigGetBoolean: Returning default value for '[vmbackup] vssUseDefault'=TRUE (Not found err=3).
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmsvc] [29258] VMTools_ConfigGetBoolean: Returning default value for '[vmbackup] forceQuiesce'=FALSE (Not found err=3).
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmsvc] [29258] VMTools_ConfigGetBoolean: Returning default value for '[vmbackup] enableSyncDriver'=TRUE (Not found err=3).
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmsvc] [29258] VMTools_ConfigGetBoolean: Returning default value for '[vmbackup] enableNullDriver'=TRUE (Not found err=3).
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmbackup] [29258] Using quiesceApps = 1, quiesceFS = 1, allowHWProvider = 1, execScripts = 1, scriptArg = , timeout = 10, enableNullDriver = 1, forceQuiesce = 0
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmsvc] [29258] VMTools_ConfigGetString: Returning default value for '[vmbackup] excludedFileSystems'=(null).
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmbackup] [29258] Using excludedFileSystems = "(null)"
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmbackup] [29258] Quiescing volumes: (null)
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmbackup] [29258] *** VmBackup_SendEventNoAbort
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmbackup] [29258] Sending vmbackup event: vmbackup.eventSet reset 0
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmsvc] [29258] RpcChannel: Sending: 27 bytes
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmsvc] [29258] VSockChan: Sending request for conn 8, reqLen=27
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmsvc] [29258] SimpleSock: Sent 59 bytes from socket 8
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmsvc] [29258] SimpleSock: Recved 4 bytes from socket 8
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmsvc] [29258] SimpleSock: Recved 14 bytes from socket 8
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmsvc] [29258] VSockOut: recved 2 bytes for conn 8
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmsvc] [29258] RpcChannel: Recved 0 bytes
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmbackup] [29258] *** VmBackupStartScripts
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmbackup] [29258] Trying to run scripts from /etc/vmware-tools/backupScripts.d
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmsvc] [29258] RpcIn: sending 3 bytes
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmsvc] [29258] RpcIn: received 5 bytes, content:"ping\00"
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [ debug] [vmsvc] [29258] RpcIn: sending 3 bytes
@flohoff
Thank you for reporting this problem.
It definitely sounds as if there is an attempt to log to either a file or syslog after the filesystem has been frozen.
It has been a long-standing practice for quiesced snapshots on Linux that, if logs are being directed to a "file", "file+" or "syslog", a buffer is preallocated to hold log messages until the filesystem(s) are unfrozen. Only after the buffer is allocated are the Linux filesystems frozen.
This behavior is the same in all versions of open-vm-tools mentioned in this issue.
If another process freezes the file system before the buffer is allocated, vmtoolsd will become blocked.
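To make the ordering described above concrete, here is a minimal Python sketch of the idea: allocate the buffer first, hold log lines in memory while frozen, flush after the thaw. It is only a toy model and not the vmtoolsd implementation (which is written in C); the class name, method names, and the capacity parameter are all hypothetical, and it does not reproduce the actual blocking that occurs when a write hits a frozen filesystem.

```python
from collections import deque


class FreezeAwareLogger:
    """Toy model: hold log lines in memory while the filesystem is frozen."""

    def __init__(self, sink, capacity=64):
        self.sink = sink          # a list standing in for a log file or syslog
        self.capacity = capacity  # hypothetical limit on buffered lines
        self.buffer = None        # only allocated around a freeze
        self.dropped = 0

    def prepare_freeze(self):
        # Step 1: allocate the buffer BEFORE any filesystem is frozen.
        self.buffer = deque()

    def log(self, line):
        if self.buffer is None:
            # Normal path: write straight through to the sink.
            self.sink.append(line)
        elif len(self.buffer) < self.capacity:
            # Frozen path: keep the line in memory, do not touch the sink.
            self.buffer.append(line)
        else:
            self.dropped += 1

    def thaw(self):
        # Step 3: after unfreezing, flush whatever was buffered.
        pending = self.buffer or ()
        self.buffer = None
        self.sink.extend(pending)
        if self.dropped:
            self.sink.append(f"{self.dropped} message(s) dropped while frozen")
            self.dropped = 0


sink = []
logger = FreezeAwareLogger(sink, capacity=2)
logger.log("before freeze")
logger.prepare_freeze()          # step 1: buffer allocated
# step 2: the filesystems would be frozen here
for line in ("a", "b", "c"):
    logger.log(line)             # buffered; the third line exceeds capacity
logger.thaw()                    # step 3: flush to the real sink
```

In this model, a freeze that happens before `prepare_freeze()` would leave `log()` on the write-through path, which corresponds to the situation where vmtoolsd blocks.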
You provided a snippet of the tools.conf file in use on the Debian 11 system. In the snippet, the vmsvc.handler was directed to a "file" and to "syslog". That should not be a problem; the last setting should be used. It would help to see the full tools.conf file, however.
I am assuming that before the upgrade of the Galera Cluster, Veeam backups and vmtools logging to "syslog" were being used.
The journald output shows that the vmbackup task is about to run scripts from /etc/vmware-tools/backupScripts.d. Are there any scripts/programs in that directory?
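One quick way to answer that question is to list the directory's contents. The helper below is a small sketch, assuming Python is available on the guest; the function name is hypothetical, and it only reports which regular files exist and whether they are executable. It makes no claim about which of them vmtoolsd would actually run.

```python
import os


def list_backup_scripts(directory="/etc/vmware-tools/backupScripts.d"):
    """Return sorted (name, is_executable) pairs for regular files in directory."""
    try:
        names = sorted(os.listdir(directory))
    except FileNotFoundError:
        # A missing directory simply means there is nothing to report.
        return []
    entries = []
    for name in names:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            entries.append((name, os.access(path, os.X_OK)))
    return entries


if __name__ == "__main__":
    for name, executable in list_backup_scripts():
        print(name, "executable" if executable else "not executable")
```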
You mentioned:
Then I killed all extra debug/syslog config and set logging to false, and snapshots began to work. When I then re-enabled logging they continued to work.
What is the status of the Debian 11 VM(s) at this time?
Do the snapshots continue to work after each of the following?
systemctl restart vmtoolsd
shutdown / reboot
If the problem persists, diagnosing this issue requires more detailed analysis with a vm-support bundle from the host. We do not encourage sharing vm-support bundles on public forums, and the size would likely be more than is permitted on github.com. Please contact VMware Support for further diagnosis.
Ping @flohoff.
Describe the bug
On a quiesced snapshot, the VM hangs.
Reproduction steps
Issue a snapshot with filesystem quiescing.
Expected behavior
The snapshot is created.
Additional context
After upgrading a Galera Cluster from Buster to Bullseye, the Veeam backups caused the cluster to die. Investigation showed that snapshots with filesystem quiescing caused the VM to die/hang until reboot.
I started on Buster with 2:10.3.10-1+deb10u2 and upgraded to 2:11.2.5-2+deb11u1. Then I began seeing the issue every time the backup was scheduled. So I immediately upgraded to 2:12.1.0-2~bpo11+1 from Bullseye-Backports, which still had the issue.
After a lot of debugging we could not pinpoint a specific config item that causes this. When I turn off logging it works, but re-enabling logging does not reliably bring back the issue. Test setups have not shown this issue consistently, so my guess is that it's a race condition in enabling the log buffering in vmtoolsd.
I had the chance to do an strace -f on vmtoolsd during one hang and the last lines were:
So the sendto is unfinished and hanging. When this happened my config looked like this:
Log daemon was systemd-journald.
Then I killed all extra debug/syslog config and set logging to false, and snapshots began to work. When I then re-enabled logging they continued to work.
I am a bit puzzled by this, so my only explanation is that I am looking at a race condition in enabling the log buffer in vmtoolsd/vmbackup on fs freeze.
Flo