vmware / open-vm-tools

Official repository of VMware open-vm-tools project
http://sourceforge.net/projects/open-vm-tools/
2.27k stars 427 forks source link

Regression 2:10.3.10-1+deb10u2 -> 2:11.2.5-2+deb11u1 / vm hang on quiescense #650

Open flohoff opened 1 year ago

flohoff commented 1 year ago

Describe the bug

On quiecense snapshot VM hangs.

Reproduction steps

Issue a snapshot with filesystem quiescensing.

Expected behavior

Snapshot will be created

Additional context

After upgrading a Galera Cluster from Buster to Bullseye the Veeam backups caused the cluster to die. On investigation it showed that the Snapshots with filesystem quiescensing caused the VM to die/hang until reboot.

I started on Buster with 2:10.3.10-1+deb10u2, upgraded to 2:11.2.5-2+deb11u1. Then i began seeing the issue every time the backup was scheduled. So i immediatly upgraded to 2:12.1.0-2~bpo11+1 from Bullseye-Backports which still had the issue.

After debugging a lot we could not pinpoint a specific config item to cause this. When i turn off logging it works, reenabling logging does not bring back the issue reliably. Test setups have not shown this issue consistently so my guess is that its a race condition in enabling the log buffering in vmtoolsd.

I had the change to do an strace -f on vmtoolsd on one hang and the last lines were:

[pid 29258] close(11)                   = 0
[pid 29258] getpid()                    = 29258
[pid 29258] getpid()                    = 29258
[pid 29258] sendto(3, "<15>Mar 14 21:57:33 vmsvc[29258]"..., 105, MSG_NOSIGNAL, NULL, 0) = 105
[pid 29258] getpid()                    = 29258
[pid 29258] getpid()                    = 29258
[pid 29258] sendto(3, "<15>Mar 14 21:57:33 vmsvc[29258]"..., 139, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
[pid 29266] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed out)
[pid 29266] futex(0x5555e10d9830, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=999998561}) = -1 ETIMEDOUT (Connection timed out)
[pid 29266] futex(0x5555e10d9830, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=999998908}) = -1 ETIMEDOUT (Connection timed out)
[pid 29266] futex(0x5555e10d9830, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=999999283}) = -1 ETIMEDOUT (Connection timed out)
[pid 29266] futex(0x5555e10d9830, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=999999038}) = -1 ETIMEDOUT (Connection timed out)

So the sendto is unfinished hanging. When this happened my config looked like this:


[logging]
# Turns on logging globally. It can still be disabled for each domain.
log = true

# Disables core dumps on fatal errors; they're enabled by default.
# enableCoreDump = false

# Defines the "vmsvc" domain, logging to file
# vmsvc.level = message
vmsvc.handler = file
# Setup file rotation - keep 3 files
vmsvc.maxOldLogFiles = 3
# Max log file size kept: 1 MB
vmsvc.maxLogSize = 1

# Defines the "vmtoolsd" domain, and disable logging for it.
# vmtoolsd.level = none

# Enable "hgfsServer" request handling logging to the appropriate service file.
vmtoolsd.level = debug
vmtoolsd.handler = syslog
vmusr.level = debug
vmusr.handler = syslog
vmresset.level = debug
vmresset.handler = syslog
toolboxcmd.level = debug
toolboxcmd.handler = syslog
toolbox.level = debug
toolbox.handler = syslog
hgfsServer.level = debug
hgfsServer.handler = syslog
hgfsd.level = debug
hgfsd.handler = syslog
vmbackup.level = debug
vmbackup.handler = syslog
vmvss.level = debug
vmvss.handler = syslog
vmsvc.level = debug
vmsvc.handler = syslog

Log daemon was systemd-journald.

Then i killed all extra debug/syslog config and set logging to false, snapshots began to work. When i then reenabled logging they continued to work.

I am bit puzzled by this so my only explanation is that i am looking at a race condition in enabling the log buffer in vmtoolsd/vmbackup on fs freeze.

Flo

flohoff commented 1 year ago

Addition - Logging from journald (Running a journald -f in a second console) showed this, after which the machine hung:

Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmsvc] [29258] RpcIn: received 16 bytes, content:"vmbackup.start 1"
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmbackup] [29258] *** VmBackupStart
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmsvc] [29258] VMTools_ConfigGetBoolean: Returning default value for '[vmbackup] quiesceApps'=TRUE (Not found err=3).
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmsvc] [29258] VMTools_ConfigGetBoolean: Returning default value for '[vmbackup] quiesceFS'=TRUE (Not found err=3).
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmsvc] [29258] VMTools_ConfigGetBoolean: Returning default value for '[vmbackup] allowHWProvider'=TRUE (Not found err=3).
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmsvc] [29258] VMTools_ConfigGetBoolean: Returning default value for '[vmbackup] execScripts'=TRUE (Not found err=3).
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmsvc] [29258] VMTools_ConfigGetString: Returning default value for '[vmbackup] scriptArg'=(null).
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmsvc] [29258] VMTools_ConfigGetBoolean: Returning default value for '[vmbackup] vssUseDefault'=TRUE (Not found err=3).
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmsvc] [29258] VMTools_ConfigGetBoolean: Returning default value for '[vmbackup] forceQuiesce'=FALSE (Not found err=3).
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmsvc] [29258] VMTools_ConfigGetBoolean: Returning default value for '[vmbackup] enableSyncDriver'=TRUE (Not found err=3).
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmsvc] [29258] VMTools_ConfigGetBoolean: Returning default value for '[vmbackup] enableNullDriver'=TRUE (Not found err=3).
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmbackup] [29258] Using quiesceApps = 1, quiesceFS = 1, allowHWProvider = 1, execScripts = 1, scriptArg = , timeout = 10, enableNullDriver = 1, forceQuiesce = 0
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmsvc] [29258] VMTools_ConfigGetString: Returning default value for '[vmbackup] excludedFileSystems'=(null).
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmbackup] [29258] Using excludedFileSystems = "(null)"
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmbackup] [29258] Quiescing volumes: (null)
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmbackup] [29258] *** VmBackup_SendEventNoAbort
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmbackup] [29258] Sending vmbackup event: vmbackup.eventSet reset 0
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmsvc] [29258] RpcChannel: Sending: 27 bytes
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmsvc] [29258] VSockChan: Sending request for conn 8,  reqLen=27
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmsvc] [29258] SimpleSock: Sent 59 bytes from socket 8
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmsvc] [29258] SimpleSock: Recved 4 bytes from socket 8
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmsvc] [29258] SimpleSock: Recved 14 bytes from socket 8
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmsvc] [29258] VSockOut: recved 2 bytes for conn 8
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmsvc] [29258] RpcChannel: Recved 0 bytes
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmbackup] [29258] *** VmBackupStartScripts
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmbackup] [29258] Trying to run scripts from /etc/vmware-tools/backupScripts.d
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmsvc] [29258] RpcIn: sending 3 bytes
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmsvc] [29258] RpcIn: received 5 bytes, content:"ping\00"
Mar 14 21:56:48 svrb-ncgc3-prd vmsvc[29258]: [   debug] [vmsvc] [29258] RpcIn: sending 3 bytes
johnwvmw commented 1 year ago

@flohoff

Thank you for reporting this problem.

It definitely sounds as if there is an attempt to log to either a file or syslog after the filesystem has been frozen.

It has been a long standing practice for quiesced snapshots on Linux, that if logs are being directed to a "file", "file+" or "syslog", a buffer is preallocated to hold log messages until the filesystem(s) are unfrozen. Only after the buffer is allocated are the Linux filesystems frozen.

This behavior has not changed in all versions of open-vm-tools mentioned in this issue.

If another process freezes the file system before the buffer is allocated, vmtoolsd will become blocked.

You provided a snippet of the tools.conf file in use on the Debian 11 system. in the snippet, the vmsvc.handler was directed to a "file" and to "syslog". That should not be a problem, the last setting should be used. It would help to see the full tools.conf file however.

I am assuming that before the upgrade of the Galera Cluster, Veeam backups and vmtools logging to "syslog" were being used.

The journald output shows that the vmbackup task is about to run scripts from /etc/vmware-tools/backupScripts.d. Are there any scripts/programs in that directory.

You mentioned: Then i killed all extra debug/syslog config and set logging to false, snapshots began to work. When i then reenabled logging they continued to work.

If the problem persists, diagnosing this issue requires more detailed analysis with vm-support bundle from the host. We do not encourage sharing vm-support bundles on public forums and the size would likely be more than permitted on github.com. Please contact VMware Support service for further diagnosis.

johnwvmw commented 1 year ago

Ping @flohoff.