vmware / open-vm-tools

Official repository of VMware open-vm-tools project
http://sourceforge.net/projects/open-vm-tools/

Cloudlinux7, Snapshots hang, then crash server when trying to Quiesce #372

Open Lonecrowe opened 4 years ago

Lonecrowe commented 4 years ago

Taking a snapshot with "Quiesce guest file system" enabled immediately crashes the server on a CloudLinux 7 machine using paravirtual drivers.

I have open vm tools on some Centos 7 servers that seem to work fine with the tools, paravirtual and LSI.

I've been unable to get any meaningful logs because when I attempt to do this it crashes the server for a good 15 mins. It takes at least that long to cancel the snap properly and hard reboot the server.

Can someone point out where I should be gathering crash logs? I added the debug section to the tools config. The problem is that reproducing it takes the server down.

It also reports Cloudlinux as Other 3.x or later Linux (64-bit)

dsouders commented 4 years ago

I don't know about debugging CloudLinux, but perhaps this will help:

https://cloudlinux.zendesk.com/hc/en-us/articles/115004538245-How-do-I-install-and-configure-kdump-if-a-server-hangs

To enable tools debug logging, please refer to:

https://kb.vmware.com/s/article/1007873

Lonecrowe commented 4 years ago

Right thanks - I did enable that debugging. I had not seen anything other than it mentioning things are timing out. Kdump is already on and there is no core dump because it locks so hard nothing is available. I have to cancel the snapshot, wait 15 mins then hard shut down the VM.

Thanks though :)

Snapshots with memory, and snapshots without memory and without quiesce, work fine though.

Lonecrowe commented 4 years ago

I mean, in essence, isn't quiescing just telling the OS "it's OK, we are taking away the ability to write any new data to the hard drive for a while until we back things up, so hold onto your data and write it later"? So if it locks up, it can't write to a log or anything. The whole system locks.

Cloudlinux has an LVE manager and basically sticks all accounts into their own caged file system and amount of shared resources.

dsouders commented 4 years ago

Quiescing syncs everything to disk to get a consistent snapshot. You can work around that aspect of it by disabling the sync driver. See: https://communities.vmware.com/thread/493329

Lonecrowe commented 4 years ago

Oddly enough there isn't a disk.EnableUUID in my vmx file. There are some UUIDs for the BIOS, VC, etc.

dsouders commented 4 years ago

disk.EnableUUID is not present in the config file (vmx file) by default. This page has a little more info: https://kb.vmware.com/s/article/2079220

Lonecrowe commented 4 years ago

"Note: The disk.EnableUUID parameter is not included in the .vmx file by default. If the parameter does not exist, it is taken as false since this is the default behavior. "

So it says that if it's not there it is already considered false, which is what I want.

Also, tools.conf didn't exist and I just created it. Looking at the article, I seem to remember seeing a couple of those debug messages, but I've been unable to find them in any of the logs to confirm.
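The "missing means false" rule from the KB note can be checked against a copy of the .vmx file. The following is a minimal sketch, assuming the usual `key = "value"` line format and treating key names as case-insensitive (an assumption, not something stated in this thread); `disk_enable_uuid` is a hypothetical helper, not part of any VMware tool.

```python
import re

# Matches one .vmx assignment such as: disk.EnableUUID = "TRUE"
_VMX_LINE = re.compile(r'^\s*([^#\s=]+)\s*=\s*"?(.*?)"?\s*$')


def disk_enable_uuid(vmx_text: str) -> bool:
    """Return the effective disk.EnableUUID value for the text of a .vmx file.

    A missing key is reported as False, matching the KB note quoted above.
    """
    for line in vmx_text.splitlines():
        match = _VMX_LINE.match(line)
        # Assumption: key names are compared case-insensitively.
        if match and match.group(1).lower() == "disk.enableuuid":
            return match.group(2).strip().lower() == "true"
    return False
```

Reading the file from the datastore and passing its contents to `disk_enable_uuid` shows whether the parameter is effectively on without having to guess from its absence.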

I saw this during one of the crashes or something similar.

"GuestRpcSendTimedOut: message to toolbox timed out."

Lonecrowe commented 4 years ago

Is there a log on the HOST itself I can tail to see what is happening? I'm tailing the actual vmware.log in the datastore where the files reside.

On the guest there are /var/log/vmware-vmsvc.log files but as soon as I start the snap that shell is locked hard.

Lonecrowe commented 4 years ago

OMG it worked. Snapshots are working after I added the line. Holy crap. Backups are working too. I'll have to test on the other server. I'm not sure if it's the tools.conf change or the vmx change that did it.

But if this works I owe you a case of beer :)

dsouders commented 4 years ago

Ah, right, disk.EnableUUID defaults to FALSE.

Disabling the sync driver might still help, though, because there could be something running in the guest that is sensitive to disk latency:

  1. Add these parameters to the file:

[vmbackup]
enableSyncDriver = false

Note: This only runs a sync operation before the snapshot, and does not run a FREEZE on the filesystem.
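For anyone applying this to several servers, the edit can be scripted. This is a rough sketch, assuming tools.conf is a plain INI-style file that Python's `configparser` can read; `set_sync_driver` is a made-up helper name. Note that `configparser` lowercases keys by default, so the sketch overrides `optionxform` to keep `enableSyncDriver` spelled as shown above, and that rewriting the file this way drops any comments it contained.

```python
import configparser
import io


def set_sync_driver(conf_text: str, enabled: bool) -> str:
    """Return tools.conf text with [vmbackup] enableSyncDriver set as requested."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # preserve key case (enableSyncDriver)
    parser.read_string(conf_text)
    if not parser.has_section("vmbackup"):
        parser.add_section("vmbackup")
    parser.set("vmbackup", "enableSyncDriver", "true" if enabled else "false")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    # Starting from an empty file, as in the case where tools.conf did not exist.
    print(set_sync_driver("", enabled=False))
```

The function works on text rather than on /etc/vmware-tools/tools.conf directly, so the result can be reviewed before it is written to the guest.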

"GuestRpcSendTimedOut: message to toolbox timed out."

The GuestRpcSendTimedOut message is a symptom of the guest being hung. There may be other messages in the log indicating Tools is not responding.

"Is there a log on the HOST itself I can tail to see what is happening? I'm tailing the actual vmware.log in the datastore where the files reside."

I'm confused as to your configuration. The vmware.log file should not be affected by quiescing the file systems inside the guest. What product are you using (esx, workstation, fusion)?

Lonecrowe commented 4 years ago

ESXi / vCenter. These are all CLI-based Linux servers. No, the vmware.log file on the datastore is responding correctly. The file in the guest itself (the VM) would lock hard.

So I'll test this on the 2nd server and create an /etc/vmware-tools/tools.conf and add that directive and see if that is indeed what fixed it.

dsouders commented 4 years ago

I'm glad it's working for you. Please note that taking snapshots with the sync driver disabled does not "freeze" the guest file systems, which means there could be some inconsistencies in the snapshots if they are taken while there is ongoing disk activity.

Lonecrowe commented 4 years ago

Well it could be a missed email or something like that but the backups run late at night and each domain is fully backed up as well offsite so I think we'll be ok.

This second server is taking a LONG time to snapshot and is freezing things up. So I think it is a combination of both changes.

Lonecrowe commented 4 years ago

Yep looks like it worked on both. However this still is an "issue" with cloudlinux it seems.

dsouders commented 4 years ago

Thanks! An internal bug has been filed to track this issue.

ravindravmw commented 4 years ago

@Lonecrowe it is not very clear which setting you had to add to make it work. Also, I'd like to understand it a little more. So, on a failing VM, could you please create a /etc/vmware-tools/tools.conf with the following entries:

[logging]
vmsvc.level=debug
vmsvc.handler=vmx
vmbackup.level=debug
vmbackup.handler=vmx

And, then reproduce the issue. This will generate logs in vmware.log under VM's directory. You can collect and share vmware.log with us for analysis of this issue.
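Once vmware.log has been collected, a small filter can help pull out the lines worth reading first. This is only a sketch: the default search strings are taken from this thread ("vmbackup" from the logging domain above, "GuestRpcSendTimedOut" from the message quoted earlier), and the exact format of the lines in a real vmware.log is not assumed.

```python
def backup_related_lines(log_text, needles=("vmbackup", "GuestRpcSendTimedOut")):
    """Return the log lines that contain any of the given substrings.

    Matching is case-insensitive and purely textual; no log format is assumed.
    """
    lowered = [needle.lower() for needle in needles]
    return [
        line
        for line in log_text.splitlines()
        if any(needle in line.lower() for needle in lowered)
    ]
```

Passing the contents of the collected vmware.log to `backup_related_lines` gives a shorter excerpt to attach, alongside the full file.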

Lonecrowe commented 4 years ago

Difficult to reproduce the issue because it crashes a whole bunch of people's domains.

[vmbackup]
enableSyncDriver = false

seems to fix it, but I also added the disk.EnableUUID setting (disabled) on both servers.

open-vm-tools does not recognize the OS as CloudLinux, just as an older version of Linux. CloudLinux also has LVE Manager, which limits connections and cages each domain's resources. I wonder if that specific kernel has something to do with the compatibility?

ravindravmw commented 4 years ago

We just call the FIFREEZE/FITHAW IOCTLs to quiesce/resume the disk IO inside the guest. It is possible that this Linux kernel handles these IOCTLs differently.

maxio-co commented 4 years ago

Hi Team

Experiencing the same:

open-vm-tools-10.3.0-2.el7_7.1.x86_64

# cat /etc/*-release
CloudLinux release 7.7 (Valery Bykovsky)
NAME="CloudLinux"
VERSION="7.7 (Valery Bykovsky)"

I have found similar articles on the VMware KB about this problem:

https://kb.vmware.com/s/article/1018194
https://kb.vmware.com/s/article/2090545

Any updates on a fix for open-vm-tools with CloudLinux 7?

I'll try the above steps and report back.

PS: out of curiosity, how much RAM do your VMs have?

Ours is currently allocated 32 GB; I'm wondering if RAM allocation plays a role in the snapshot timeout?

Cheers G

maxio-co commented 4 years ago

Just reporting back, these edits have resolved the VM crashing for me but not the lack of quiesce issue.

thank you. G