Open ukguy opened 8 years ago
I am also experiencing this issue with the following:
Centos 7.2 XFS LVM Latest Cpanel 4GB RAM.
In my case the VM actually panicked and needed a reboot.
Yes, that's exactly what happens here. This whole quiescing issue is a bit of a mystery. I originally experienced it 2.5 years ago in Centos 6 and after a lot of research the conclusion was that it was a kernel issue with the freeze/thaw in Centos.
However, this is our situation now:
Centos 6.8 VM, ext4, VMware native tools: quiescing off, or it crashes
Centos 7.3 VM, xfs, open-vm-tools: quiescing on, works perfectly (crashes with native tools)
Centos 7.3 VMs (x2), ext4, VMware native tools: quiescing on, works perfectly (crashes with open tools)
So you'd be left thinking..."Centos 7 ext4 works perfectly with native tools".
No...not always.
We set up exactly the same VM for a new customer, Centos 7.3 with ext4:
VMware native tools, quiescing on: crashes
open-vm-tools, quiescing on: crashes
There is no difference at all between the working Centos 7 VMs and the one which crashes. We spent hours changing configs, setting up XFS and ext4 etc. Something I did find is that in some cases it did not crash when quiescing BEFORE Cpanel was installed, though again this was not consistent as far as I can remember.
As it stands now, quiescing is off on this VM; we couldn't invest any more time. One day we'll reach a definitive conclusion!
Please always state which versions of VMware Tools and open-vm-tools are being used.
When you say "crashes", which component does it refer to: the vmtoolsd process or the guest kernel? Would it be possible to recreate the issue with VMware Tools with the following contents in /etc/vmware-tools/tools.conf?
[logging]
vmtoolsd.level=debug
vmtoolsd.handler=vmx
vmsvc.level=debug
vmsvc.handler=vmx
vmbackup.level=debug
vmbackup.handler=vmx
You will need to give vmtoolsd about 5 seconds to pick up the config file changes before attempting to reproduce the issue. After reproducing it, please collect vmware.log from the VM's directory. If it is a vmtoolsd crash, please collect its core dump as well. Either open a support case with all the data or send it over in an email.
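For anyone applying the logging settings above on several guests, here is a minimal Python sketch that merges them into the text of an existing tools.conf. The function name `render_tools_conf` is made up for illustration, and the approach assumes the file is plain INI-style text. Note that `configparser` drops comments from the existing file, so treat this as a starting point rather than a safe in-place editor.

```python
import configparser
import io

# Keys and values taken from the [logging] section requested above.
DEBUG_LOGGING = {
    "vmtoolsd.level": "debug",
    "vmtoolsd.handler": "vmx",
    "vmsvc.level": "debug",
    "vmsvc.handler": "vmx",
    "vmbackup.level": "debug",
    "vmbackup.handler": "vmx",
}


def render_tools_conf(existing_text: str = "") -> str:
    """Return tools.conf text with the debug logging keys merged in.

    Hypothetical helper: it only builds the text and does not touch
    /etc/vmware-tools/tools.conf itself.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case exactly as written
    parser.read_string(existing_text)
    if not parser.has_section("logging"):
        parser.add_section("logging")
    for key, value in DEBUG_LOGGING.items():
        parser.set("logging", key, value)
    out = io.StringIO()
    # Write key=value without spaces, matching the form shown above.
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


if __name__ == "__main__":
    print(render_tools_conf(), end="")
```

Writing the result to /etc/vmware-tools/tools.conf (as root) and then waiting a few seconds before taking the snapshot follows the procedure described above.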
Hi, thanks for the follow-up. Much appreciated. We spent days and hours logging stuff like this trying to determine the issue. We tried all versions of tools, older native and latest open. When it was only on the original Centos 6 VM we just resigned ourselves to the fact that it was the kernel thaw/freeze bug which is highlighted by VMware and Red Hat. We later had 2 VMs that were OK, so it wasn't until last year, when it reoccurred randomly, that it came back to light. At the time I did the logging etc. and couldn't determine the cause. I'm unable to do it again as they are all now production VMs, but if I get a few hours of time we'll set up a new VM and run some tests with the logs on.
To follow up a year later on this, if anyone reads it: it doesn't matter whether quiescing works or not, your VMs may be corrupt. We only discovered this recently. VMs which had been replicated with VMTools quiescing on also displayed file consistency errors when they were booted for testing. The file system needed repairing with fsck or xfs_repair before they could be used. We have looked into this extensively and the general consensus with our backup providers is that we are unsure what vmtools quiescing is actually doing, if anything, but it's certainly not creating a file-consistent replica in our case. The story continues...
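If you want to check a replica before repairing anything, a read-only pass is a safer first step. The sketch below only builds the command line. It assumes the replica's disk is attached to a test machine and unmounted, and the device path is a placeholder. `xfs_repair -n` and `e2fsck -f -n` both report problems without modifying the file system.

```python
import shlex

# Read-only ("no modify") checkers per file system type.
READ_ONLY_CHECKS = {
    "xfs": ["xfs_repair", "-n"],
    "ext4": ["e2fsck", "-f", "-n"],
}


def read_only_check_command(fs_type: str, device: str) -> list[str]:
    """Build a non-destructive check command for an unmounted device.

    Hypothetical helper for illustration; it does not run anything.
    """
    try:
        base = READ_ONLY_CHECKS[fs_type.lower()]
    except KeyError:
        raise ValueError(f"no read-only check known for {fs_type!r}") from None
    return [*base, device]


if __name__ == "__main__":
    # /dev/sdb1 is a placeholder for the replica's data partition.
    print(shlex.join(read_only_check_command("xfs", "/dev/sdb1")))
```

A non-zero exit status from either tool suggests the replica would need repair before use, which matches the symptom described above.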
We expect the same version of open-vm-tools and VMware Tools to behave almost identically. Please share the version details of open-vm-tools and VMware Tools.
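To make version reports consistent, something like the following Python sketch could be run in each guest. It assumes `vmtoolsd -v` prints a line of the form "VMware Tools daemon, version 10.3.0.5330 (build-8931395)", as one report in this thread shows. The function names are made up for illustration.

```python
import re
import subprocess

# Matches e.g. "VMware Tools daemon, version 10.3.0.5330 (build-8931395)".
VERSION_RE = re.compile(r"version\s+(?P<version>[\d.]+)\s+\(build-(?P<build>\d+)\)")


def parse_vmtoolsd_version(text: str):
    """Return (version, build) parsed from `vmtoolsd -v` output, or None."""
    match = VERSION_RE.search(text)
    if match is None:
        return None
    return match.group("version"), int(match.group("build"))


def installed_tools_version():
    """Run `vmtoolsd -v` in the guest; None if it is missing or unparsable."""
    try:
        result = subprocess.run(
            ["vmtoolsd", "-v"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return None
    return parse_vmtoolsd_version(result.stdout)


if __name__ == "__main__":
    print(installed_tools_version())
```

The output does not say whether the package is open-vm-tools or the bundled VMware Tools, so that still needs to be stated alongside the version.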
The way quiescing works is this: the host asks VMTools to quiesce the guest file system, and VMTools invokes the FIFREEZE ioctl to flush and hold all filesystem I/O. The design is the same in open-vm-tools and VMware Tools. If a filesystem or guest driver does not implement the FIFREEZE ioctl correctly, that might explain the behavior.
Could you please also share the details of what type of disk is attached to the VM? Could you power off the VM, change the controller type to "VMware Paravirtual" in the VM settings, and see if that makes any difference?
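To illustrate the freeze/thaw mechanism described above, here is a minimal Python sketch of the same FIFREEZE / FITHAW ioctl pair. It is not the vmtoolsd implementation. It assumes Linux with the common ioctl number layout (as on x86 and ARM), needs root, and the `frozen` helper name is made up. Do not try it on a production file system: if the thaw never runs, every write to that file system will hang.

```python
import contextlib
import fcntl
import os

# ioctl number encoding for the common asm-generic layout (an assumption:
# some architectures such as PowerPC and MIPS use a different layout).
_IOC_READ_WRITE = 3


def _iowr(type_char: str, nr: int, size: int) -> int:
    return (_IOC_READ_WRITE << 30) | (size << 16) | (ord(type_char) << 8) | nr


# From linux/fs.h: _IOWR('X', 119, int) and _IOWR('X', 120, int).
FIFREEZE = _iowr("X", 119, 4)
FITHAW = _iowr("X", 120, 4)


@contextlib.contextmanager
def frozen(mount_point: str):
    """Freeze the file system at mount_point for the duration of the block."""
    fd = os.open(mount_point, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, FIFREEZE, 0)  # flush dirty data, then block new writes
        try:
            yield
        finally:
            fcntl.ioctl(fd, FITHAW, 0)  # always thaw, even on error
    finally:
        os.close(fd)


if __name__ == "__main__":
    print(hex(FIFREEZE), hex(FITHAW))
```

The `fsfreeze` utility from util-linux issues the same two ioctls, so on a disposable test VM `fsfreeze -f` followed by `fsfreeze -u` on the mount point is a quick way to see whether the guest kernel and file system handle a freeze at all, independently of VMware Tools.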
PS: If possible please share the logs I asked for in my previous update. That is needed for analyzing the quiescing error you were getting earlier.
Hi there, appreciate the follow-up. Since posting, something has come to light. Our replication software also takes "backups" which are stored in a repository. Restores from these backups appear to restore without corruption. We are testing whether this is the case with all VMs, quiescing or not. This sheds a slightly different light, as I believe the backups and replication both use the same CBT and VMTools snapshots to take the incremental updates. The only difference is that the backups are taken at 1am, a less busy period, so this could point back to quiescing. That said, I still find it surprising that a replica taken on demand in the night still had corruption. I have raised this question with the backup software provider for their feedback. (They previously passed the blame to VMware Tools or prefreeze etc.)
I have read something about the Paravirtual controller, which now seems to be the default in new VMs. I trust switching to this does not affect the production VM when it starts up again. If so, we are happy to try this and run new replicas.
The disk is currently SCSI controller 0 - LSI Logic Parallel. The storage on the host is raid 10 SSD and the replica host is Raid 10 SATA.
I can look into getting the logs for you when we run further tests, thanks again for your help.
Same issue here. Though I'm running cloudlinux7. Worked fine on CL6 with no paravirtual drivers.
VMware Tools daemon, version 10.3.0.5330 (build-8931395)
Vmware hypervisor is 6.7xxx
I tried changing the drives to LSI and it wouldn't boot. I guess there is a trick to it but ideally this shouldn't happen for a snapshot that shouldn't take very long.
In case anyone reads this, I’d like to add that the corruption we had disappeared after changing our storage server. Therefore the quiescing issue is not related to the corruption.
Out of several VMs we now have only 1 with quiescing still turned on. It’s off on all the others, otherwise they freeze during snapshot. This happens with native or open tools regardless of version. I’ve not tested in the last year so can’t confirm whether the issue has disappeared. I’m sure it’s related to the kernel freeze/thaw.
However, Linux has good journaling and crash recovery, so now that we’ve moved storage and the snapshots are not corrupt, they seem fine with quiescing off.
Hi, Centos 7.2 - ext4 LVM Latest Cpanel ESXi VM 4GB ram
Installed Centos 7, installed open-vm-tools, installed cpanel, tested snapshot in vsphere client with quiescing, performed ok. Then shortly afterwards, tried snapshot again and got "Error while quiescing snapshot, etc, error 3".
Spent several hours looking into it trying to work out what triggered the error. I then reverted back to the replica of the VM when it was working ok and that also had the error.
As a last resort I uninstalled open-vm-tools and installed the standard vmware tools, the quiescing then worked flawlessly.
The only difference I can see is that open tools is version 9.x and the standard tools is 10.x in ESXi 6u2.