Open Tontonitch opened 6 years ago
This means that vmtoolsd is not being run as frequently as it expects. Usually this means your system is under-resourced for the load it's running. This could be either a guest level issue or a host level issue (or sometimes both). It sounds like you’re seeing it on multiple guests. Are all these guests on the same host?
Generally, you’ll want to check the load on the guest(s) and/or the host(s) to see whether they are CPU bound or i/o bound, and adjust resources accordingly (or adjust the workload if that’s an option).
Looking at the time stamps on the log messages, it seems like this is generally occurring on the hour. By any chance does this correspond to cron jobs being initiated on each system?
If so and if multiple jobs are being started at the same time, you may want to explore a staggered start time.
Looking at the time stamps on the log messages, it seems like this is generally occurring on the hour.
We have hourly snapshots taken for all our vm, from vcenter. So sometime during that operation it generates that warning.
Usually this means your system is under-resourced for the load it's running.
Not the vm guests for sure, as it happen also on servers with low load. Not on ESX side, as reported by our VMware management team.
I did some investigations and it seems that this warning started to be reported after the update of the open-vm-tools from 10.0.5.52125 (build-3227872) to 10.1.5.59732 (build-5055683).
See the filtered vmware-vmsvc.log files attached, taken from various servers.
@Tontonitch
Re:
I did some investigations and it seems that this warning started to be reported after the update of the open-vm-tools from 10.0.5.52125 (build-3227872) to 10.1.5.59732 (build-5055683).
This warning message was added between those two releases, so most likely the condition that is being warned of predated your upgrade to 10.1.5.
How many VMs do you see the problem on? Are they all RHEL 7.5? Are they all running on the same ESX server, or are there multiple servers involved? What version(s) of ESX are you running?
Are multiple VMs running snapshots at the same time, or are you staggering them through the course of an hour in order to minimize any overlap in time among the snapshot operations on the VMs within a given server?
Can you send me a complete vmtools log and vmware.log for one VM where you see the problem? Since they may contain sensitive informatoin, don’t post them here - you can send them to my e-mail address: jsaks@vmware.com
Before collecting the logs, please enable debug logging for vmtools by adding the following to /etc/vmware-tools/tools.conf:
[logging] log = true
vmtoolsd.level = debug vmtoolsd.handler = file >vmtoolsd.data = /var/log/vmtoolsd.log
vmsvc.level = debug vmsvc.handler = file vmsvc.data = /var/log/vmsvc.log
Please send me both /var/log/vmtoolsd.log and /var/log/vmsvc.log along with the vmware.log file. Thanks.
I suspect the quiescing operation is taking longer than expected. Which is leading to slower collection of guestInfo because we don't subtract the quiescing time from the warning message trigger.
Could we confirm how much time guest is being quiesced for when we see this warning? We can find the quiescing time from the vmware.log file from the VM datastore. It is the time gap between following log messages:
TIMESTAMP1| vmx| I125: ToolsBackup: changing quiesce state: STARTED -> COMMITTED ... TIMESTAMP2| vcpu-5| I125: ToolsBackup: changing quiesce state: COMMITTED -> FINISHING
Quiescing time=TIMESTAMP2 - TIMESTAMP1
Hello,
Taking one server with low activity, infrasrv1-d:
[root@infrasrv1-d ~]# tail /var/log/vmware-vmsvc.log
[Oct 21 13:17:30.465] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=60 sec, expected=30 sec. ***
[Oct 21 14:17:30.461] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=60 sec, expected=30 sec. ***
[Oct 21 19:17:30.220] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=60 sec, expected=30 sec. ***
[Oct 21 20:17:30.304] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=60 sec, expected=30 sec. ***
[Oct 22 04:17:30.400] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=60 sec, expected=30 sec. ***
[Oct 22 05:17:30.356] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=60 sec, expected=30 sec. ***
[Oct 22 06:17:30.156] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=60 sec, expected=30 sec. ***
[Oct 22 07:17:30.429] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=60 sec, expected=30 sec. ***
[Oct 22 11:17:30.344] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=60 sec, expected=30 sec. ***
[Oct 22 14:17:30.498] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=60 sec, expected=30 sec. ***
[root@infrasrv1-d ~]#
According to the log, there was an issue at 11:17, but no issue at 10:17 or 12:17.
checking the vmware server related log (I copied and transferred it to infrasrv1-d to easily grep on it):
[root@infrasrv1-d tmp]# cat infrasrv1-d.log | grep ToolsBackup | sed 's/DONE/DONE\n-------------------------------------/g' | tail -n 19
-------------------------------------
2018-10-22T10:17:01.826Z| vcpu-0| I125: ToolsBackup: changing quiesce state: IDLE -> STARTED
2018-10-22T10:17:03.829Z| vmx| I125: ToolsBackup: changing quiesce state: STARTED -> COMMITTED
2018-10-22T10:17:04.365Z| vcpu-0| I125: ToolsBackup: changing quiesce state: COMMITTED -> FINISHING
2018-10-22T10:17:06.366Z| vmx| I125: ToolsBackup: changing quiesce state: FINISHING -> IDLE
2018-10-22T10:17:06.366Z| vmx| I125: ToolsBackup: changing quiesce state: IDLE -> DONE
-------------------------------------
2018-10-22T11:16:56.958Z| vcpu-0| I125: ToolsBackup: changing quiesce state: IDLE -> STARTED
2018-10-22T11:16:58.962Z| vmx| I125: ToolsBackup: changing quiesce state: STARTED -> COMMITTED
2018-10-22T11:16:59.603Z| vcpu-0| I125: ToolsBackup: changing quiesce state: COMMITTED -> FINISHING
2018-10-22T11:17:01.604Z| vmx| I125: ToolsBackup: changing quiesce state: FINISHING -> IDLE
2018-10-22T11:17:01.604Z| vmx| I125: ToolsBackup: changing quiesce state: IDLE -> DONE
-------------------------------------
2018-10-22T12:16:57.008Z| vcpu-0| I125: ToolsBackup: changing quiesce state: IDLE -> STARTED
2018-10-22T12:16:59.011Z| vmx| I125: ToolsBackup: changing quiesce state: STARTED -> COMMITTED
2018-10-22T12:16:59.514Z| vcpu-0| I125: ToolsBackup: changing quiesce state: COMMITTED -> FINISHING
2018-10-22T12:17:01.516Z| vmx| I125: ToolsBackup: changing quiesce state: FINISHING -> IDLE
2018-10-22T12:17:01.516Z| vmx| I125: ToolsBackup: changing quiesce state: IDLE -> DONE
-------------------------------------
[root@infrasrv1-d tmp]#
I don't see any quiescing duration issue there.
BR, Yannick
@Tontonitch Thanks for the logs. Yes, all operation you have capture at around 11th hour finished well within 5-6 seconds. There seems to be some other reason for the slowness. We need to see what other operation is being performed on the VM every hour. So, please follow up with the logs Jonathan asked for in his last update.
I'm also curious if you see this issue on any VMs that are running version 10.2.0 or later? I'm asking this because there was a fix made in 10.2.0 that's missing in 10.1.5. The fix mainly affects the memory snapshot, suspend/resume, and vMotion cases.
I'm also curious if you see this issue on any VMs that are running version 10.2.0 or later? I'm asking this because there was a fix made in 10.2.0 that's missing in 10.1.5. The fix mainly affects the memory snapshot, suspend/resume, and vMotion cases.
As the lastest version available on the RedHat repositories is 10.1.10, I don't have any server with a newer version.
So, please follow up with the logs Jonathan asked for in his last update.
I sent it to him. But the log sent was for a particular snpshoting issue. I would like to send you what corresponds to a low activity server (infrasrv1-d) where snapshots can be taken but where we face also that warning "WARNING: GuestInfo collection interval longer than expected". Can I send it to your email address?
@Tontonitch
Re:
So, please follow up with the logs Jonathan asked for in his last update.
I sent it to him. But the log sent was for a particular snpshoting issue.
My understanding is that you sent me the log for the other issue you filed here, #297. However, in my last post above I asked some questions related to this issue and also requested logs. I haven't seen a response to that.
If I misunderstood, please clarify. Thanks.
However, in my last post above I asked some questions related to this issue and also requested logs. I haven't seen a response to that.
Indeed, I missed the message. I will collect that and send you the result.
Hello,
Troubleshooting ponctual vmware snapshoting failures of our RedHat virtual servers, we noticed the following error messages on all our RHEL (7.5) servers:
The package is up-to-date (with what RedHat provides):
How can I solve that?
Best regards