vmware / open-vm-tools

Official repository of VMware open-vm-tools project
http://sourceforge.net/projects/open-vm-tools/
2.28k stars 428 forks source link

Reccurring error message with open-vm-tools : WARNING: GuestInfo collection interval longer than expected #296

Open Tontonitch opened 6 years ago

Tontonitch commented 6 years ago

Hello,

Troubleshooting ponctual vmware snapshoting failures of our RedHat virtual servers, we noticed the following error messages on all our RHEL (7.5) servers:

[root@rhsat6srv1 ~]# tail /var/log/vmware-vmsvc.log
[Oct 15 19:01:53.760] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=69 sec, expected=30 sec. ***
[Oct 15 21:01:54.213] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=64 sec, expected=30 sec. ***
[Oct 15 21:50:04.229] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=70 sec, expected=30 sec. ***
[Oct 15 22:01:51.225] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=72 sec, expected=30 sec. ***
[Oct 15 23:31:48.972] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=64 sec, expected=30 sec. ***
[Oct 16 02:02:01.945] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=72 sec, expected=30 sec. ***
[Oct 16 06:01:59.500] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=71 sec, expected=30 sec. ***
[Oct 16 09:01:54.809] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=63 sec, expected=30 sec. ***
[Oct 16 10:01:44.569] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=72 sec, expected=30 sec. ***
[Oct 16 11:01:48.010] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=64 sec, expected=30 sec. ***
[root@rhsat6srv1 ~]#

The package is up-to-date (with what RedHat provides):

[root@rhsat6srv1 ~]# yum info open-vm-tools.x86_64
Loaded plugins: priorities, product-id, search-disabled-repos, subscription-manager
Installed Packages
Name        : open-vm-tools
Arch        : x86_64
Version     : 10.1.10
Release     : 3.el7_5.1
Size        : 2.1 M
Repo        : installed
From repo   : rhel-7-server-rpms
Summary     : Open Virtual Machine Tools for virtual machines hosted on VMware
URL         : https://github.com/vmware/open-vm-tools
License     : GPLv2
Description : The open-vm-tools project is an open source implementation of VMware Tools. It
            : is a suite of open source virtualization utilities and drivers to improve the
            : functionality, user experience and administration of VMware virtual machines.
            : This package contains only the core user-space programs and libraries of
            : open-vm-tools.

[root@rhsat6srv1 ~]#
[root@rhsat6srv1 ~]# yum update open-vm-tools.x86_64
Loaded plugins: priorities, product-id, search-disabled-repos, subscription-manager
No packages marked for update
[root@rhsat6srv1 ~]#

How can I solve that?

Best regards

jonathanvmw commented 6 years ago

This means that vmtoolsd is not being run as frequently as it expects. Usually this means your system is under-resourced for the load it's running. This could be either a guest level issue or a host level issue (or sometimes both). It sounds like you’re seeing it on multiple guests. Are all these guests on the same host?

Generally, you’ll want to check the load on the guest(s) and/or the host(s) to see whether they are CPU bound or i/o bound, and adjust resources accordingly (or adjust the workload if that’s an option).

johnwvmw commented 6 years ago

Looking at the time stamps on the log messages, it seems like this is generally occurring on the hour. By any chance does this correspond to cron jobs being initiated on each system?

If so and if multiple jobs are being started at the same time, you may want to explore a staggered start time.

Tontonitch commented 6 years ago

Looking at the time stamps on the log messages, it seems like this is generally occurring on the hour.

We have hourly snapshots taken for all our vm, from vcenter. So sometime during that operation it generates that warning.

Usually this means your system is under-resourced for the load it's running.

Not the vm guests for sure, as it happen also on servers with low load. Not on ESX side, as reported by our VMware management team.

I did some investigations and it seems that this warning started to be reported after the update of the open-vm-tools from 10.0.5.52125 (build-3227872) to 10.1.5.59732 (build-5055683).

See the filtered vmware-vmsvc.log files attached, taken from various servers.

Tontonitch commented 6 years ago

server1_vmware-vmsvc_filtered.log server4_vmware-vmsvc_filtered.log server3_vmware-vmsvc_filtered.log server2_vmware-vmsvc_filtered.log

jonathanvmw commented 6 years ago

@Tontonitch

Re:

I did some investigations and it seems that this warning started to be reported after the update of the open-vm-tools from 10.0.5.52125 (build-3227872) to 10.1.5.59732 (build-5055683).

This warning message was added between those two releases, so most likely the condition that is being warned of predated your upgrade to 10.1.5.

How many VMs do you see the problem on? Are they all RHEL 7.5? Are they all running on the same ESX server, or are there multiple servers involved? What version(s) of ESX are you running?

Are multiple VMs running snapshots at the same time, or are you staggering them through the course of an hour in order to minimize any overlap in time among the snapshot operations on the VMs within a given server?

Can you send me a complete vmtools log and vmware.log for one VM where you see the problem? Since they may contain sensitive informatoin, don’t post them here - you can send them to my e-mail address: jsaks@vmware.com

Before collecting the logs, please enable debug logging for vmtools by adding the following to /etc/vmware-tools/tools.conf:

[logging] log = true

vmtoolsd.level = debug
 vmtoolsd.handler = file 
>vmtoolsd.data = /var/log/vmtoolsd.log

vmsvc.level = debug
 vmsvc.handler = file
 vmsvc.data = /var/log/vmsvc.log

Please send me both /var/log/vmtoolsd.log and /var/log/vmsvc.log along with the vmware.log file. Thanks.

ravindravmw commented 6 years ago

I suspect the quiescing operation is taking longer than expected. Which is leading to slower collection of guestInfo because we don't subtract the quiescing time from the warning message trigger.

Could we confirm how much time guest is being quiesced for when we see this warning? We can find the quiescing time from the vmware.log file from the VM datastore. It is the time gap between following log messages:

TIMESTAMP1| vmx| I125: ToolsBackup: changing quiesce state: STARTED -> COMMITTED ... TIMESTAMP2| vcpu-5| I125: ToolsBackup: changing quiesce state: COMMITTED -> FINISHING

Quiescing time=TIMESTAMP2 - TIMESTAMP1

Tontonitch commented 6 years ago

Hello,

Taking one server with low activity, infrasrv1-d:

[root@infrasrv1-d ~]# tail /var/log/vmware-vmsvc.log
[Oct 21 13:17:30.465] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=60 sec, expected=30 sec. ***
[Oct 21 14:17:30.461] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=60 sec, expected=30 sec. ***
[Oct 21 19:17:30.220] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=60 sec, expected=30 sec. ***
[Oct 21 20:17:30.304] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=60 sec, expected=30 sec. ***
[Oct 22 04:17:30.400] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=60 sec, expected=30 sec. ***
[Oct 22 05:17:30.356] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=60 sec, expected=30 sec. ***
[Oct 22 06:17:30.156] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=60 sec, expected=30 sec. ***
[Oct 22 07:17:30.429] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=60 sec, expected=30 sec. ***
[Oct 22 11:17:30.344] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=60 sec, expected=30 sec. ***
[Oct 22 14:17:30.498] [ warning] [guestinfo] *** WARNING: GuestInfo collection interval longer than expected; actual=60 sec, expected=30 sec. ***
[root@infrasrv1-d ~]#

According to the log, there was an issue at 11:17, but no issue at 10:17 or 12:17.

checking the vmware server related log (I copied and transferred it to infrasrv1-d to easily grep on it):

[root@infrasrv1-d tmp]# cat infrasrv1-d.log | grep ToolsBackup | sed 's/DONE/DONE\n-------------------------------------/g' | tail -n 19
-------------------------------------
2018-10-22T10:17:01.826Z| vcpu-0| I125: ToolsBackup: changing quiesce state: IDLE -> STARTED
2018-10-22T10:17:03.829Z| vmx| I125: ToolsBackup: changing quiesce state: STARTED -> COMMITTED
2018-10-22T10:17:04.365Z| vcpu-0| I125: ToolsBackup: changing quiesce state: COMMITTED -> FINISHING
2018-10-22T10:17:06.366Z| vmx| I125: ToolsBackup: changing quiesce state: FINISHING -> IDLE
2018-10-22T10:17:06.366Z| vmx| I125: ToolsBackup: changing quiesce state: IDLE -> DONE
-------------------------------------
2018-10-22T11:16:56.958Z| vcpu-0| I125: ToolsBackup: changing quiesce state: IDLE -> STARTED
2018-10-22T11:16:58.962Z| vmx| I125: ToolsBackup: changing quiesce state: STARTED -> COMMITTED
2018-10-22T11:16:59.603Z| vcpu-0| I125: ToolsBackup: changing quiesce state: COMMITTED -> FINISHING
2018-10-22T11:17:01.604Z| vmx| I125: ToolsBackup: changing quiesce state: FINISHING -> IDLE
2018-10-22T11:17:01.604Z| vmx| I125: ToolsBackup: changing quiesce state: IDLE -> DONE
-------------------------------------
2018-10-22T12:16:57.008Z| vcpu-0| I125: ToolsBackup: changing quiesce state: IDLE -> STARTED
2018-10-22T12:16:59.011Z| vmx| I125: ToolsBackup: changing quiesce state: STARTED -> COMMITTED
2018-10-22T12:16:59.514Z| vcpu-0| I125: ToolsBackup: changing quiesce state: COMMITTED -> FINISHING
2018-10-22T12:17:01.516Z| vmx| I125: ToolsBackup: changing quiesce state: FINISHING -> IDLE
2018-10-22T12:17:01.516Z| vmx| I125: ToolsBackup: changing quiesce state: IDLE -> DONE
-------------------------------------
[root@infrasrv1-d tmp]#

I don't see any quiescing duration issue there.

BR, Yannick

ravindravmw commented 6 years ago

@Tontonitch Thanks for the logs. Yes, all operation you have capture at around 11th hour finished well within 5-6 seconds. There seems to be some other reason for the slowness. We need to see what other operation is being performed on the VM every hour. So, please follow up with the logs Jonathan asked for in his last update.

I'm also curious if you see this issue on any VMs that are running version 10.2.0 or later? I'm asking this because there was a fix made in 10.2.0 that's missing in 10.1.5. The fix mainly affects the memory snapshot, suspend/resume, and vMotion cases.

Tontonitch commented 6 years ago

I'm also curious if you see this issue on any VMs that are running version 10.2.0 or later? I'm asking this because there was a fix made in 10.2.0 that's missing in 10.1.5. The fix mainly affects the memory snapshot, suspend/resume, and vMotion cases.

As the lastest version available on the RedHat repositories is 10.1.10, I don't have any server with a newer version.

So, please follow up with the logs Jonathan asked for in his last update.

I sent it to him. But the log sent was for a particular snpshoting issue. I would like to send you what corresponds to a low activity server (infrasrv1-d) where snapshots can be taken but where we face also that warning "WARNING: GuestInfo collection interval longer than expected". Can I send it to your email address?

jonathanvmw commented 6 years ago

@Tontonitch

Re:

So, please follow up with the logs Jonathan asked for in his last update.

I sent it to him. But the log sent was for a particular snpshoting issue.

My understanding is that you sent me the log for the other issue you filed here, #297. However, in my last post above I asked some questions related to this issue and also requested logs. I haven't seen a response to that.

If I misunderstood, please clarify. Thanks.

Tontonitch commented 6 years ago

However, in my last post above I asked some questions related to this issue and also requested logs. I haven't seen a response to that.

Indeed, I missed the message. I will collect that and send you the result.