virgo-agent-toolkit / rackspace-monitoring-agent

Rackspace Cloud Monitoring Agent
http://virgoagent.com/rackspace-monitoring-agent/
Apache License 2.0
118 stars 39 forks source link

Monitoring Agent Service constantly grows in Memory #984

Open domibay-hugo opened 4 years ago

domibay-hugo commented 4 years ago

When the Monitoring Agent just was started it starts off with 8.4 MB Consumption of Memory

# systemctl status rackspace-monitoring-agent -l
● rackspace-monitoring-agent.service
   Loaded: loaded (/etc/systemd/system/rackspace-monitoring-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since vie 2020-09-04 14:15:37 WEST; 6s ago
 Main PID: 31732 (rackspace-monit)
   CGroup: /system.slice/rackspace-monitoring-agent.service
           └─31732 /usr/bin/rackspace-monitoring-agent -l /var/log/rackspace-monitoring-agent.log --production --exit-on-upgrade
# ps -o pid,rss,etime,start,cmd -p 31732|more
  PID   RSS     ELAPSED  STARTED CMD
31732  8668       00:28 14:15:37 /usr/bin/rackspace-monitoring-agent -l /var/log/rackspace-monitorin
g-agent.log --production --exit-on-upgrade
# echo "scale=3; 8668/1024"|bc -l
8.464

But as the time goes by it grows up to 10 times of its Size to 92 MB and more:

# systemctl status rackspace-monitoring-agent -l
● rackspace-monitoring-agent.service
   Loaded: loaded (/etc/systemd/system/rackspace-monitoring-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since mié 2019-11-13 04:56:16 WET; 9 months 22 days ago
 Main PID: 5298 (rackspace-monit)
   CGroup: /system.slice/rackspace-monitoring-agent.service
           └─5298 /usr/bin/rackspace-monitoring-agent -l /var/log/rackspace-monitoring-agent.log --production --exit-on-upgrade
# ps -o pid,rss,etime,start,cmd -p 5298|more
  PID   RSS     ELAPSED  STARTED CMD
 5298 95216 296-08:18:13   Nov 13 /usr/bin/rackspace-monitoring-agent -l /var/log/rackspace-monitoring-agent.log --production --exit-on-upgrade
# echo "scale=3; 95216/1024"|bc -l
92.984

Obviously it is not freeing the used Memory correctly

itzg commented 4 years ago

Even 92 MB seems like a reasonable utilization. Have you observed the agent continuing to use an ever increasing amount of memory? How many checks are you running? What kinds of checks?

domibay-hugo commented 4 years ago

Yes, the Agent continues to grow. If it was just because of the Checks it would instantly grow to the necessary size and stay like this for its life time. But It continues growing and I'm forced to restart it to free the used Memory for the System it is running on: On this host it has already grown to more than 152 MB

# systemctl status rackspace-monitoring-agent -l
● rackspace-monitoring-agent.service
   Loaded: loaded (/etc/systemd/system/rackspace-monitoring-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since mar 2019-11-05 20:31:33 WET; 9 months 29 days ago
 Main PID: 3875 (rackspace-monit)
   CGroup: /system.slice/rackspace-monitoring-agent.service
           └─3875 /usr/bin/rackspace-monitoring-agent -l /var/log/rackspace-monitoring-agent.log --production --exit-on-upgrade

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
# ps -o pid,rss,etime,start,cmd -p 3875|more
  PID   RSS     ELAPSED  STARTED CMD
 3875 156484 303-17:37:08   Nov 05 /usr/bin/rackspace-monitoring-agent -l /var/log/rackspace-monitoring-agent.log --production --exit-on-upgrade
# echo "scale=3; 156484/1024"|bc -l
152.816

We just have the 5 Basic Checks enabled:

# cat /var/log/rackspace-monitoring-agent.log|grep -ioE "Check \([^\)]+\)"|sort|uniq
Check (agent.filesystem)
Check (agent.filesystem_state)
Check (agent.load_average)
Check (agent.memory)
Check (agent.network)
itzg commented 4 years ago

What version of the agent are you running? Speaking of the logs, do you see any errors or unusual messages logged there?

Since we can't recreate this issue on our system, can you try disabling the agent.filesystem_state check? That one is relatively new and it would interesting to narrow down if a particular check is causing memory growth.

domibay-hugo commented 4 years ago

I'm observing this issue for some years now but it was never corrected so I feel that I should document it now:

# rpm -qi rackspace-monitoring-agent
Name        : rackspace-monitoring-agent
Version     : 2.6.19
Release     : 1
Architecture: x86_64
Install Date: mar 18 dic 2018 11:08:22 WET
Group       : unknown
Size        : 5020341
License     : unknown
Signature   : RSA/SHA256, mar 16 oct 2018 04:16:58 WEST, Key ID a086f077d05ab914
Source RPM  : rackspace-monitoring-agent-2.6.19-1.src.rpm
Build Date  : mar 16 oct 2018 04:16:57 WEST
Build Host  : centos-7-rebuild
Relocations : /usr 
Vendor      : Rackspace
Summary     : Rackspace Monitoring Agent
Description :
DESCRIPTION
===========

This is an installer created using CPack (http://www.cmake.org). No additional installation instructions provided.

I always trying to update the Agent in reasonable intervals but none of the updates has inproved the situation.

itzg commented 4 years ago

2.6.23 is available, but I wouldn't expect that to make much difference since there seems to be something unique to your environment.

https://stable.packages.cloudmonitoring.rackspace.com/centos-7-x86_64/rackspace-monitoring-agent-2.6.23-amd64.rpm

Any findings with disabling the agent.filesystem_state check? Or strange log messages?

domibay-hugo commented 4 years ago

In the Server Cloud Panel I see that it has now an connectivity issue:

SSH Keyname: None
Disk Configuration: Manual
Monitoring Agent: Not installed, host checks unavailableHow To Install Agent
Region: London (LON)
Created Date: Nov 16, 2017 - 12:07 PM GMT
Last Updated: Jan 15, 2020 - 11:18 PM GMT

The Logs show and a reconnecting cycle:

Thu Aug 13 05:22:36 PM 2020 INF: 2001:4800:7902:1:0:a:4323:4b:443 (hostname=agent-endpoint-dfw.monitoring.api.rackspacecloud.com connID=54) -> Connection has been authenticated to SRV:_monitoringagent._tcp.dfw1.prod.monitoring.api.rackspacecloud.com
Fri Aug 14 05:46:20 PM 2020 INF: SRV:_monitoringagent._tcp.ord1.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 136108ms
Fri Aug 14 05:48:43 PM 2020 INF: SRV:_monitoringagent._tcp.ord1.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 121687ms
Fri Aug 14 05:50:54 PM 2020 INF: SRV:_monitoringagent._tcp.ord1.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 129288ms
Fri Aug 14 05:53:10 PM 2020 INF: SRV:_monitoringagent._tcp.ord1.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 128735ms
Fri Aug 14 05:55:26 PM 2020 INF: 2001:4801:7902:1:0:a:4323:51:443 (hostname=agent-endpoint-ord.monitoring.api.rackspacecloud.com connID=61) -> Connected
Fri Aug 14 05:55:26 PM 2020 INF: 2001:4801:7902:1:0:a:4323:51:443 (hostname=agent-endpoint-ord.monitoring.api.rackspacecloud.com connID=61) -> Connection has been authenticated to SRV:_monitoringagent._tcp.ord1.prod.monitoring.api.rackspacecloud.com
Thu Aug 20 08:44:39 PM 2020 INF: SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 125448ms
Thu Aug 20 08:46:51 PM 2020 INF: SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 132585ms
Thu Aug 20 08:49:10 PM 2020 INF: SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 123468ms
Thu Aug 20 08:51:14 PM 2020 INF: SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 128680ms
Thu Aug 20 08:53:23 PM 2020 INF: SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 136312ms
Thu Aug 20 08:55:43 PM 2020 INF: SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 124148ms
domibay-hugo commented 4 years ago

It seems as if the Server has suffered a Crash on "Nov 05 08:29:05 PM 2019" and since that I was not able to reconnect to the central server:

Tue Nov 05 08:29:05 PM 2019 INF: Check (agent.filesystem) -> agent.filesystem (details=target="/",id="chzzIn1aur",period=60) scheduled for 60s
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Tue Nov 05 08:31:33 PM 2019 INF: Rackspace Monitoring Agent (2.6.19)
Tue Nov 05 08:31:33 PM 2019 INF:   virgo 2.1.10
Tue Nov 05 08:31:33 PM 2019 INF:   luvi v2.7.6-2-sigar-2-g97b1230
Tue Nov 05 08:31:33 PM 2019 INF:   libuv 1.9.1
Tue Nov 05 08:31:33 PM 2019 INF:   OpenSSL 1.0.2h  3 May 2016
Tue Nov 05 08:31:33 PM 2019 INF: Using locale: en_US.UTF-8
Tue Nov 05 08:31:33 PM 2019 INF: Using config file: /etc/rackspace-monitoring-agent.cfg
Tue Nov 05 08:31:33 PM 2019 INF: Machine ID unobtainable, Could not retrieve xenstore name, ret: 1, buffer:
Tue Nov 05 08:31:33 PM 2019 INF: Using hostname as agent ID (id=<server_name>)
Tue Nov 05 08:31:33 PM 2019 INF: Starting agent <server_name> (guid=<id>, version=2.1.10, bundle_version=2.6.19)
Tue Nov 05 08:31:33 PM 2019 INF: Confd -> reading files in /etc/rackspace-monitoring-agent.conf.d
Tue Nov 05 08:31:33 PM 2019 INF: Upgrades are enabled
Tue Nov 05 08:31:36 PM 2019 INF: 2a00:1a48:7902:1:0:a:432:390:443 (hostname=agent-endpoint-lon.monitoring.api.rackspacecloud.com connID=1) -> Connected
Tue Nov 05 08:31:36 PM 2019 INF: 2a00:1a48:7902:1:0:a:432:390:443 (hostname=agent-endpoint-lon.monitoring.api.rackspacecloud.com connID=1) -> Connection has been authenticated to SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com
Tue Nov 05 08:31:36 PM 2019 ERR: Confd -> Not sending config files because agent is not bound to an entity
Tue Nov 05 08:31:37 PM 2019 INF: 2001:4801:7902:1:0:a:4323:51:443 (hostname=agent-endpoint-ord.monitoring.api.rackspacecloud.com connID=1) -> Connected
Tue Nov 05 08:31:37 PM 2019 INF: 2001:4801:7902:1:0:a:4323:51:443 (hostname=agent-endpoint-ord.monitoring.api.rackspacecloud.com connID=1) -> Connection has been authenticated to SRV:_monitoringagent._tcp.ord1.prod.monitoring.api.rackspacecloud.com
Tue Nov 05 08:31:37 PM 2019 INF: 2001:4800:7902:1:0:a:4323:4c:443 (hostname=agent-endpoint-dfw.monitoring.api.rackspacecloud.com connID=1) -> Connected
Tue Nov 05 08:31:37 PM 2019 INF: 2001:4800:7902:1:0:a:4323:4c:443 (hostname=agent-endpoint-dfw.monitoring.api.rackspacecloud.com connID=1) -> Connection has been authenticated to SRV:_monitoringagent._tcp.dfw1.prod.monitoring.api.rackspacecloud.com
Tue Nov 12 07:53:35 PM 2019 INF: SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 77264ms
Tue Nov 12 07:54:58 PM 2019 INF: SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 77799ms
Tue Nov 12 07:56:22 PM 2019 INF: SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 82300ms
Tue Nov 12 07:57:45 PM 2019 INF: 2a00:1a48:7902:1:0:a:603:611:443 (hostname=agent-endpoint-lon.monitoring.api.rackspacecloud.com connID=4) -> Connected
Tue Nov 12 07:57:45 PM 2019 INF: 2a00:1a48:7902:1:0:a:603:611:443 (hostname=agent-endpoint-lon.monitoring.api.rackspacecloud.com connID=4) -> Connection has been authenticated to SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com
Wed Nov 13 09:19:51 PM 2019 INF: SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 97969ms
oring.api.rackspacecloud.com connID=5) -> Connected
Wed Nov 13 09:21:30 PM 2019 INF: 2a00:1a48:7902:1:0:a:831:205:443 (hostname=agent-endpoint-lon.monitoring.api.rackspacecloud.com connID=5) -> Connection has been authenticated to SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com
Wed Dec 25 11:21:39 AM 2019 INF: SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 106422ms
Wed Dec 25 11:23:37 AM 2019 INF: SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 113595ms
Wed Dec 25 11:25:38 AM 2019 INF: 2a00:1a48:7902:1:0:a:432:387:443 (hostname=agent-endpoint-lon.monitoring.api.rackspacecloud.com connID=7) -> Connected
Wed Dec 25 11:25:38 AM 2019 INF: 2a00:1a48:7902:1:0:a:432:387:443 (hostname=agent-endpoint-lon.monitoring.api.rackspacecloud.com connID=7) -> Connection has been authenticated to SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com
Wed Dec 25 11:40:16 AM 2019 INF: SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 118620ms
Wed Dec 25 11:42:16 AM 2019 INF: 2a00:1a48:7902:1:0:a:432:388:443 (hostname=agent-endpoint-lon.monitoring.api.rackspacecloud.com connID=8) -> Connected
Wed Dec 25 11:42:16 AM 2019 INF: 2a00:1a48:7902:1:0:a:432:388:443 (hostname=agent-endpoint-lon.monitoring.api.rackspacecloud.com connID=8) -> Connection has been authenticated to SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com
Mon Mar 02 08:48:09 PM 2020 INF: SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 125352ms
Mon Mar 02 08:50:22 PM 2020 INF: SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 120141ms
Mon Mar 02 08:52:29 PM 2020 INF: SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 121631ms
Mon Mar 02 08:54:38 PM 2020 INF: SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 124482ms

It seems as it did not register any activity between Dec 2019 and Mar 2020 even though the Panel states that it informed until Jan 2020

itzg commented 4 years ago

The agent connects to three datacenters in order to remain highly available. Only the LON connection seems to be having issues, the DFW and ORD connections appear to be established at:

Tue Nov 05 08:31:37 PM 2019 INF: 2001:4801:7902:1:0:a:4323:51:443 (hostname=agent-endpoint-ord.monitoring.api.rackspacecloud.com connID=1) -> Connected
Tue Nov 05 08:31:37 PM 2019 INF: 2001:4801:7902:1:0:a:4323:51:443 (hostname=agent-endpoint-ord.monitoring.api.rackspacecloud.com connID=1) -> Connection has been authenticated to SRV:_monitoringagent._tcp.ord1.prod.monitoring.api.rackspacecloud.com
Tue Nov 05 08:31:37 PM 2019 INF: 2001:4800:7902:1:0:a:4323:4c:443 (hostname=agent-endpoint-dfw.monitoring.api.rackspacecloud.com connID=1) -> Connected
Tue Nov 05 08:31:37 PM 2019 INF: 2001:4800:7902:1:0:a:4323:4c:443 (hostname=agent-endpoint-dfw.monitoring.api.rackspacecloud.com connID=1) -> Connection has been authenticated to SRV:_monitoringagent._tcp.dfw1.prod.monitoring.api.rackspacecloud.com
domibay-hugo commented 4 years ago

But the other host I commented first was just running fine all the time and also did grow in memory. The oldest report conserved states from 2020-04-06:

Mon Apr 06 12:18:17 AM 2020 INF: Check (agent.filesystem_state) -> agent.filesystem_state (details=,id="chBE9OtoFG",period=60) scheduled for 60s
Mon Apr 06 12:18:22 AM 2020 INF: Check (agent.filesystem) -> agent.filesystem (details=target="/",id="chTE4iaBEo",period=60) scheduled for 60s
Mon Apr 06 12:18:25 AM 2020 INF: Check (agent.load_average) -> agent.load_average (details=,id="cheLlmcgLI",period=60) scheduled for 60s
Mon Apr 06 12:18:35 AM 2020 INF: Check (agent.memory) -> agent.memory (details=,id="chJpBmgq4y",period=60) scheduled for 60s
Mon Apr 06 12:18:37 AM 2020 INF: Check (agent.network) -> agent.network (details=target="eth0",id="chBD5dQSVp",period=60) scheduled for 60s

I took these 2 measurements from 2 different Servers. Both have grown in Memory. 1 was running fine and only the other had and connectivity issue.

domibay-hugo commented 4 years ago

I disabled the agent.filesystem_state Check in the Administration Panel for the Host that was informing correctly since 2020-04-06. I also updated the Agent on the 2nd host that had the crash and could not connect since then.

As an remark: the unconnectable agent was able to connect correctly after Service Restart

domibay-hugo commented 4 years ago

The agent connects to three datacenters in order to remain highly available.

If it was for high availability why isn't there any data in the Intelligence Panel if only 1 connection is broken?

obviously any broken connection is fatal to the agent.

itzg commented 4 years ago

Looking back at your logs, it appears the VM server identity couldn't get discovered like normal:

Tue Nov 05 08:31:33 PM 2019 INF: Machine ID unobtainable, Could not retrieve xenstore name, ret: 1, buffer:
Tue Nov 05 08:31:33 PM 2019 INF: Using hostname as agent ID (id=<server_name>)

As a workaround you can set monitoring_id (more info here) to the server ID, which will be a UUID value from the server panel.

However, it sounds like restarting the service may have resolved that issue also.