Closed mac133k closed 1 month ago
hi @mac133k - thanks for reporting this! It's a new one to us ...
| pmcd.log filling up with messages like this one:
Could you capture the contents of "sudo ls -l /proc/[pmcd PID]/fd" under these conditions and share? It'll be a lot of output, I imagine, so maybe post a summary of what the file descriptors look like, or copy the whole lot to a GitHub gist perhaps?
It takes about 2 days for pmcd.log to grow from 0 to 30GB.
Wow, OK. That log message is ~120 bytes. If we work off 2 days, that's 172800 seconds - at 30GB, if my back-of-the-napkin math is correct, you're getting this message logged ~1500 times per second ... ? I can't think of anything in PCP itself that might do that, so I'm really interested to see that fd listing - it feels like it must be a rogue pmcd client rapidly attempting new connections.
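That estimate is easy to sanity-check with shell arithmetic (taking 30GB as 30 * 2^30 bytes and ~120 bytes per message, as above):

```shell
# Rough message rate: log size / message size / elapsed seconds
bytes=$(( 30 * 1024 * 1024 * 1024 ))   # 30GB of pmcd.log
msglen=120                             # ~120 bytes per log message
elapsed=$(( 2 * 24 * 60 * 60 ))        # 2 days in seconds
rate=$(( bytes / msglen / elapsed ))
echo "~${rate} messages/second"       # ~1553 messages/second
```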
| Feb 18 23:47:21 node16 kernel: pmproxy[74409]: segfault at 0 ip 00007f84bbb04eb0 sp 00007f84b5e92b00 error 4 in libuv.so.1.0.0[7f84bbafa000+2c000]
If you have a core file somewhere (abrt?) it'd be really handy to get a symbolic stack trace from this.
I had another occurrence of this problem. This time there was no segfault, so that must have been unrelated; unfortunately a core dump was not available to look deeper into it. I did manage to capture more debug data for the main issue:
pcp 24331 1 80 Feb20 ? 1-23:05:32 /usr/libexec/pcp/bin/pmcd -A
root 24335 24331 5 Feb20 ? 03:22:57 /var/lib/pcp/pmdas/root/pmdaroot
root 24336 24335 5 Feb20 ? 03:02:23 /var/lib/pcp/pmdas/proc/pmdaproc -d 3
root 24338 24335 4 Feb20 ? 02:51:53 /var/lib/pcp/pmdas/xfs/pmdaxfs -d 11
root 24339 24335 4 Feb20 ? 02:37:18 /var/lib/pcp/pmdas/linux/pmdalinux
root 24340 24335 3 Feb20 ? 02:20:07 /var/lib/pcp/pmdas/kvm/pmdakvm -d 95
pcp 24341 1 0 Feb20 ? 00:12:28 /usr/libexec/pcp/bin/pmproxy -F -A
on_write_callback: ERROR uv_write failed
on_write_callback: ERROR uv_write failed
A few more details that I need to add:
So the URL is `http://< hostname >:44323/metrics?target=kernel%2Cmem%2Cswap%2Cnetwork%2Cfilesys%2Cdisk%2Chinv` and the number of metric instances is about 1.5k, which takes between 0.4 and 0.7 seconds to scrape.
PCP processes do not seem to be overwhelmed with scrape requests, judging by the CPU and memory utilisation per process. I suspect using the target param may somehow be causing the issue, because I remember having a problem some time ago with PCP v5.0 or 5.1 where scrapes were incomplete or timing out until I changed the Prometheus config and split the original job into multiple jobs, each with only one metric group as a target. I'm going to kill the Prometheus instance that has targets in the config and see if it makes any difference here.
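For reference, the two scrape styles described above differ only in the params block; a sketch of what that split might look like (job names and the host are made up for illustration, the port and target values are taken from the URL above):

```yaml
scrape_configs:
  # Combined style: one job fetching several metric groups per scrape
  # (the style suspected of triggering the issue).
  - job_name: pcp-all
    metrics_path: /metrics
    params:
      target: ['kernel,mem,swap,network,filesys,disk,hinv']
    static_configs:
      - targets: ['node16:44323']

  # Split style: one job per metric group.
  - job_name: pcp-kernel
    metrics_path: /metrics
    params:
      target: ['kernel']
    static_configs:
      - targets: ['node16:44323']
```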
| there were 1024 FDs held by pmcd:
Hmm, suspicious - seems like that maxfd setting of 1048576 hasn't taken effect? Might be worth checking it while pmcd is running via "cat /proc/[pmcd PID]/limits".
Indeed:
# cat /proc/83150/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 31128 31128 processes
Max open files 1024 4096 files
Max locked memory 65536 65536 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 31128 31128 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
However, 1024 is far more than a healthy pmcd process needs, so I would rather dig deeper to find the root cause than lift this limit wherever it is set.
Agreed, 1024 should be plenty. I wonder if we might be seeing a failure to cleanup timed-out connections between pmproxy and pmcd. You could experiment with larger values (well beyond your Prometheus scrape interval) of the polltimeout setting - in /etc/pcp/pmproxy/pmproxy.conf that would be:
[pmwebapi]
timeout = 120
(this is a value in seconds)
Querying metric endpoints without the target params helped to significantly reduce the occurrence of this issue, but did not eliminate it completely.
Just thought I'd chime in here: I'm having the same problem, using Prometheus, Red Hat 8.5, and PCP 5.3.5. pmcd runs fine for weeks, but at some point something makes it go nuts and it rapidly consumes file descriptors until it hits the limit, and then the log file fills up my disk partition very rapidly.
Kevin
-- Kevin Hildebrand University of Maryland
@ke3vin @mac133k could you do some analysis of the contents of the /proc/[pmcd PID]/fd directory when the issue occurs? I think that's got to be the first bit of information we need to diagnose further.
If someone can produce an exact recipe to induce the failure (ideally as fast as possible), please paste it here too. Thanks!
I haven't figured out how to duplicate it yet, but as I have it on about two dozen hosts, at least one of them seems to get in the situation almost daily. I've got one that way now. Lots of open sockets from localhost->localhost:
From lsof (just a small chunk of the thousand+ open sockets):
pmcd 2236 pcp 1009u IPv6 5670302 0t0 TCP localhost:pmcd->localhost:37096 (ESTABLISHED)
pmcd 2236 pcp 1010u IPv6 5670335 0t0 TCP localhost:pmcd->localhost:37150 (ESTABLISHED)
pmcd 2236 pcp 1011u IPv6 5670343 0t0 TCP localhost:pmcd->localhost:37204 (ESTABLISHED)
pmcd 2236 pcp 1012u IPv6 5671407 0t0 TCP localhost:pmcd->localhost:37258 (ESTABLISHED)
pmcd 2236 pcp 1013u IPv6 5670348 0t0 TCP localhost:pmcd->localhost:37312 (ESTABLISHED)
pmcd 2236 pcp 1014u IPv6 5670397 0t0 TCP localhost:pmcd->localhost:37366 (ESTABLISHED)
pmcd 2236 pcp 1015u IPv6 5673037 0t0 TCP localhost:pmcd->localhost:37420 (ESTABLISHED)
pmcd 2236 pcp 1016u IPv6 5669608 0t0 TCP localhost:pmcd->localhost:37474 (ESTABLISHED)
pmcd 2236 pcp 1017u IPv6 5675230 0t0 TCP localhost:pmcd->localhost:37528 (ESTABLISHED)
pmcd 2236 pcp 1018u IPv6 5675262 0t0 TCP localhost:pmcd->localhost:37582 (ESTABLISHED)
pmcd 2236 pcp 1019u IPv6 5669637 0t0 TCP localhost:pmcd->localhost:37636 (ESTABLISHED)
pmcd 2236 pcp 1020u IPv6 5676375 0t0 TCP localhost:pmcd->localhost:37690 (ESTABLISHED)
A similar snippet from /proc/2236/fd:
lrwx------. 1 root root 64 May 9 08:48 979 -> 'socket:[5652133]'
lrwx------. 1 root root 64 May 9 08:48 98 -> 'socket:[5245025]'
lrwx------. 1 root root 64 May 9 08:48 980 -> 'socket:[5645176]'
lrwx------. 1 root root 64 May 9 08:48 981 -> 'socket:[5655122]'
lrwx------. 1 root root 64 May 9 08:48 982 -> 'socket:[5652150]'
lrwx------. 1 root root 64 May 9 08:48 983 -> 'socket:[5645233]'
lrwx------. 1 root root 64 May 9 08:48 984 -> 'socket:[5655916]'
lrwx------. 1 root root 64 May 9 08:48 985 -> 'socket:[5655964]'
lrwx------. 1 root root 64 May 9 08:48 986 -> 'socket:[5645247]'
lrwx------. 1 root root 64 May 9 08:48 987 -> 'socket:[5661697]'
lrwx------. 1 root root 64 May 9 08:48 988 -> 'socket:[5656244]'
lrwx------. 1 root root 64 May 9 08:48 989 -> 'socket:[5661430]'
lrwx------. 1 root root 64 May 9 08:48 99 -> 'socket:[5243452]'
lrwx------. 1 root root 64 May 9 08:48 990 -> 'socket:[5661491]'
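A quick way to summarize a listing like this without pasting thousands of lines (a sketch: point pid at the affected pmcd; the current shell's PID is used here purely as a placeholder):

```shell
# Summarize what kinds of fds a process holds: resolve each symlink in
# /proc/<pid>/fd, strip the inode suffix (e.g. 'socket:[5652133]' -> 'socket'),
# and count the occurrences of each type.
pid=$$    # placeholder: substitute the affected pmcd PID
for fd in /proc/"$pid"/fd/*; do
    readlink "$fd"
done | sed 's/:\[[0-9]*\]$//' | sort | uniq -c | sort -rn
```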
I'll leave the broken process running for a bit, if you've got specific stuff you want me to look for, let me know.
Thanks, Kevin
Ah, at least on my host, it's 'pmproxy' that has the other end of those connections...
pmproxy 2312 pcp 1000u IPv6 5656573 0t0 TCP localhost:37042->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1001u IPv6 5669522 0t0 TCP localhost:37096->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1002u IPv6 5667108 0t0 TCP localhost:37150->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1003u IPv6 5671371 0t0 TCP localhost:37204->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1004u IPv6 5669581 0t0 TCP localhost:37258->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1005u IPv6 5669590 0t0 TCP localhost:37312->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1006u IPv6 5672226 0t0 TCP localhost:37366->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1007u IPv6 5670403 0t0 TCP localhost:37420->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1008u IPv6 5673068 0t0 TCP localhost:37474->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1009u IPv6 5669630 0t0 TCP localhost:37528->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1010u IPv6 5670467 0t0 TCP localhost:37582->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1011u IPv6 5670487 0t0 TCP localhost:37636->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1012u IPv6 5670512 0t0 TCP localhost:37690->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1013u IPv6 5669668 0t0 TCP localhost:37744->localhost:pmcd (ESTABLISHED)
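Both ends of these loopback connections can be seen in a single view (rather than cross-referencing two lsof runs) with something like the following, assuming iproute2's ss and pmcd listening on its default port 44321:

```shell
# Show established TCP connections touching the pmcd port, with the owning
# process on each local end (-p needs root to resolve other users' sockets).
ss -tnp state established '( sport = :44321 or dport = :44321 )'
```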
Any additional thoughts on this? I'm having this issue almost daily when looking at ~20 machines. I need to add another 350 machines to monitoring soon and would like to get this resolved.
Thanks, Kevin
@ke3vin Do you have a multiple-item params/target section in your scrape configs? In my case, removing those params and scraping everything from the /metrics endpoint almost eliminated the occurrence of the problem.
| Any additional thoughts on this? I'm having this issue almost daily when looking at ~20 machines. I need to add another 350 machines to monitoring soon and would like to get this resolved.
I've been flat out, sorry - and off on leave next week. If you can try to find a recipe that reproduces the failure as fast as possible, that'll help immensely when I get back (both for diagnosing and then testing fixes).
thanks.
@mac133k @ke3vin I've been trying (unsuccessfully) to reproduce this issue over the last couple of days. Submitting thousands of concurrent requests to and from pmproxy works just fine - I can see hundreds of new connections arriving very quickly and then they slowly time out or get reused, ultimately always ending up in the correct steady state of open fds - for both pmcd and pmproxy. I think I'm going to have to dig deeper into the Prometheus configurations being used to reproduce this. Can you provide prometheus.yml configuration snippets that show me how to mirror your setups? Thanks.
One of the mysteries to me is why pmcd is the one running out of fds (and why not also pmproxy) - AIUI in your setups you have a single Prometheus instance talking to many remote-host PCP setups (each remote host having one pmcd and one pmproxy, right?). In this situation we should see just a small number of fds for both pmcd and pmproxy. One fd between the two and one fd open to Prometheus for each host during a scrape - so Prometheus should have more open fds than anyone (assuming some form of keep-alive is happening there).
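To see which daemon leaks first, the fd counts on both sides could be sampled over time; a small diagnostic sketch (run as root so other users' /proc entries are readable):

```shell
# Print the current open-fd count for each PCP daemon of interest.
fd_count() { ls "/proc/$1/fd" 2>/dev/null | wc -l; }

for name in pmcd pmproxy; do
    for pid in $(pgrep -x "$name"); do
        echo "$name pid=$pid fds=$(fd_count "$pid")"
    done
done
```

Wrapping this in `watch` or a cron job would show whether the counts climb together or one side leads the other.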
Any additional information that would help me reproduce this would be much appreciated - thanks!
Sorry for the delay, I was in the middle of a big proposal. I should be able to focus on this again now. I've attached my Prometheus config. We've typically been recovering from the issue by removing the large log files and restarting pmcd, but I've recently noticed that this leaves pmproxy in a broken state. Doing 'wget localhost:44323/metrics' returns 400 Bad Request, so pmproxy also needs to be restarted.
No problem @ke3vin - thanks for those config details. Any chance I could get a temporary login on one of the systems affected to attempt further diagnosis? (if this could be feasible, please contact me privately via my redhat.com address to set this up - thanks).
Has this issue been fixed? I've got repro systems that are on RHEL 8.4 with PCP updated to 5.3:
rpm -qa | grep pcp
pcp-pmda-hacluster-5.3.5-8.el8.x86_64
pcp-libs-5.3.5-8.el8.x86_64
pcp-conf-5.3.5-8.el8.x86_64
pcp-selinux-5.3.5-8.el8.x86_64
pcp-5.3.5-8.el8.x86_64
@rsponholtz not specifically, but there have been many changes in the meantime since this issue was opened (and since pcp-5.3.x) - a reproducible test case would be invaluable here.
VM information
We have the same problem as reported by mac133k and ke3vin. A few dozen VMs reported disk-full errors, and the storage was essentially consumed by pmcd.log, which filled with log entries like:
[time stamp] pmcd(1030) Error: AcceptNewClient(0): Unexpected error from __pmAccept: 24: Too many open files
head of pmcd.log shows the following:
active agent dom pid in out ver protocol parameters
============ === ===== === === === ======== ==========
root 1 1033 6 7 2 bin pipe cmd=/var/lib/pcp/pmdas/root/pmdaroot
pmcd 2 2 dso i:6 lib=/var/lib/pcp/pmdas/pmcd/pmda_pmcd.so entry=pmcd_init [0x7fe4047d1fa0]
proc 3 1034 9 10 2 bin pipe cmd=/var/lib/pcp/pmdas/proc/pmdaproc -d 3
xfs 11 1039 11 12 2 bin pipe cmd=/var/lib/pcp/pmdas/xfs/pmdaxfs -d 11
linux 60 1044 13 14 2 bin pipe cmd=/var/lib/pcp/pmdas/linux/pmdalinux
pmproxy 4 2 dso i:7 lib=/var/lib/pcp/pmdas/mmv/pmda_mmv.so entry=pmproxy_init [0x7fe4047c0c50]
mmv 70 2 dso i:7 lib=/var/lib/pcp/pmdas/mmv/pmda_mmv.so entry=mmv_init [0x7fe4047c0c90]
jbd2 122 2 dso i:4 lib=/var/lib/pcp/pmdas/jbd2/pmda_jbd2.so entry=jbd2_init [0x7fe4047b7070]
kvm 95 1054 15 16 2 bin pipe cmd=/var/lib/pcp/pmdas/kvm/pmdakvm -d 95
Host access list:
00 01 Cur/MaxCons host-spec host-mask lvl host-name
== == =========== ======================================= ======================================= === ==============
y y 0 0 127.0.1.1 255.255.255.255 0 localhost
y y 0 0 / / 1 unix:
n 0 0 0.0.0.0 0.0.0.0 4 .*
n 0 0 :: :: 8 :*
User access list empty: user-based access control turned off
Group access list empty: group-based access control turned off
pmcd: PID = 1030, PDU version = 2
pmcd request port(s):
sts fd port family address
=== ==== ===== ====== =======
ok 4 unix /run/pcp/pmcd.socket
ok 0 44321 inet INADDR_ANY
ok 3 44321 ipv6 INADDR_ANY
__pmLogFindLocalPorts: scandir: Too many open files
__pmLogFindLocalPorts: scandir: Too many open files
__pmLogFindLocalPorts: scandir: Too many open files
__pmLogFindLocalPorts: scandir: Too many open files
syslog contains the following log entries. These pmie alerts may be an indication of the start of the abnormal event; it could also be that pmcd was running out of fds. The pmcd process was at high utilization.
Apr 23 07:28:17 node-116 systemd[1]: pmie_farm_check.service: Deactivated successfully.
Apr 23 07:32:00 node-116 pcp-pmie[1811]: TCP listen overflows 0.0529275/sec@node-116
Apr 23 07:32:35 node-116 pcp-pmie[1811]: High aggregate context switch rate 85431ctxsw/s@node-116
Apr 23 07:41:25 node-116 pcp-pmie[1811]: TCP listen overflows 0.224999/sec@node-116
Apr 23 07:42:35 node-116 pcp-pmie[1811]: High aggregate context switch rate 110349ctxsw/s@node-116
Apr 23 07:51:25 node-116 pcp-pmie[1811]: TCP listen overflows 0.299987/sec@node-116
Apr 23 07:52:35 node-116 pcp-pmie[1811]: High aggregate context switch rate 113741ctxsw/s@node-116
lsof -c pmcd | grep localhost | wc -l
returns 1004 with output similar to the following:
pmcd 1030 pcp 1001u IPv4 26875416 0t0 TCP localhost:44321->localhost:44886 (ESTABLISHED)
lsof -c pmproxy | grep localhost | wc -l
returns 1004 with the following type of output:
pmproxy 1119894 pcp 1012u IPv4 26875776 0t0 TCP localhost:33448->localhost:44321 (ESTABLISHED)
PCP is installed as-is, without any configuration changes. There is no monitoring or any other non-PCP-related process accessing pmcd/pmproxy/pmlogger.
I can't figure out why there are so many connections between pmcd and pmproxy without an apparent triggering event.
One batch of hundreds of VMs has not had any problems (over six months).
The second batch of hundreds of VMs (created last month) is not so fortunate, with dozens of VMs showing disk-full errors.
@weistonedawei do you observe the problem on any systems running a current version of PCP? (esp. in the v6.x.x series). Thanks.
@natoscott we have not tried the PCP v6.x.x series. We are actually not using PCP, although it is installed. It seems like pmproxy and pmcd are triggered by something to start communicating heavily. The 1k connections between pmcd and pmproxy appear to be persistent.
We also noticed that there have been pmproxy segfaults on some VMs. However, these segfaults do not occur exclusively on VMs with the aforementioned issue.
@weistonedawei oh, that's strange - is pmlogger running locally on the affected machines?
Yes, pmlogger is running locally on those affected machines.
@weistonedawei and is there a local redis-server running on that machine?
@natoscott No, there is no redis-server configured for pcp on these VMs. Sorry that I did not mention this.
No problem, just trying to get an understanding of what might be opening those fds. Would you be able to send me the PCP archive from the Apr 23rd incident date? (nathans-at-redhat-dot-com). Thanks.
Unfortunately I wouldn't be able to share the archives.
@weistonedawei no problem. If it's a privacy issue, there is a pmlogredact(1) tool you might consider to anonymize the archive.
pmchart shows sockets increased from about 100 to about 2000 in 5 hours. At around 17:30 something happened: the number of sockets increased in 10-minute intervals (which may be the sample interval). Prior to 17:30 there seemed to be nothing strange in the available PCP metrics (pmchart views).
We are actually not using pcp although it is installed [...] 10 minute interval (may be sample interval). [...] There is no monitoring or any other non-pcp related process accessing pmcd/pmproxy/pmlogger. [...] Can't figure why there are so many connections between pmcd and pmproxy without an apparent triggering event.
@weistonedawei something is definitely accessing pmcd and pmproxy regularly (could there be something/someone there accessing the pmproxy REST API?). We'll need more details of these accesses in order to help further. You may be able to use the -Dhttp debugging option to pmproxy to observe these requests.
I have about 30 VMs with CentOS 7 kernel v. 3.10.0-1160.42.2.el7.x86_64 and PCP version 5.3.5-1.x86_64 installed, core PMDAs only:
All of the hosts are only monitored with Prometheus. The problem I have been experiencing is that occasionally one of the hosts runs out of space on the root partition due to pmcd.log filling up with messages like this one:
It takes about 2 days for pmcd.log to grow from 0 to 30GB. As a workaround I set up daily log rotation for all PCP logs:
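The rotation config itself isn't shown here; a minimal /etc/logrotate.d sketch in this spirit (paths and thresholds are illustrative, not the poster's actual settings) might look like:

```
/var/log/pcp/pmcd/pmcd.log /var/log/pcp/pmproxy/pmproxy.log {
    daily
    maxsize 100M
    rotate 7
    compress
    missingok
    notifempty
    postrotate
        systemctl try-restart pmcd pmproxy >/dev/null 2>&1 || true
    endscript
}
```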
This reduced the occurrence of the problem, but did not eliminate it. I noticed pmcd logs of 3GB in size where log rotation and a service restart prevented further growth; however, today I saw a pmcd.log that grew to 30GB and filled up the free space, which probably caused logrotate to fail, because the logs of other services had not been rotated for 2 days either.
I am not sure if this is related, but a day before the incident there was a segfault recorded in /var/log/messages:
It seems like PCP recovered from this error. About 4 hours later log rotation took place and pmcd was restarted as expected. About 6 hours after the rotation the first error was logged in pmcd.log; however, the system log did not record any other problems around that time. In pmproxy.log there were 2 errors logged without timestamps:
No errors were recorded in PMDA's logs.
If anyone is interested in investigating this issue I can provide more debug info if necessary.