pmcd Error: Too many open files fills up the log

mac133k commented 2 years ago

I have about 30 VMs with CentOS 7 kernel v. 3.10.0-1160.42.2.el7.x86_64 and PCP version 5.3.5-1.x86_64 installed, core PMDAs only:

root    1   pipe    binary      /var/lib/pcp/pmdas/root/pmdaroot
pmcd    2   dso pmcd_init   /var/lib/pcp/pmdas/pmcd/pmda_pmcd.so
proc    3   pipe    binary      /var/lib/pcp/pmdas/proc/pmdaproc -d 3
pmproxy 4   dso pmproxy_init    /var/lib/pcp/pmdas/mmv/pmda_mmv.so
xfs 11  pipe    binary      /var/lib/pcp/pmdas/xfs/pmdaxfs -d 11
linux   60  pipe    binary      /var/lib/pcp/pmdas/linux/pmdalinux
mmv 70  dso mmv_init    /var/lib/pcp/pmdas/mmv/pmda_mmv.so
kvm 95  pipe    binary      /var/lib/pcp/pmdas/kvm/pmdakvm -d 95
jbd2    122 dso jbd2_init   /var/lib/pcp/pmdas/jbd2/pmda_jbd2.so

All of the hosts are only monitored with Prometheus. The problem I have been experiencing is that occasionally one of the host runs out of space on root partition due to pmcd.log filling up with messages like this one:

[Mon Feb 21 16:25:49] pmcd(99908) Error: AcceptNewClient(0): Unexpected error from __pmAccept: 24: Too many open files

It takes about 2 days for pmcd.log to grow from 0 to 30GB. As a workaround I set up daily log rotatation for all PCP logs:

/var/log/pcp/*/*.log {
    rotate 5
    daily
    missingok
    copytruncate
    su root root
    lastaction
        /bin/systemctl restart pmcd pmproxy
    endscript
}

This reduced the occurrence of the problem, but did not eliminate it. I noticed pmcd logs of 3GB in size where logrotation and service restart prevented a further growth, however today I saw pmcd.conf that grew up to 30GB and filled up the free space which which probably caused logrotate to fail, because logs of other service had not been rotated for 2 days as well.

I am not sure if this is related, but a day before the incident there was a segfault recorded in /var/log/messages:

Feb 18 23:47:21 node16 kernel: pmproxy[74409]: segfault at 0 ip 00007f84bbb04eb0 sp 00007f84b5e92b00 error 4 in libuv.so.1.0.0[7f84bbafa000+2c000]
Feb 18 23:47:21 node16 systemd: pmproxy.service: main process exited, code=killed, status=11/SEGV
Feb 18 23:47:21 node16 systemd: Unit pmproxy.service entered failed state.
Feb 18 23:47:21 node16 systemd: pmproxy.service failed.
Feb 18 23:47:22 node16 systemd: pmproxy.service holdoff time over, scheduling restart.
Feb 18 23:47:22 node16 systemd: Stopped Proxy for Performance Metrics Collector Daemon.
Feb 18 23:47:22 node16 systemd: Starting Proxy for Performance Metrics Collector Daemon...
Feb 18 23:47:22 node16 systemd: Started Proxy for Performance Metrics Collector Daemon.

It seems like PCP recovered from this error. About 4 hours later log rotation took place and pmcd was restarted as expected. About 6 hours after the rotation the first error was logged in pmcd.log, however system log did not record any other problems around that time. In pmproxy.log there were 2 errors logged without time stamps:

on_write_callback: ERROR uv_write failed
on_write_callback: ERROR uv_write failed

No errors were recorded in PMDA's logs.

If anyone is interested in investigating this issue I can provide more debug info if necessary.

natoscott commented 2 years ago

hi @mac133k - thanks for reporting this! It's a new one to us ...

pmcd.log filling up with messages like this one:

Could you capture contents of "sudo ls -l /proc/[pmcd PID]/fd" under these conditions and share? It'll be alot of output I imagine, so maybe a summary of what the file descriptors look like, or copy the whole lot to a github gist perhaps?

It takes about 2 days for pmcd.log to grow from 0 to 30GB.

Wow, OK. That log message is ~120bytes. If we work off 2 days, that's 172800 seconds - at 30GB, if my back-of-the-napkin math is correct you're getting this message logged ~1500 times per second ... ? I can't think of anything in PCP itself that might do that, so really interested to see that fd listing - it feels like it must be a rogue pmcd client rapidly attempting new connections.

| Feb 18 23:47:21 node16 kernel: pmproxy[74409]: segfault at 0 ip 00007f84bbb04eb0 sp 00007f84b5e92b00 error 4 in libuv.so.1.0.0[7f84bbafa000+2c000]

If you have a core file somewhere (abrt?) it'd be really handy to get a symbolic stack trace from this.

mac133k commented 2 years ago

I had another another occurrence of this problem. This time there was no segfault, so that must have been unrelated, unfortunately coredump was not available to look deeper into it. I did manage to capture more debug data for the main issue:

this time pmcd.log grew from 0 to 46GB in 32 hours.
there were 1024 FDs held by pmcd: 1012 sockets 13 pipes ** 1 file: pmcd.log
in /etc/security/limits.conf:
soft nofile 1048576
hard nofile 1048576
no other limits in /etc/security/limits.d

only one instance of each PCP process:

pcp       24331      1 80 Feb20 ?        1-23:05:32 /usr/libexec/pcp/bin/pmcd -A
root      24335  24331  5 Feb20 ?        03:22:57 /var/lib/pcp/pmdas/root/pmdaroot
root      24336  24335  5 Feb20 ?        03:02:23 /var/lib/pcp/pmdas/proc/pmdaproc -d 3
root      24338  24335  4 Feb20 ?        02:51:53 /var/lib/pcp/pmdas/xfs/pmdaxfs -d 11
root      24339  24335  4 Feb20 ?        02:37:18 /var/lib/pcp/pmdas/linux/pmdalinux
root      24340  24335  3 Feb20 ?        02:20:07 /var/lib/pcp/pmdas/kvm/pmdakvm -d 95
pcp       24341      1  0 Feb20 ?        00:12:28 /usr/libexec/pcp/bin/pmproxy -F -A

pmproxy.log recorded:

on_write_callback: ERROR uv_write failed
on_write_callback: ERROR uv_write failed

mac133k commented 2 years ago

A few more details that I need to add:

VMs are monitored by Prometheus via PMProxy
there are 2 Prometheus instances scraping metrics every 30 seconds: one instance scrapes everything from /metrics which amounts up to 6.5k metric instances (half of them being cgroup metrics), scrape durations are between 1.2 and 1.6 seconds another instance scrapes only selected metrics:

job_name: pcp honor_timestamps: true params: target:

kernel,mem,swap,network,filesys,disk,hinv


so the URL is ```http://< hostname >:44323/metrics?target=kernel%2Cmem%2Cswap%2Cnetwork%2Cfilesys%2Cdisk%2Chinv``` and the number of metric instances is about 1.5k which takes between 0.4 and 0.7 seconds to scrape.

PCP processes do not seem to be overwhelmed with scrape requests judging by the CPU and memory utilisation per process. I suspect using the target param may be somehow causing the issue, because I remember having a problem some time ago with PCP v5.0 or 5.1 where scrapes were incomplete or timing out until I changed Prometheus config and split the original job into multiple jobs with only one metric group as a target each. I'm going to kill the Prometheus instance that has targets in the config and see if it makes any difference here.

natoscott commented 2 years ago

| there were 1024 FDs held by pmcd:

Hmm, suspicious - seems like that maxfd setting of 1048576 hasn't taken effect? Might be worth checking it while pmcd is running via "cat /proc/[pmcd PID]/limits".

mac133k commented 2 years ago

Indeed:

# cat /proc/83150/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             31128                31128                processes 
Max open files            1024                 4096                 files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       31128                31128                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us

However 1024 is way more than a healthy PMCD process needs, so I would rather dig deeper to find the root cause than lift this limit wherever it is set.

natoscott commented 2 years ago

Agreed, 1024 should be plenty. I wonder if we might be seeing a failure to cleanup timed-out connections between pmproxy and pmcd. You could experiment with larger values (well beyond your Prometheus scrape interval) of the polltimeout setting - in /etc/pcp/pmproxy/pmproxy.conf that would be:

[pmwebapi]
timeout = 120

(this is a value in seconds)

mac133k commented 2 years ago

Querying metric endpoints without the target params helped to significantly reduce the occurrence of this issue, but did not eliminate it completely.

ke3vin commented 2 years ago

Just thought I'd chime in here- I'm having the same problem. Using prometheus, RedHat 8.5, and PCP 5.3.5. pmcd runs fine for weeks, but at some point something makes it go nuts and it rapidly consumes file descriptors until it hits the limit, and then the log file fills up my disk partition very rapidly.

Kevin

-- Kevin Hildebrand University of Maryland

natoscott commented 2 years ago

@ke3vin @mac133k could you do some analysis of the contents of the /proc/[pmcd PID]/fds directory when the issue occurs? I think that's got to be the first bit of information we need to diagnose further.

If someone can produce an exact recipe to induce the failure (ideally as fast as possible), please paste it here too. Thanks!

ke3vin commented 2 years ago

I haven't figured out how to duplicate it yet, but as I have it on about two dozen hosts, at least one of them seems to get in the situation almost daily. I've got one that way now. Lots of open sockets from localhost->localhost:

From lsof: (just a small chunk of the thousand+ open sockets) pmcd 2236 pcp 1009u IPv6 5670302 0t0 TCP localhost:pmcd->localhost:37096 (ESTABLISHED) pmcd 2236 pcp 1010u IPv6 5670335 0t0 TCP localhost:pmcd->localhost:37150 (ESTABLISHED) pmcd 2236 pcp 1011u IPv6 5670343 0t0 TCP localhost:pmcd->localhost:37204 (ESTABLISHED) pmcd 2236 pcp 1012u IPv6 5671407 0t0 TCP localhost:pmcd->localhost:37258 (ESTABLISHED) pmcd 2236 pcp 1013u IPv6 5670348 0t0 TCP localhost:pmcd->localhost:37312 (ESTABLISHED) pmcd 2236 pcp 1014u IPv6 5670397 0t0 TCP localhost:pmcd->localhost:37366 (ESTABLISHED) pmcd 2236 pcp 1015u IPv6 5673037 0t0 TCP localhost:pmcd->localhost:37420 (ESTABLISHED) pmcd 2236 pcp 1016u IPv6 5669608 0t0 TCP localhost:pmcd->localhost:37474 (ESTABLISHED) pmcd 2236 pcp 1017u IPv6 5675230 0t0 TCP localhost:pmcd->localhost:37528 (ESTABLISHED) pmcd 2236 pcp 1018u IPv6 5675262 0t0 TCP localhost:pmcd->localhost:37582 (ESTABLISHED) pmcd 2236 pcp 1019u IPv6 5669637 0t0 TCP localhost:pmcd->localhost:37636 (ESTABLISHED) pmcd 2236 pcp 1020u IPv6 5676375 0t0 TCP localhost:pmcd->localhost:37690 (ESTABLISHED)

A similar snippet from /proc/2236/fd lrwx------. 1 root root 64 May 9 08:48 979 -> 'socket:[5652133]' lrwx------. 1 root root 64 May 9 08:48 98 -> 'socket:[5245025]' lrwx------. 1 root root 64 May 9 08:48 980 -> 'socket:[5645176]' lrwx------. 1 root root 64 May 9 08:48 981 -> 'socket:[5655122]' lrwx------. 1 root root 64 May 9 08:48 982 -> 'socket:[5652150]' lrwx------. 1 root root 64 May 9 08:48 983 -> 'socket:[5645233]' lrwx------. 1 root root 64 May 9 08:48 984 -> 'socket:[5655916]' lrwx------. 1 root root 64 May 9 08:48 985 -> 'socket:[5655964]' lrwx------. 1 root root 64 May 9 08:48 986 -> 'socket:[5645247]' lrwx------. 1 root root 64 May 9 08:48 987 -> 'socket:[5661697]' lrwx------. 1 root root 64 May 9 08:48 988 -> 'socket:[5656244]' lrwx------. 1 root root 64 May 9 08:48 989 -> 'socket:[5661430]' lrwx------. 1 root root 64 May 9 08:48 99 -> 'socket:[5243452]' lrwx------. 1 root root 64 May 9 08:48 990 -> 'socket:[5661491]'

I'll leave the broken process running for a bit, if you've got specific stuff you want me to look for, let me know.

Thanks, Kevin

On Sun, May 8, 2022 at 8:11 PM Nathan Scott @.***> wrote:

@ke3vin https://github.com/ke3vin @mac133k https://github.com/mac133k could you do some analysis of the contents of the /proc/[pmcd PID]/fds directory when the issue occurs? I think that's got to be the first bit of information we need to diagnose further.

If someone can produce an exact recipe to induce the failure (ideally as fast as possible), please paste it here too. Thanks!

— Reply to this email directly, view it on GitHub https://github.com/performancecopilot/pcp/issues/1533#issuecomment-1120514995, or unsubscribe https://github.com/notifications/unsubscribe-auth/AATLPEMH5L5UF2CUCJPBARDVJBJ2XANCNFSM5O7JC6TQ . You are receiving this because you were mentioned.Message ID: @.***>

ke3vin commented 2 years ago

Ah, at least on my host, it's 'pmproxy' that has the other end of those connections...

pmproxy 2312 pcp 1000u IPv6 5656573 0t0 TCP localhost:37042->localhost:pmcd (ESTABLISHED) pmproxy 2312 pcp 1001u IPv6 5669522 0t0 TCP localhost:37096->localhost:pmcd (ESTABLISHED) pmproxy 2312 pcp 1002u IPv6 5667108 0t0 TCP localhost:37150->localhost:pmcd (ESTABLISHED) pmproxy 2312 pcp 1003u IPv6 5671371 0t0 TCP localhost:37204->localhost:pmcd (ESTABLISHED) pmproxy 2312 pcp 1004u IPv6 5669581 0t0 TCP localhost:37258->localhost:pmcd (ESTABLISHED) pmproxy 2312 pcp 1005u IPv6 5669590 0t0 TCP localhost:37312->localhost:pmcd (ESTABLISHED) pmproxy 2312 pcp 1006u IPv6 5672226 0t0 TCP localhost:37366->localhost:pmcd (ESTABLISHED) pmproxy 2312 pcp 1007u IPv6 5670403 0t0 TCP localhost:37420->localhost:pmcd (ESTABLISHED) pmproxy 2312 pcp 1008u IPv6 5673068 0t0 TCP localhost:37474->localhost:pmcd (ESTABLISHED) pmproxy 2312 pcp 1009u IPv6 5669630 0t0 TCP localhost:37528->localhost:pmcd (ESTABLISHED) pmproxy 2312 pcp 1010u IPv6 5670467 0t0 TCP localhost:37582->localhost:pmcd (ESTABLISHED) pmproxy 2312 pcp 1011u IPv6 5670487 0t0 TCP localhost:37636->localhost:pmcd (ESTABLISHED) pmproxy 2312 pcp 1012u IPv6 5670512 0t0 TCP localhost:37690->localhost:pmcd (ESTABLISHED) pmproxy 2312 pcp 1013u IPv6 5669668 0t0 TCP localhost:37744->localhost:pmcd (ESTABLISHED)

On Mon, May 9, 2022 at 8:49 AM Kevin M. Hildebrand @.***> wrote:

I haven't figured out how to duplicate it yet, but as I have it on about two dozen hosts, at least one of them seems to get in the situation almost daily. I've got one that way now. Lots of open sockets from localhost->localhost:

From lsof: (just a small chunk of the thousand+ open sockets) pmcd 2236 pcp 1009u IPv6 5670302 0t0 TCP localhost:pmcd->localhost:37096 (ESTABLISHED) pmcd 2236 pcp 1010u IPv6 5670335 0t0 TCP localhost:pmcd->localhost:37150 (ESTABLISHED) pmcd 2236 pcp 1011u IPv6 5670343 0t0 TCP localhost:pmcd->localhost:37204 (ESTABLISHED) pmcd 2236 pcp 1012u IPv6 5671407 0t0 TCP localhost:pmcd->localhost:37258 (ESTABLISHED) pmcd 2236 pcp 1013u IPv6 5670348 0t0 TCP localhost:pmcd->localhost:37312 (ESTABLISHED) pmcd 2236 pcp 1014u IPv6 5670397 0t0 TCP localhost:pmcd->localhost:37366 (ESTABLISHED) pmcd 2236 pcp 1015u IPv6 5673037 0t0 TCP localhost:pmcd->localhost:37420 (ESTABLISHED) pmcd 2236 pcp 1016u IPv6 5669608 0t0 TCP localhost:pmcd->localhost:37474 (ESTABLISHED) pmcd 2236 pcp 1017u IPv6 5675230 0t0 TCP localhost:pmcd->localhost:37528 (ESTABLISHED) pmcd 2236 pcp 1018u IPv6 5675262 0t0 TCP localhost:pmcd->localhost:37582 (ESTABLISHED) pmcd 2236 pcp 1019u IPv6 5669637 0t0 TCP localhost:pmcd->localhost:37636 (ESTABLISHED) pmcd 2236 pcp 1020u IPv6 5676375 0t0 TCP localhost:pmcd->localhost:37690 (ESTABLISHED)

A similar snippet from /proc/2236/fd lrwx------. 1 root root 64 May 9 08:48 979 -> 'socket:[5652133]' lrwx------. 1 root root 64 May 9 08:48 98 -> 'socket:[5245025]' lrwx------. 1 root root 64 May 9 08:48 980 -> 'socket:[5645176]' lrwx------. 1 root root 64 May 9 08:48 981 -> 'socket:[5655122]' lrwx------. 1 root root 64 May 9 08:48 982 -> 'socket:[5652150]' lrwx------. 1 root root 64 May 9 08:48 983 -> 'socket:[5645233]' lrwx------. 1 root root 64 May 9 08:48 984 -> 'socket:[5655916]' lrwx------. 1 root root 64 May 9 08:48 985 -> 'socket:[5655964]' lrwx------. 1 root root 64 May 9 08:48 986 -> 'socket:[5645247]' lrwx------. 1 root root 64 May 9 08:48 987 -> 'socket:[5661697]' lrwx------. 1 root root 64 May 9 08:48 988 -> 'socket:[5656244]' lrwx------. 1 root root 64 May 9 08:48 989 -> 'socket:[5661430]' lrwx------. 1 root root 64 May 9 08:48 99 -> 'socket:[5243452]' lrwx------. 1 root root 64 May 9 08:48 990 -> 'socket:[5661491]'

I'll leave the broken process running for a bit, if you've got specific stuff you want me to look for, let me know.

Thanks, Kevin

On Sun, May 8, 2022 at 8:11 PM Nathan Scott @.***> wrote:

@ke3vin https://github.com/ke3vin @mac133k https://github.com/mac133k could you do some analysis of the contents of the /proc/[pmcd PID]/fds directory when the issue occurs? I think that's got to be the first bit of information we need to diagnose further.

If someone can produce an exact recipe to induce the failure (ideally as fast as possible), please paste it here too. Thanks!

— Reply to this email directly, view it on GitHub https://github.com/performancecopilot/pcp/issues/1533#issuecomment-1120514995, or unsubscribe https://github.com/notifications/unsubscribe-auth/AATLPEMH5L5UF2CUCJPBARDVJBJ2XANCNFSM5O7JC6TQ . You are receiving this because you were mentioned.Message ID: @.***>

ke3vin commented 2 years ago

Any additional thoughts on this? I'm having this issue almost daily when looking at ~20 machines. I need to add another 350 machines to monitoring soon and would like to get this resolved.

Thanks, Kevin

mac133k commented 2 years ago

@ke3vin Do you have a multiple-item params/target section in your scrape configs? It did help in my case to reduce or almost eliminate the occurrence of the problem when I removed those params and started scraping everything from the /metrics endpoints.

natoscott commented 2 years ago

Any additional thoughts on this? I'm having this issue almost daily when looking at ~20 machines. I need to add another 350 machines to monitoring soon and would like to get this resolved.

I've been flat out, sorry - and off on leave next week. If you can try to find a recipe that reproduces the failure as fast as possible, that'll help immensely when I get back (both for diagnosing and then testing fixes).

thanks.

natoscott commented 2 years ago

@mac133k @ke3vin I've been trying (unsuccessfully) to reproduce this issue over the last couple of days. Submitting thousands of concurrent requests to and from pmproxy works just fine - I can see hundreds of new corrections arriving very quickly and then they slowly timeout or get reused, ultimately always ending up in the correct steady state of open fds - for both pmcd and pmproxy. I think I'm going to have to dig deeper into the Prometheus configurations being used to reproduce this. Can you provide prometheus.yml configuration snippets that show me how to mirror your setups? Thanks.

One of the mysteries to me is why pmcd is the one running out of fds (and why not also pmproxy) - AIUI in your setups you have a single Prometheus instance talking to many remote-host PCP setups (each remote host having one pmcd and one pmproxy, right?). In this situation we should see just a small number of fds for both pmcd and pmproxy. One fd between the two and one fd open to Prometheus for each host during a scrape - so Prometheus should have more open fds than anyone (assuming some form of keep-alive is happening there).

Any additional information that would help me reproduce this would be much appreciated - thanks!

ke3vin commented 2 years ago

Sorry for the delay, was in the middle of a big proposal. I should be back to being able to focus on this now. I've attached my prometheus config. We've been typically recovering from the issue by removing the large log files and restarting pmcd, but I've recently noticed that that appears to be leaving pmproxy in a broken state. Doing 'wget localhost:44323/metrics' returns 400 Bad Request. So pmproxy also needs to be restarted.

prometheus.yml.txt

natoscott commented 2 years ago

No problem @ke3vin - thanks for those config details. Any chance I could get a temporary login on one of the systems affected to attempt further diagnosis? (if this could be feasible, please contact me privately via my redhat.com address to set this up - thanks).

rsponholtz commented 4 months ago

Has this issue been fixed? I've got repro systems that are on RHEL 8.4 with PCP updated to 5.3: rpm -qa | grep pcp pcp-pmda-hacluster-5.3.5-8.el8.x86_64 pcp-libs-5.3.5-8.el8.x86_64 pcp-conf-5.3.5-8.el8.x86_64 pcp-selinux-5.3.5-8.el8.x86_64 pcp-5.3.5-8.el8.x86_64

natoscott commented 4 months ago

@rsponholtz not specifically, but there have been many changes in the meantime since this issue was opened (and since pcp-5.3.x) - a reproducible test case would be invaluable here.

weistonedawei commented 4 months ago

VM information

VM OS: Ubuntu 22.04
Kernel: 5.15.0-87-generic
pcp package: 5.3.6-1build1

We have got the same problem as reported by mac133k and ke3vin. A few dozen VMs reported disk full error and the storage was basically taken by pmcd.log filled with [time stamp] pmcd(1030) Error: AcceptNewClient(0): Unexpected error from __pmAccept: 24: Too many open files log entries.

head of pmcd.log shows the following:


active agent dom   pid  in out ver protocol parameters
============ === ===== === === === ======== ==========
root           1  1033   6   7   2 bin pipe cmd=/var/lib/pcp/pmdas/root/pmdaroot
pmcd           2                 2 dso i:6  lib=/var/lib/pcp/pmdas/pmcd/pmda_pmcd.so entry=pmcd_init [0x7fe4047d1fa0]
proc           3  1034   9  10   2 bin pipe cmd=/var/lib/pcp/pmdas/proc/pmdaproc -d 3
xfs           11  1039  11  12   2 bin pipe cmd=/var/lib/pcp/pmdas/xfs/pmdaxfs -d 11
linux         60  1044  13  14   2 bin pipe cmd=/var/lib/pcp/pmdas/linux/pmdalinux
pmproxy        4                 2 dso i:7  lib=/var/lib/pcp/pmdas/mmv/pmda_mmv.so entry=pmproxy_init [0x7fe4047c0c50]
mmv           70                 2 dso i:7  lib=/var/lib/pcp/pmdas/mmv/pmda_mmv.so entry=mmv_init [0x7fe4047c0c90]
jbd2         122                 2 dso i:4  lib=/var/lib/pcp/pmdas/jbd2/pmda_jbd2.so entry=jbd2_init [0x7fe4047b7070]
kvm           95  1054  15  16   2 bin pipe cmd=/var/lib/pcp/pmdas/kvm/pmdakvm -d 95

Host access list:
00 01 Cur/MaxCons host-spec                               host-mask                               lvl host-name
== == =========== ======================================= ======================================= === ==============
 y  y     0     0 127.0.1.1                               255.255.255.255                           0 localhost
 y  y     0     0 /                                       /                                         1 unix:
    n     0     0 0.0.0.0                                 0.0.0.0                                   4 .*
    n     0     0 ::                                      ::                                        8 :*
User access list empty: user-based access control turned off
Group access list empty: group-based access control turned off

pmcd: PID = 1030, PDU version = 2
pmcd request port(s):
  sts fd   port  family address
  === ==== ===== ====== =======
  ok     4       unix   /run/pcp/pmcd.socket
  ok     0 44321 inet   INADDR_ANY
  ok     3 44321 ipv6   INADDR_ANY
__pmLogFindLocalPorts: scandir: Too many open files
__pmLogFindLocalPorts: scandir: Too many open files
__pmLogFindLocalPorts: scandir: Too many open files
__pmLogFindLocalPorts: scandir: Too many open files

syslog contains the following log entries. These pmie alerts maybe an indication of the start of the abnormal event. Could also be pmcd is running out of fds. pmcd process was at high utilization.

Apr 23 07:28:17 node-116 systemd[1]: pmie_farm_check.service: Deactivated successfully.
Apr 23 07:32:00 node-116 pcp-pmie[1811]: TCP listen overflows 0.0529275/sec@node-116
Apr 23 07:32:35 node-116 pcp-pmie[1811]: High aggregate context switch rate 85431ctxsw/s@node-116
Apr 23 07:41:25 node-116 pcp-pmie[1811]: TCP listen overflows 0.224999/sec@node-116
Apr 23 07:42:35 node-116 pcp-pmie[1811]: High aggregate context switch rate 110349ctxsw/s@node-116
Apr 23 07:51:25 node-116 pcp-pmie[1811]: TCP listen overflows 0.299987/sec@node-116
Apr 23 07:52:35 node-116 pcp-pmie[1811]: High aggregate context switch rate 113741ctxsw/s@node-116

lsof -c pmcd | grep localhost | wc -l returns 1004 with output similar to the following:

pmcd    1030  pcp 1001u  IPv4           26875416          0t0     TCP localhost:44321->localhost:44886 (ESTABLISHED)

lsof -c pmproxy | grep localhost | wc -l returns 1004 with the following type of output:

pmproxy 1119894  pcp 1012u     IPv4           26875776      0t0     TCP localhost:33448->localhost:44321 (ESTABLISHED)

pcp is installed as is without any configuration changes. There is no monitoring or any other non-pcp related process accessing pmcd/pmproxy/pmlogger.

Can't figure why there are so many connections between pmcd and pmproxy without an apparent triggering event.

One batch of hundreds of VMs does have any problems (over six months).

This 2nd batch of hundreds of VMs (created last month) is not so fortunate with dozens of VMs showing disk full error.

natoscott commented 4 months ago

@weistonedawei do you observe the problem on any systems running a current version of PCP? (esp. in the v6.x.x series). Thanks.

weistonedawei commented 4 months ago

@natoscott we have not tried PCP v6.x.x series. We are actually not using pcp although it is installed. Seems like pmproxy and pmcd is triggered by something to start communicating heavily. The 1k connections between pmcd and pmproxy are persistent it seems.

Noticed also on some VMs that there have been pmproxy segfaults. However, these segfaults do not exclusively occur on VMs with forementioned issue.

natoscott commented 4 months ago

@weistonedawei oh, thats strange - is pmlogger running locally on the affected machines?

weistonedawei commented 4 months ago

Yes, pmlogger is running locally on those affected machines.

natoscott commented 4 months ago

@weistonedawei and is there a local redis-server running on that machine?

weistonedawei commented 4 months ago

@natoscott No, there is no redis-server configured for pcp on these VMs. Sorry that I did not mention this.

natoscott commented 4 months ago

No problem, just trying to get an understanding of what might be opening those fds. Would you be able to send me the PCP archive from the Apr 23rd incident date? (nathans-at-redhat-dot-com). Thanks.

weistonedawei commented 4 months ago

Unfortunately I wouldn't be able to share the archives.

On Tue, May 14, 2024, 3:17 PM Nathan Scott @.***> wrote:

No problem, just trying to get an understanding of what might be opening those fds. Would you be able to send me the PCP archive from the Apr 23rd incident date? (nathans-at-redhat-dot-com). Thanks.

— Reply to this email directly, view it on GitHub https://github.com/performancecopilot/pcp/issues/1533#issuecomment-2111228435, or unsubscribe https://github.com/notifications/unsubscribe-auth/AMK3SRYJVXD3FV5OWP4XKADZCKEO3AVCNFSM5O7JC6T2U5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TEMJRGEZDEOBUGM2Q . You are receiving this because you were mentioned.Message ID: @.***>

natoscott commented 4 months ago

@weistonedawei no problem. If its a privacy issue, there is a pmlogredact(1) tool you might consider to anonymize the archive.

weistonedawei commented 4 months ago

node116_sockets_12minute_resolution

pmchart shows sockets increased from about 100 to about 2000 in 5 hours. At 17:30 or so something happened. The numbers of sockets increased in 10 minute interval (may be sample interval). Prior to 17:30 there seemed nothing strange from available pcp metrics (pmchart views).

natoscott commented 4 months ago

We are actually not using pcp although it is installed [...] 10 minute interval (may be sample interval). [...] There is no monitoring or any other non-pcp related process accessing pmcd/pmproxy/pmlogger. [...] Can't figure why there are so many connections between pmcd and pmproxy without an apparent triggering event.

@weistonedawei something is definitely accessing pmcd and pmproxy regularly (could there be something/someone there accessing the pmproxy REST API?). We'll need more details of these accesses in order to help further. You may be able to use the -Dhttp debugging option to pmproxy to observe these requests.

performancecopilot / pcp

pmcd Error: Too many open files fills up the log #1533