performancecopilot / pcp

Performance Co-Pilot
https://pcp.io

Possible memory leak in pmproxy #804

Closed: oleksandrborniak closed this issue 4 years ago

oleksandrborniak commented 4 years ago

Is it normal that pmproxy consumes about 11GB of memory in 24 hours?

OS: Fedora 31. Packages:

[root@fc ~]# rpm -qa | grep pcp
pcp-pmda-bcc-5.0.1-1.fc31.x86_64
pcp-libs-5.0.1-1.fc31.x86_64
pcp-5.0.1-1.fc31.x86_64
pcp-doc-5.0.1-1.fc31.noarch
cockpit-pcp-207-1.fc31.x86_64
pcp-system-tools-5.0.1-1.fc31.x86_64
pcp-selinux-5.0.1-1.fc31.x86_64
pcp-conf-5.0.1-1.fc31.x86_64
python3-pcp-5.0.1-1.fc31.x86_64
[root@fc ~]#

Here are the statistics after one hour:

[root@fc ~]# systemctl status pmproxy
● pmproxy.service - Proxy for Performance Metrics Collector Daemon
   Loaded: loaded (/usr/lib/systemd/system/pmproxy.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2019-12-04 11:14:09 EET; 1h 0min ago
     Docs: man:pmproxy(8)
  Process: 124471 ExecStart=/usr/share/pcp/lib/pmproxy start (code=exited, status=0/SUCCESS)
 Main PID: 124518 (pmproxy)
    Tasks: 2 (limit: 19172)
   Memory: 536.5M
      CPU: 1min 32.831s
   CGroup: /system.slice/pmproxy.service
           └─124518 /usr/libexec/pcp/bin/pmproxy

Dec 04 11:14:09 fc.my-clouds.net systemd[1]: Starting Proxy for Performance Metrics Collector Daemon...
Dec 04 11:14:09 fc.my-clouds.net pmproxy[124471]: Starting pmproxy ...
Dec 04 11:14:09 fc.my-clouds.net systemd[1]: Started Proxy for Performance Metrics Collector Daemon.
[root@fc ~]#
[root@fc tools]# ./memleak -ap $(pidof pmproxy) 600
    316400 bytes in 7910 allocations from stack
        [unknown] [libpcp_web.so.1]
        [unknown] [libpcp_web.so.1]
        [unknown]
    316400 bytes in 7910 allocations from stack
        [unknown] [libpcp_web.so.1]
        [unknown] [libpcp_web.so.1]
        [unknown]
    316400 bytes in 7910 allocations from stack
        [unknown] [libpcp_web.so.1]
        [unknown] [libpcp_web.so.1]
        [unknown]
    316400 bytes in 7910 allocations from stack
        [unknown] [libpcp_web.so.1]
        [unknown] [libpcp_web.so.1]
        [unknown]
    1634400 bytes in 40860 allocations from stack
        [unknown] [libpcp_web.so.1]
        [unknown] [libpcp_web.so.1]
        [unknown]
    1634400 bytes in 40860 allocations from stack
        [unknown] [libpcp_web.so.1]
        [unknown] [libpcp_web.so.1]
        [unknown]
    1634400 bytes in 40860 allocations from stack
        [unknown] [libpcp_web.so.1]
        [unknown] [libpcp_web.so.1]
        [unknown]
    1635360 bytes in 40870 allocations from stack
        [unknown] [libpcp_web.so.1]
        [unknown] [libpcp_web.so.1]
        [unknown]
    6117289 bytes in 25416 allocations from stack
        [unknown] [libpcp_web.so.1]
    145906579 bytes in 215901 allocations from stack
        [unknown] [libpcp_web.so.1]
^C[root@fc tools]
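
As an aside, the [unknown] frames above are most likely just missing debug symbols rather than a problem with the trace; on Fedora, installing them should let memleak resolve the libpcp_web.so.1 call sites, e.g.:

sudo dnf debuginfo-install pcp-libs
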
natoscott commented 4 years ago

@oleksandrborniak these issues should now all be resolved, via the commits below, which are destined for pcp-5.0.2 (scheduled for Wed 11th Dec).

85b773befe2c0b25ef5792b56f2467af5a7caae1
f871de05b800c4e114e8493079fc6ae09e4b0e16
e91dc3ac386563d7299d160dc05ddeee521fda21
2c337c4492d8c904942fcbe8fb317f8264e9d01c
a797f08c0fd7d674592ca9642047143beaafecc9
0c3d90349a0e0fa21c8f1d9f7e9a9c8d7a145377
a9495cb066e770a6df61aa5116ef4f091ff02407

If you'd like to try building PCP from the master branch in the meantime to confirm, that'd be excellent.
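
For anyone wanting to try that, one rough way to build and install the development packages on Fedora (a sketch only; exact steps, build dependencies and package output locations can differ, so check the build notes in the repo):

git clone https://github.com/performancecopilot/pcp.git
cd pcp
./Makepkgs --verbose              # builds packages for the local distro (RPMs on Fedora)
# install/upgrade the resulting pcp and pcp-libs RPMs, then restart the service:
sudo systemctl restart pmproxy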

oleksandrborniak commented 4 years ago

@natoscott Thank you Nathan! I'll try pcp-5.0.2 when it is available in the Fedora repository. Have a nice day!

oleksandrborniak commented 4 years ago

@natoscott It looks like the issue hasn't been resolved with pcp-5.0.2:

[root@fc ~]# rpm -qa | grep pcp
pcp-system-tools-5.0.2-1.fc31.x86_64
pcp-conf-5.0.2-1.fc31.x86_64
pcp-pmda-bcc-5.0.2-1.fc31.x86_64
python3-pcp-5.0.2-1.fc31.x86_64
cockpit-pcp-208-1.fc31.x86_64
pcp-libs-5.0.2-1.fc31.x86_64
pcp-5.0.2-1.fc31.x86_64
pcp-selinux-5.0.2-1.fc31.x86_64
pcp-doc-5.0.2-1.fc31.noarch
[root@fc ~]#

[root@fc ~]# uptime
 11:51:04 up 1 day, 24 min,  1 user,  load average: 0.00, 0.05, 0.04
[root@fc ~]#
[root@fc ~]# systemctl status pmproxy
● pmproxy.service - Proxy for Performance Metrics Collector Daemon
   Loaded: loaded (/usr/lib/systemd/system/pmproxy.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2020-01-05 11:26:23 EET; 24h ago
     Docs: man:pmproxy(8)
  Process: 1049 ExecStart=/usr/share/pcp/lib/pmproxy start (code=exited, status=0/SUCCESS)
 Main PID: 1243 (pmproxy)
    Tasks: 2 (limit: 19172)
   Memory: 11.9G
      CPU: 5min 54.430s
   CGroup: /system.slice/pmproxy.service
           └─1243 /usr/libexec/pcp/bin/pmproxy

Jan 05 11:26:22 fc.my-clouds.net systemd[1]: Starting Proxy for Performance Metrics Collector Daemon...
Jan 05 11:26:23 fc.my-clouds.net pmproxy[1049]: Starting pmproxy ...
Jan 05 11:26:23 fc.my-clouds.net systemd[1]: Started Proxy for Performance Metrics Collector Daemon.
[root@fc ~]#
===============================================================================
  pcp-5.0.2-1.fc31
===============================================================================
  Update ID: FEDORA-2019-ae1b110581
       Type: bugfix
    Updated: 2020-01-06 02:14:25
Description:     - Server-side utilities and log management scripts:
           :    pmproxy: REST API to report all label values by name
           :    pmproxy: fix memory leak when re-using http connections
           :    pmproxy: fix memory leak on a failed REST API call
           :    pmproxy: improvements to the instrumentation mechanism
           :    pmproxy: add support for AF_UNIX local client connections
           :    logutil: fix pmlogger service timeout on new installations
           :    pmie: on SIGTERM, exit 0 for a normal shutdown rather than exit 15
           :    pmcd: add labels PDUs into the trace set
           :     - PMDA additions, enhancements and bug fixes:
           :    pmdaproc: support for kernel cgroups v2 (and systemd using same)
           :    pmdalinux: update per-device read_bytes, write_bytes metric types
           :    pmdabpftrace: implement label callback
           :    pmdabpftrace: clear context tab on endcontext
           :    pmdanetcheck: run as pcp user, document ICMP Echo socket permissions
           :    pmdaperfevent: fix memory leaks and remove some unused code
           :     - Client tools and utilities:
           :    pcp-atop: update to latest upstream atop v2.5.0 sources
           :    pcp-dstat: no stack trace when dstat exits due to a signal
           :    pcp-dstat: fix handling of very large numbers of disk devices
           :    pcp-dstat: exit with code zero when archive end reached
           :    pmieconf: add a pcp-zeroconf option for enabling all_threads
           :    pmseries: fix pointer arithmetic on series ID set intersection
           :    pmseries: add new option to report all label values by name
           :    pmseries: fix and test regular expression pattern matching
           :     - libpcp, libpcp_pmda, libpcp_mmv, libpcp_web and language bindings
           :    libpcp: fix potential use-after-free in labels code
           :    libpcp_web: fix file descriptor and memory leaks in discovery code
           :    libpcp_web: resolve valgrind issues observed in qa tests
           :    libpcp_pmda: show PM_ERR_VALUE errors only if libpmda debug flag set
           :    python API: add endcontext callback for PMDAs
natoscott commented 4 years ago

@oleksandrborniak looks like it's much improved (11G vs 2G in 24h) but some work remains, agreed.

Can you describe your workload? In particular which pmproxy REST API end points you're using, and could you paste a copy of your pmproxy.log file? Thanks.

natoscott commented 4 years ago

Oh wait, I've misread; it looks like it's much the same and not improved in your case.

oleksandrborniak commented 4 years ago

@natoscott It is a test VM; I can give you access if you provide me with your ssh public key.

| Can you describe your workload?

It is not under load at all.

[root@fc ~]# while true ; do  systemctl status pmproxy | grep Memory ; sleep 120 ; done
   Memory: 99.1M
   Memory: 116.8M
   Memory: 134.5M
^C
[root@fc ~]#
[root@fc ~]# cat  /var/log/pcp/pmproxy/pmproxy.log
Log for pmproxy on fc.my-clouds.net started Tue Jan  7 09:10:33 2020

pmproxy: PID = 1209, PDU version = 2, user = pcp (983)
pmproxy request port(s):
  sts fd   port  family address
  === ==== ===== ====== =======
  ok    13       unix   /run/pcp/pmproxy.socket
  ok    14 44322 inet   INADDR_ANY
  ok    18 44322 ipv6   INADDR_ANY
  ok    19 44323 inet   INADDR_ANY
  ok    20 44323 ipv6   INADDR_ANY
[Tue Jan  7 09:10:33] pmproxy(1209) Info: OpenSSL 1.1.1d FIPS  10 Sep 2019 - no certificates found
[root@fc ~]#
natoscott commented 4 years ago

@oleksandrborniak I've reproduced this locally now, thanks. It appears to be related to pmlogger archive discovery somehow. I notice in your pmproxy.log that you're not using Redis, so there's no need for that code to be activated at all in your case.

Until it's fixed you can work around the issue by setting enabled=false in the [discover] section of /etc/pcp/pmproxy/pmproxy.conf (see the excerpt after the output below). With that, I go from the used-memory situation you observed to:

# while true ; do  systemctl status pmproxy | grep Memory ; sleep 120 ; done
   Memory: 3.8M
   Memory: 3.8M
   Memory: 3.8M
   Memory: 3.8M
   Memory: 3.8M
   Memory: 3.8M
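
For reference, the relevant stanza in /etc/pcp/pmproxy/pmproxy.conf then looks something like this (only the [discover] section shown; restart pmproxy afterwards so it takes effect):

[discover]
# temporarily disable pmlogger archive discovery to work around the leak
enabled = false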
oleksandrborniak commented 4 years ago

@natoscott Thank you!

rfenouil commented 2 years ago

Hello,

I think I am facing a memory problem with pmproxy (I use Redis, though I'm not sure I really need it). On every computer where pmproxy is active (and with a remote Grafana interface querying it), I see steadily increasing memory usage until it crashes. In the attached capture, I added a cron command at midnight to restart pmproxy:

0 0 * * * systemctl restart pmproxy.service && systemctl restart pmcd.service &&  systemctl restart pmlogger.service

Attached screenshot: Screenshot from 2022-03-15 15-54-33 (the memory usage graph described above)

pcp version 5.3.6
Redis server v=6.2.6 sha=00000000:0 malloc=jemalloc-5.1.0 bits=64 build=9c9e426e2f96cc51
Grafana v8.3.6 (bf8766bbf2)
Grafana-pcp 2.3.1

natoscott commented 2 years ago

@rfenouil how many remote hosts are you logging there? (pcp(1) output will show it, in the 'pmlogger' section)

| I'm not sure I really need it

Our use of Redis is primarily for the situation where you are using grafana-pcp and recording data locally; the Redis instance indexes timeseries data (for one or many remote hosts) for fast lookups. You can tweak how much data is held in memory using pmproxy.conf settings too; it may be that the defaults are too aggressive in your situation.
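
If it helps, the settings alluded to here also live in /etc/pcp/pmproxy/pmproxy.conf; around 5.3.x the [pmseries] section carries Redis stream retention knobs roughly along these lines (a sketch only; confirm the exact names and defaults against the comments in your shipped file and pmproxy(1)):

[pmseries]
# how long (seconds) and how many samples to keep per metric stream in Redis;
# values here are illustrative, smaller numbers mean less data held in memory
stream.expire = 86400
stream.maxlen = 8640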