percona / pmm

Percona Monitoring and Management: an open source database monitoring, observability and management tool
https://www.percona.com/software/database-tools/percona-monitoring-and-management
GNU Affero General Public License v3.0

pmm-client (Docker container) monitoring Postgres bloats memory, and a process inside the container is killed with an OOM #2563

Open Yuskovich opened 1 year ago

Yuskovich commented 1 year ago

Description

The pmm-client Docker container has a 32GB RAM limit. At random intervals, usually about half an hour, the container reaches the 32GB limit and the OOM killer terminates postgres_exporter inside the container. The container runs on the same host as the monitored PostgreSQL.

OS (monitored system): Ubuntu 20.04.4 LTS (Focal Fossa)
Linux kernel (monitored system): Linux HOSTNAME_REMOVED 5.4.0-164-generic #181-Ubuntu SMP Fri Sep 1 13:41:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Docker image of pmm-client: percona/pmm-client:2.39.0 (same problem on 2.33, 2.36, 2.40.1)
PMM Server version: 2.39 (same problem on 2.33, 2.36, 2.40.1)
Monitored service: PostgreSQL 14.9
Total RAM on monitored PostgreSQL server: 128GB

Available memory on the monitored host at the moment the limit is reached is about 59GB:

# free -hw
              total        used        free      shared     buffers       cache   available
Mem:          125Gi        31Gi       841Mi        33Gi       387Mi        93Gi        59Gi
Swap:            0B          0B          0B
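To track that figure automatically while reproducing, the "available" column can be pulled out of `free`'s output; a minimal sketch (the helper name `mem_available` is made up here, and it only assumes the column layout shown above):

```shell
# mem_available: print the "available" column (last field) of the Mem line
# from `free -hw`-style output read on stdin.
mem_available() {
  awk '/^Mem:/ {print $NF}'
}
# Usage: free -hw | mem_available
```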

Expected Results

The pmm-client container does not reach the 32GB RAM limit

Actual Results

The pmm-client container reaches its 32GB memory limit and postgres_exporter is OOM-killed

Version

PMM Server v2.39, PMM client 2.39

Steps to reproduce

1. Create a new Postgres cluster
2. Create many schemas (in our case about 10k)
3. Create many empty tables (in our case about 70k)
4. Deploy pmm-client in Docker and add a PostgreSQL service
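The schema/table creation part of the setup can be scripted. A rough sketch (the counts, the `s<i>`/`t<j>` naming, the `bloat.sql` filename, and the psql connection string are all assumptions, not taken from the report):

```shell
# Generate DDL for 10k schemas with 7 empty tables each (~70k tables total),
# then feed it to psql in one batch rather than one round-trip per statement.
SCHEMAS=10000
TABLES_PER_SCHEMA=7
i=1
while [ "$i" -le "$SCHEMAS" ]; do
  echo "CREATE SCHEMA s$i;"
  j=1
  while [ "$j" -le "$TABLES_PER_SCHEMA" ]; do
    echo "CREATE TABLE s$i.t$j ();"   # empty table, matching the report
    j=$((j + 1))
  done
  i=$((i + 1))
done > bloat.sql
# psql "postgres://user:PASSWORD@HOST:5432/testdb" -f bloat.sql  # connection string is a placeholder
```

After loading, add the service as usual (pmm-admin add postgresql ...) and watch the container's memory.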

Relevant logs

pmm-client Docker container logs from startup until the RAM limit is first reached:
INFO[2023-10-20T08:44:51.625+00:00] Run setup: false Sidecar mode: false          component=entrypoint
INFO[2023-10-20T08:44:51.626+00:00] Starting 'pmm-admin run'...                   component=entrypoint
INFO[2023-10-20T08:44:51.726+00:00] Loading configuration file /usr/local/percona/pmm2/config/pmm-agent.yaml.  component=main
INFO[2023-10-20T08:44:51.726+00:00] Using /usr/local/percona/pmm2/exporters/node_exporter  component=main
INFO[2023-10-20T08:44:51.726+00:00] Using /usr/local/percona/pmm2/exporters/mysqld_exporter  component=main
INFO[2023-10-20T08:44:51.726+00:00] Using /usr/local/percona/pmm2/exporters/mongodb_exporter  component=main
INFO[2023-10-20T08:44:51.726+00:00] Using /usr/local/percona/pmm2/exporters/postgres_exporter  component=main
INFO[2023-10-20T08:44:51.726+00:00] Using /usr/local/percona/pmm2/exporters/proxysql_exporter  component=main
INFO[2023-10-20T08:44:51.726+00:00] Using /usr/local/percona/pmm2/exporters/rds_exporter  component=main
INFO[2023-10-20T08:44:51.726+00:00] Using /usr/local/percona/pmm2/exporters/azure_exporter  component=main
INFO[2023-10-20T08:44:51.726+00:00] Using /usr/local/percona/pmm2/exporters/vmagent  component=main
INFO[2023-10-20T08:44:51.726+00:00] Runner capacity set to 32.                    component=runner
INFO[2023-10-20T08:44:51.726+00:00] Loading configuration file /usr/local/percona/pmm2/config/pmm-agent.yaml.  component=main
INFO[2023-10-20T08:44:51.727+00:00] Using /usr/local/percona/pmm2/exporters/node_exporter  component=main
INFO[2023-10-20T08:44:51.727+00:00] Using /usr/local/percona/pmm2/exporters/mysqld_exporter  component=main
INFO[2023-10-20T08:44:51.727+00:00] Using /usr/local/percona/pmm2/exporters/mongodb_exporter  component=main
INFO[2023-10-20T08:44:51.727+00:00] Using /usr/local/percona/pmm2/exporters/postgres_exporter  component=main
INFO[2023-10-20T08:44:51.727+00:00] Using /usr/local/percona/pmm2/exporters/proxysql_exporter  component=main
INFO[2023-10-20T08:44:51.727+00:00] Using /usr/local/percona/pmm2/exporters/rds_exporter  component=main
INFO[2023-10-20T08:44:51.727+00:00] Using /usr/local/percona/pmm2/exporters/azure_exporter  component=main
INFO[2023-10-20T08:44:51.727+00:00] Using /usr/local/percona/pmm2/exporters/vmagent  component=main
ERRO[2023-10-20T08:44:52.995+00:00] ts=2023-10-20T08:44:52.932Z caller=diskstats_linux.go:264 level=error collector=diskstats msg="Failed to open directory, disabling udev device properties" path=/run/udev/data  agentID=/agent_id/8b47212b-ec90-4c24-9a1d-b8c4cc3eaa63 component=agent-process type=node_exporter
ERRO[2023-10-20T09:18:58.502+00:00] ts=2023-10-20T09:18:58.495Z caller=postgres_exporter.go:750 level=error err="Error opening connection to database (postgres://pmm:PASSWORD_REMOVED@HOST_REMOVED:PORT_REMOVED/postgres?connect_timeout=1&sslmode=disable): driver: bad connection"  agentID=/agent_id/ef46e805-0e0a-4246-9b94-d21be2e69ba7 component=agent-process type=postgres_exporter
WARN[2023-10-20T09:38:29.330+00:00] Process: exited: signal: killed.              agentID=/agent_id/ef46e805-0e0a-4246-9b94-d21be2e69ba7 component=agent-process type=postgres_exporter

dmesg log:
[Fri Oct 20 09:38:24 2023] pmm-agent invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[Fri Oct 20 09:38:24 2023] CPU: 0 PID: 2403132 Comm: pmm-agent Not tainted 5.4.0-164-generic #181-Ubuntu
[Fri Oct 20 09:38:24 2023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
[Fri Oct 20 09:38:24 2023] Call Trace:
[Fri Oct 20 09:38:24 2023]  dump_stack+0x6d/0x8b
[Fri Oct 20 09:38:24 2023]  dump_header+0x4f/0x1eb
[Fri Oct 20 09:38:24 2023]  oom_kill_process.cold+0xb/0x10
[Fri Oct 20 09:38:24 2023]  out_of_memory+0x1cf/0x500
[Fri Oct 20 09:38:24 2023]  mem_cgroup_out_of_memory+0xbd/0xe0
[Fri Oct 20 09:38:24 2023]  try_charge+0x77c/0x810
[Fri Oct 20 09:38:24 2023]  mem_cgroup_try_charge+0x71/0x190
[Fri Oct 20 09:38:24 2023]  __add_to_page_cache_locked+0x2ff/0x3f0
[Fri Oct 20 09:38:24 2023]  ? scan_shadow_nodes+0x30/0x30
[Fri Oct 20 09:38:24 2023]  add_to_page_cache_lru+0x4d/0xd0
[Fri Oct 20 09:38:24 2023]  pagecache_get_page+0x101/0x300
[Fri Oct 20 09:38:24 2023]  filemap_fault+0x6b2/0xa50
[Fri Oct 20 09:38:24 2023]  ? unlock_page_memcg+0x12/0x20
[Fri Oct 20 09:38:24 2023]  ? page_add_file_rmap+0xff/0x1a0
[Fri Oct 20 09:38:24 2023]  ? xas_load+0xd/0x80
[Fri Oct 20 09:38:24 2023]  ? xas_find+0x17f/0x1c0
[Fri Oct 20 09:38:24 2023]  ? filemap_map_pages+0x24c/0x380
[Fri Oct 20 09:38:24 2023]  ext4_filemap_fault+0x32/0x50
[Fri Oct 20 09:38:24 2023]  __do_fault+0x3c/0x170
[Fri Oct 20 09:38:24 2023]  do_fault+0x24b/0x640
[Fri Oct 20 09:38:24 2023]  __handle_mm_fault+0x4c5/0x7a0
[Fri Oct 20 09:38:24 2023]  handle_mm_fault+0xca/0x200
[Fri Oct 20 09:38:24 2023]  do_user_addr_fault+0x1f9/0x450
[Fri Oct 20 09:38:24 2023]  __do_page_fault+0x58/0x90
[Fri Oct 20 09:38:24 2023]  do_page_fault+0x2c/0xe0
[Fri Oct 20 09:38:24 2023]  do_async_page_fault+0x39/0x70
[Fri Oct 20 09:38:24 2023]  async_page_fault+0x34/0x40
[Fri Oct 20 09:38:24 2023] RIP: 0033:0x43730f
[Fri Oct 20 09:38:24 2023] Code: Bad RIP value.
[Fri Oct 20 09:38:24 2023] RSP: 002b:00007f8f94ff84f8 EFLAGS: 00010206
[Fri Oct 20 09:38:24 2023] RAX: ffffffffffffff92 RBX: 0000000000000000 RCX: 0000000000473d63
[Fri Oct 20 09:38:24 2023] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 000000000236c030
[Fri Oct 20 09:38:24 2023] RBP: 00007f8f94ff8538 R08: 0000000000000000 R09: 0000000000000000
[Fri Oct 20 09:38:24 2023] R10: 00007f8f94ff8528 R11: 0000000000000206 R12: 00007f8f94ff8528
[Fri Oct 20 09:38:24 2023] R13: 0000000000000013 R14: 000000c0001036c0 R15: 000000c000452000
[Fri Oct 20 09:38:24 2023] memory: usage 33554432kB, limit 33554432kB, failcnt 640017
[Fri Oct 20 09:38:24 2023] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[Fri Oct 20 09:38:24 2023] kmem: usage 90612kB, limit 9007199254740988kB, failcnt 0
[Fri Oct 20 09:38:24 2023] Memory cgroup stats for /docker/cc818c9a63d41a2755ed35aa07d5d06c357b5512972fc76d74941c1983731f9a:
[Fri Oct 20 09:38:24 2023] anon 34263621632
                           file 1167360
                           kernel_stack 1216512
                           slab 20291584
                           sock 0
                           shmem 0
                           file_mapped 0
                           file_dirty 0
                           file_writeback 0
                           anon_thp 4464836608
                           inactive_anon 0
                           active_anon 34263457792
                           inactive_file 0
                           active_file 0
                           unevictable 0
                           slab_reclaimable 11001856
                           slab_unreclaimable 9289728
                           pgfault 8587986
                           pgmajfault 71247
                           workingset_refault 1611621
                           workingset_activate 115170
                           workingset_nodereclaim 0
                           pgrefill 723458
                           pgscan 8194323
                           pgsteal 1630807
                           pgactivate 480414
                           pgdeactivate 599413
                           pglazyfree 0
                           pglazyfreed 0
                           thp_fault_alloc 1617
                           thp_collapse_alloc 0
[Fri Oct 20 09:38:24 2023] Tasks state (memory values in pages):
[Fri Oct 20 09:38:24 2023] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Fri Oct 20 09:38:24 2023] [2403075]  1002 2403075   178251      210    86016        0             0 pmm-agent-entry
[Fri Oct 20 09:38:24 2023] [2403121]  1002 2403121   349731     2272   274432        0             0 pmm-agent
[Fri Oct 20 09:38:24 2023] [2403137]  1002 2403137   180996     5064   163840        0             0 vmagent
[Fri Oct 20 09:38:24 2023] [2403139]  1002 2403139   181981     2540   159744        0             0 node_exporter
[Fri Oct 20 09:38:24 2023] [2403156]  1002 2403156  8558644  8354434 67321856        0             0 postgres_export
[Fri Oct 20 09:38:24 2023] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cc818c9a63d41a2755ed35aa07d5d06c357b5512972fc76d74941c1983731f9a,mems_allowed=0,oom_memcg=/docker/cc818c9a63d41a2755ed35aa07d5d06c357b5512972fc76d74941c1983731f9a,task_memcg=/docker/cc818c9a63d41a2755ed35aa07d5d06c357b5512972fc76d74941c1983731f9a,task=postgres_export,pid=2403156,uid=1002
[Fri Oct 20 09:38:24 2023] Memory cgroup out of memory: Killed process 2403156 (postgres_export) total-vm:34234576kB, anon-rss:33417736kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:65744kB oom_score_adj:0
[Fri Oct 20 09:38:28 2023] oom_reaper: reaped process 2403156 (postgres_export), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
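When watching for repeats of this kill, the victim task and its anonymous RSS can be pulled out of kernel log lines like the ones above. A small sketch (the helper name `oom_kills` is made up; the pattern only assumes the "Killed process ... anon-rss:...kB" message format shown in the dmesg output):

```shell
# oom_kills: summarize OOM-kill events from kernel log text on stdin.
# Matches lines like:
#   Memory cgroup out of memory: Killed process 2403156 (postgres_export) ... anon-rss:33417736kB ...
oom_kills() {
  grep -oE 'Killed process [0-9]+ \([^)]+\).*anon-rss:[0-9]+kB' \
    | sed -E 's/Killed process ([0-9]+) \(([^)]+)\).*anon-rss:([0-9]+)kB.*/pid=\1 task=\2 anon_rss_kb=\3/'
}
# Usage: dmesg | oom_kills
```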


BupycHuk commented 1 year ago

Hello @Yuskovich, we are working on fixing this problem; we released some improvements in PMM 2.40.1 and plan further improvements in PMM 2.41.0. Please upgrade to 2.40.1 and let us know whether it helps.

Yuskovich commented 1 year ago

Hello @BupycHuk, we have updated the server and agent to version 2.40.1. The issue still persists. We have also found a way to reproduce it:

BupycHuk commented 1 year ago

Got it, thank you. Please wait for 2.41.0; it should be fixed in the upcoming release.

BupycHuk commented 1 year ago

@Yuskovich what about the number of databases?