percona / pmm

Percona Monitoring and Management: an open source database monitoring, observability and management tool
https://www.percona.com/software/database-tools/percona-monitoring-and-management
GNU Affero General Public License v3.0

pmm-agent.service causes `Too many open files` crash on MongoDB and memory leaks on MariaDB #3262

Open Bobzemob opened 1 month ago

Bobzemob commented 1 month ago

Description

MongoDB

The pmm-agent.service on a MongoDB cluster opened 40,000+ pages, causing MongoDB to crash with a `Too many open files` error. This behavior occurred on a MongoDB server cluster running version 6.0.18.

MariaDB

The pmm-agent.service on several MariaDB servers caused errors that coincided with the MySQL process grabbing multiple GB of memory and not releasing it until the MySQL process was restarted. After the restart, the memory growth continued on servers that had the pmm-agent.service active and stopped on the ones that had it disabled. (See logs below.) Affected servers were running MariaDB version 10.3.39.

Expected Results

The PMM agent should not cause crashes or memory leaks.

Actual Results

The PMM agent opened too many pages on the MongoDB servers, causing the MongoDB process to crash. Disabling pmm-agent caused the number of open pages to drop from 40,000+ to around 500.
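One way to reproduce this kind of count is to list the process's file descriptor directory. This is a minimal sketch, assuming a Linux host and assuming that the "open pages" figure refers to open file descriptors of the mongod process (the report does not say how it was measured). The helper name `count_open_fds` and the `proc_root` parameter are hypothetical.

```python
from pathlib import Path


def count_open_fds(pid: int, proc_root: str = "/proc") -> int:
    """Count the entries in <proc_root>/<pid>/fd.

    On Linux each entry is one open file descriptor (files, sockets, pipes).
    Reading another user's process usually requires root.
    """
    fd_dir = Path(proc_root) / str(pid) / "fd"
    return sum(1 for _ in fd_dir.iterdir())
```

Calling `count_open_fds(<mongod pid>)` before and after stopping pmm-agent would show whether the descriptor count follows the agent, and the result can be compared against the limit the mongod service runs with.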

The PMM agent caused MariaDB to allocate multiple GB of memory and not release it, eventually leading to a crash. Disabling the pmm-agent service stopped this behavior.
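The memory growth can be tracked by sampling the resident set size of the database process over time. This is a minimal sketch, assuming a Linux host where `/proc/<pid>/status` exposes a `VmRSS` line. The function names are hypothetical, and this is not how the figures in this report were collected.

```python
import re
from pathlib import Path
from typing import Optional


def parse_vm_rss_kb(status_text: str) -> Optional[int]:
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kb(pid: int) -> Optional[int]:
    """Read the resident set size of a running process (Linux only)."""
    return parse_vm_rss_kb(Path(f"/proc/{pid}/status").read_text())
```

Sampling `read_rss_kb(<mysqld pid>)` at a fixed interval, once with pmm-agent running and once with it stopped, would give a like-for-like comparison of the two cases described above.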

Version

PMM Server 2.43.1
PMM Agent 2.43.1-6

Steps to reproduce

No response

Relevant logs

------- MariaDB-DB1 -------

Oct 19 20:26:09 mariadb-1 pmm-agent[7740]: time="2024-10-19T20:26:09.537-04:00" level=warning msg="Action terminated with error: Error 1193 (HY000): Unknown system variable 'binlog_expire_logs_seconds'\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runner.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695" component=runner id=/action_id/3ee7b8a6-c981-4a8c-8588-9386a2f2f49d type=mysql-query-select

Oct 19 20:26:48 mariadb-2 pmm-agent[7740]: time="2024-10-19T20:26:48.992-04:00" level=warning msg="Action terminated with error: Error 1146 (42S02): Table 'performance_schema.global_variables' doesn't exist\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runner.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695" component=runner id=/action_id/c5a00902-0466-48bd-9737-e96481ba336a type=mysql-query-select

Oct 19 20:27:08 mariadb-1 pmm-agent[7740]: time="2024-10-19T20:27:08.295-04:00" level=warning msg="Action terminated with error: Error 1193 (HY000): Unknown system variable 'default_password_lifetime'\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runner.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695" component=runner id=/action_id/872cc710-2347-406a-9ff4-2dc860098e1d type=mysql-query-select

Oct 19 20:27:08 maraidb-1 pmm-agent[7740]: time="2024-10-19T20:27:08.303-04:00" level=warning msg="Action terminated with error: Error 1193 (HY000): Unknown system variable 'default_password_lifetime'\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runner.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695" component=runner id=/action_id/4c68b175-81ee-477a-9843-c8bca1b99c1b type=mysql-query-select

------- MariaDB-DB2 -------

Oct 19 20:26:03 wc-demo-db2 pmm-agent[1061]: time="2024-10-19T20:26:03.262-04:00" level=warning msg="Action terminated with error: Error 1193 (HY000): Unknown system variable 'binlog_expire_logs_seconds'\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runner.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695" component=runner id=/action_id/edf95dab-c4af-4f03-9ffb-aa33a4a0b7dc type=mysql-query-select

Oct 19 20:26:42 wc-demo-db2 pmm-agent[1061]: time="2024-10-19T20:26:42.635-04:00" level=warning msg="Action terminated with error: Error 1146 (42S02): Table 'performance_schema.global_variables' doesn't exist\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runner.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695" component=runner id=/action_id/8881d5c2-e578-456c-8743-6b3a7078a7c7 type=mysql-query-select

Oct 19 20:27:01 wc-demo-db2 pmm-agent[1061]: time="2024-10-19T20:27:01.915-04:00" level=warning msg="Action terminated with error: Error 1193 (HY000): Unknown system variable 'default_password_lifetime'\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runner.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695" component=runner id=/action_id/a4bca5b4-b0a7-412c-985f-2be52b6f1cbc type=mysql-query-select

Oct 19 20:27:01 wc-demo-db2 pmm-agent[1061]: time="2024-10-19T20:27:01.923-04:00" level=warning msg="Action terminated with error: Error 1193 (HY000): Unknown system variable 'default_password_lifetime'\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runner.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695" component=runner id=/action_id/374783d2-aeec-495d-a68b-9a521217d66c type=mysql-query-select

------- MariaDB-DB3 -------

Oct 19 20:17:32 wc-demo-db3 pmm-agent[930]: time="2024-10-19T20:17:32.657-04:00" level=warning msg="Action terminated with error: Error 1146 (42S02): Table 'performance_schema.replication_connection_configuration' doesn't exist\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runner.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695" component=runner id=/action_id/403df0eb-ac90-419d-9c38-0bf8e7d1e261 type=mysql-query-select

Oct 19 20:22:34 wc-demo-db3 pmm-agent[930]: time="2024-10-19T20:22:34.654-04:00" level=warning msg="Action terminated with error: context deadline exceeded\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:81\ngithub.com/percona/pmm/agent/runner.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695" component=runner id=/action_id/8b49d229-5348-4ed3-8c79-12aa241e8dfa type=mysql-query-select

Oct 19 20:26:13 wc-demo-db3 pmm-agent[930]: time="2024-10-19T20:26:13.820-04:00" level=warning msg="Action terminated with error: Error 1193 (HY000): Unknown system variable 'binlog_expire_logs_seconds'\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runner.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695" component=runner id=/action_id/01bdc6f9-7c8b-4045-b495-d27e255656d8 type=mysql-query-select

Oct 19 20:26:53 wc-demo-db3 pmm-agent[930]: time="2024-10-19T20:26:53.193-04:00" level=warning msg="Action terminated with error: Error 1146 (42S02): Table 'performance_schema.global_variables' doesn't exist\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runner.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695" component=runner id=/action_id/e8d6680c-8af9-4cf5-8cfb-7f7f0ce07e57 type=mysql-query-select

Oct 19 20:27:12 wc-demo-db3 pmm-agent[930]: time="2024-10-19T20:27:12.498-04:00" level=warning msg="Action terminated with error: Error 1193 (HY000): Unknown system variable 'default_password_lifetime'\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runner.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695" component=runner id=/action_id/fff68a5f-1185-441d-944a-94eaa8521a55 type=mysql-query-select

Oct 19 20:27:12 wc-demo-db3 pmm-agent[930]: time="2024-10-19T20:27:12.506-04:00" level=warning msg="Action terminated with error: Error 1193 (HY000): Unknown system variable 'default_password_lifetime'\ngithub.com/percona/pmm/agent/runner/actions.(*mysqlQuerySelectAction).Run\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/actions/mysql_query_select_action.go:75\ngithub.com/percona/pmm/agent/runner.(*Runner).handleAction.func1\n\t/tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:382\nruntime/pprof.Do\n\t/usr/local/go/src/runtime/pprof/runtime.go:51\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695" component=runner id=/action_id/80935cc1-dc5a-4352-b8b3-c3d6b9894e98 type=mysql-query-select
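The MariaDB entries above repeat a small set of messages, so tallying them makes the pattern easier to see. This is a minimal sketch, assuming each journal entry sits on a single line (as journald emits it, before any terminal wrapping) and that the error text ends at the first literal `\n` that starts the Go stack trace. The names `ACTION_ERROR` and `tally_action_errors` are hypothetical.

```python
import re
from collections import Counter

# Capture the text between the fixed prefix and the first literal
# backslash-n sequence, which is where the stack trace begins in msg="...".
ACTION_ERROR = re.compile(r'msg="Action terminated with error: (.*?)\\n')


def tally_action_errors(lines):
    """Count distinct 'Action terminated with error' messages in log lines."""
    counts = Counter()
    for line in lines:
        match = ACTION_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of `journalctl -u pmm-agent --no-pager` output and printing `counts.most_common()` would list which queries the server rejects most often, for example the unknown system variables and missing `performance_schema` tables seen above.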

------- MongoDB-DB1 -------

Oct 20 23:28:08 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:08.156+00:00" level=error msg="couldn't create system.profile iterator, reason: server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }" agentID=/agent_id/[AGENT_ID] component=agent-builtin db=local type=qan_mongodb_profiler_agent
Oct 20 23:28:08 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:08.156+00:00" level=error msg="couldn't create system.profile iterator, reason: server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }" agentID=/agent_id/[AGENT_ID] component=agent-builtin db=DatabaseProd type=qan_mongodb_profiler_agent
Oct 20 23:28:08 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:08.156+00:00" level=error msg="couldn't create system.profile iterator, reason: server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }" agentID=/agent_id/[AGENT_ID] component=agent-builtin db=Database type=qan_mongodb_profiler_agent
Oct 20 23:28:09 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:09.479+00:00" level=error msg="time=\"2024-10-20T23:28:09Z\" level=error msg=\"Registry - Cannot get node type to check if this is a mongos : server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }\"" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
Oct 20 23:28:10 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:10.010+00:00" level=error msg="time=\"2024-10-20T23:28:10Z\" level=error msg=\"Registry - Cannot get node type to check if this is a mongos : server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }\"" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
Oct 20 23:28:10 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:10.157+00:00" level=error msg="couldn't create system.profile iterator, reason: server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }" agentID=/agent_id/[AGENT_ID] component=agent-builtin db=DatabaseProd type=qan_mongodb_profiler_agent
Oct 20 23:28:10 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:10.157+00:00" level=error msg="couldn't create system.profile iterator, reason: server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }" agentID=/agent_id/[AGENT_ID] component=agent-builtin db=Database type=qan_mongodb_profiler_agent
Oct 20 23:28:10 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:10.157+00:00" level=error msg="couldn't create system.profile iterator, reason: server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }" agentID=/agent_id/[AGENT_ID] component=agent-builtin db=local type=qan_mongodb_profiler_agent
Oct 20 23:28:10 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:10.384+00:00" level=error msg="time=\"2024-10-20T23:28:10Z\" level=error msg=\"error while checking mongodb connection: server selection error: context canceled, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }. mongo_up is set to 0\" collector=general" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
Oct 20 23:28:10 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:10.401+00:00" level=error msg="time=\"2024-10-20T23:28:10Z\" level=error msg=\"Cannot get node type: server selection error: context canceled, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }\" component=diagnosticDataCollector" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
Oct 20 23:28:10 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:10.477+00:00" level=info msg="2024-10-20T23:28:10.476Z\twarn\tVictoriaMetrics/lib/promscrape/scrapework.go:387\tcannot scrape target \"http://[IP_ADDRESS]/metrics?collect%5B%5D=diagnosticdata&collect%5B%5D=replicasetstatus&collect%5B%5D=topmetrics\" ({agent_id=\"/agent_id/[AGENT_ID]\",agent_type=\"mongodb_exporter\",cluster=\"[CLUSTER_NAME]\",instance=\"/agent_id/[AGENT_ID]\",job=\"mongodb_exporter_agent_id_[AGENT_ID]_hr\",machin..node_id=\"/node_id/c0273607-9363-4359-b2f9-d29b6ffc082d\",node_name=\"mongo-db1\",node_type=\"generic\",replication_set=\"[REP_SET]\",service_id=\"/service_id/[SERVICE_ID]\",service_name=\"mongo-db1-mongodb\",service_type=\"mongodb\"}) 1 out of 1 times during -promscrape.suppressScrapeErrorsDelay=0s; the last error: error when scraping \"http://[IP_ADDRESS]/metrics?collect%5B%5D=diagnosticdata&collect%5B%5D=replicasetstatus&collect%5B%5D=topmetrics\" with timeout 4.5s: timeout" agentID=/agent_id/[AGENT_ID] component=agent-process type=vm_agent
Oct 20 23:28:10 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:10.478+00:00" level=error msg="time=\"2024-10-20T23:28:10Z\" level=error msg=\"error while checking mongodb connection: server selection error: context canceled, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp 172.27.238.77:27017: connect: connection refused }, ] }. mongo_up is set to 0\" collector=general" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
Oct 20 23:28:10 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:10.478+00:00" level=error msg="time=\"2024-10-20T23:28:10Z\" level=error msg=\"Cannot get node type: server selection error: context canceled, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp 172.27.238.77:27017: connect: connection refused }, ] }\" component=diagnosticDataCollector" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
Oct 20 23:28:10 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:10.972+00:00" level=error msg="time=\"2024-10-20T23:28:10Z\" level=error msg=\"Registry - Cannot get node type to check if this is a mongos : server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp 172.27.238.77:27017: connect: connection refused }, ] }\"" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
Oct 20 23:28:11 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:11.972+00:00" level=error msg="time=\"2024-10-20T23:28:11Z\" level=error msg=\"error while checking mongodb connection: server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp 172.27.238.77:27017: connect: connection refused }, ] }. mongo_up is set to 0\" collector=general" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
Oct 20 23:28:12 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:12.158+00:00" level=error msg="couldn't create system.profile iterator, reason: server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }" agentID=/agent_id/[AGENT_ID] component=agent-builtin db=DatabaseProd type=qan_mongodb_profiler_agent
Oct 20 23:28:12 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:12.158+00:00" level=error msg="couldn't create system.profile iterator, reason: server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }" agentID=/agent_id/[AGENT_ID] component=agent-builtin db=local type=qan_mongodb_profiler_agent
Oct 20 23:28:12 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:12.158+00:00" level=error msg="couldn't create system.profile iterator, reason: server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }" agentID=/agent_id/[AGENT_ID] component=agent-builtin db=Database type=qan_mongodb_profiler_agent
Oct 20 23:28:30 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:30.460+00:00" level=error msg="time=\"2024-10-20T23:28:30Z\" level=error msg=\"Registry - Cannot get node type to check if this is a mongos : server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp 172.27.238.77:27017: connect: connection refused }, ] }\"" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
Oct 20 23:28:30 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:30.477+00:00" level=info msg="2024-10-20T23:28:30.477Z\twarn\tVictoriaMetrics/lib/promscrape/scrapework.go:387\tcannot scrape target \"http://[IP_ADDRESS]/metrics?collect%5B%5D=diagnosticdata&collect%5B%5D=replicasetstatus&collect%5B%5D=topmetrics\" ({agent_id=\"/agent_id/[AGENT_ID]\",agent_type=\"mongodb_exporter\",cluster=\"[CLUSTER_NAME]\",instance=\"/agent_id/[AGENT_ID]\",job=\"mongodb_exporter_agent_id_[AGENT_ID]_hr\",machin..node_id=\"/node_id/c0273607-9363-4359-b2f9-d29b6ffc082d\",node_name=\"mongo-db1\",node_type=\"generic\",replication_set=\"[REP_SET]\",service_id=\"/service_id/16332cef-07e6-424e-b998-7aa6884471ba\",service_name=\"[SERVICE_NAME]\",service_type=\"mongodb\"}) 1 out of 1 times during -promscrape.suppressScrapeErrorsDelay=0s; the last error: error when scraping \"http://127.0.0.1:42000/metrics?collect%5B%5D=diagnosticdata&collect%5B%5D=replicasetstatus&collect%5B%5D=topmetrics\" with timeout 4.5s: timeout" agentID=/agent_id/[AGENT_ID] component=agent-process type=vm_agent
Oct 20 23:28:30 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:30.656+00:00" level=error msg="time=\"2024-10-20T23:28:30Z\" level=error msg=\"error while checking mongodb connection: server selection error: context canceled, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }. mongo_up is set to 0\" collector=general" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
Oct 20 23:28:30 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:30.673+00:00" level=error msg="time=\"2024-10-20T23:28:30Z\" level=error msg=\"error while checking mongodb connection: server selection error: context canceled, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp [IP_ADDRESS]: connect: connection refused }, ] }. mongo_up is set to 0\" collector=general" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
Oct 20 23:28:30 mongo-db1 pmm-agent[4189141]: time="2024-10-20T23:28:30.690+00:00" level=error msg="time=\"2024-10-20T23:28:30Z\" level=error msg=\"Cannot get node type: server selection error: context canceled, current topology: { Type: Single, Servers: [{ Addr: mongo-db1.com:27017, Type: Unknown, Last error: dial tcp 172.27.238.77:27017: connect: connection refused }, ] }\" component=diagnosticDataCollector" agentID=/agent_id/[AGENT_ID] component=agent-process type=mongodb_exporter
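The MongoDB entries come from several sub-agents at once (the QAN profiler, mongodb_exporter, and vm_agent), all failing against a mongod that refuses connections. Grouping them by the trailing `type=` field shows which component is retrying most often. This is a minimal sketch, assuming one journal entry per line. The names `LEVEL`, `AGENT_TYPE`, and `tally_by_agent_type` are hypothetical.

```python
import re
from collections import Counter

# The first level= on the line is pmm-agent's own level; any later one is
# part of a nested exporter message quoted inside msg="...".
LEVEL = re.compile(r"\blevel=(\w+)")
# The sub-agent name is the last field on the line. The \b keeps labels such
# as node_type= or agent_type= inside the message from matching.
AGENT_TYPE = re.compile(r"\btype=([\w-]+)\s*$")


def tally_by_agent_type(lines):
    """Count log lines per (agent type, level) pair."""
    counts = Counter()
    for line in lines:
        level = LEVEL.search(line)
        agent_type = AGENT_TYPE.search(line)
        if level and agent_type:
            counts[(agent_type.group(1), level.group(1))] += 1
    return counts
```

A high count for one agent type over a short window would suggest that component is reconnecting in a tight loop while mongod is down, which may be worth correlating with the descriptor growth reported above.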


BupycHuk commented 1 month ago

Hi, we are working on the MongoDB memory leak. Regarding MariaDB, we need to investigate the cause of that problem.

wreiske commented 4 weeks ago

This is a major issue for us, causing multiple replica set members to crash. We had to stop pmm-agent on all of our MongoDB replica set servers due to this bug.

Also seeing this on MariaDB.

BupycHuk commented 4 weeks ago

@wreiske we are fixing the MongoDB issue in 2.43.2 and releasing it today. Please recheck MariaDB after upgrading. If the problem persists, please create a task at jira.percona.com.

wreiske commented 3 weeks ago

Get:3 http://repo.percona.com/percona/apt bullseye/main amd64 pmm2-client amd64 2.42.0-6.bullseye [88.0 MB]

Just ran an apt update, still latest available is:

pmm-agent --version
ProjectName: pmm-agent
Version: 2.42.0
PMMVersion: 2.42.0
Timestamp: 2024-06-06 15:28:56 (UTC)
FullCommit: 74e57527735bd062c4bd37adbd89c31bb14ebc15

I'll check back later today.
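Checking whether a host already has the fixed release can be scripted from the `pmm-agent --version` output shown above. This is a minimal sketch, assuming the fix ships in 2.43.2 as stated earlier in this thread and that the output keeps a `Version: X.Y.Z` line. The function names and the `fixed_in` default are hypothetical.

```python
import re


def parse_version(version_output: str) -> tuple:
    """Extract the 'Version: X.Y.Z' line as a tuple of ints."""
    match = re.search(
        r"^Version:\s*(\d+)\.(\d+)\.(\d+)", version_output, re.MULTILINE
    )
    if match is None:
        raise ValueError("no 'Version:' line found")
    return tuple(int(part) for part in match.groups())


def has_fix(version_output: str, fixed_in=(2, 43, 2)) -> bool:
    """True if the reported agent version is at least the fixed release."""
    return parse_version(version_output) >= fixed_in
```

Comparing integer tuples avoids the string-comparison trap where "2.9.0" would sort after "2.43.2". The `^Version:` anchor skips the `PMMVersion:` line, which starts with a different prefix.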