timescale / prometheus-postgresql-adapter

Use PostgreSQL as a remote storage database for Prometheus
Apache License 2.0

Sawtooth memory usage in PostgreSQL - kernel OOM killer #19

Closed gdhgdhgdh closed 6 years ago

gdhgdhgdh commented 6 years ago

Hi, I've been using timescaledb 0.9.1 for a few days in conjunction with pg_prometheus and the prometheus-postgresql-adapter.

It's working fine - thank you for the software. I just have one concern regarding the memory usage of PostgreSQL 10.3 itself:

[screenshot: PostgreSQL memory usage climbing in a sawtooth pattern]

The server is an AWS t2.medium, so it has 3.75GB total RAM. prometheus-postgresql-adapter and postgres_exporter are the only PostgreSQL clients.

 1444 ?        Ss     0:01 /usr/pgsql-10/bin/postmaster -D /var/lib/pgsql/10/data/
 1446 ?        Ss     0:00  \_ postgres: logger process
15457 ?        Ss     0:01  \_ postgres: checkpointer process
15458 ?        Ss     0:00  \_ postgres: writer process
15459 ?        Ss     0:00  \_ postgres: wal writer process
15460 ?        Ss     0:00  \_ postgres: autovacuum launcher process
15461 ?        Ss     0:01  \_ postgres: stats collector process
15462 ?        Ss     0:00  \_ postgres: bgworker: logical replication launcher
15467 ?        Ss     0:02  \_ postgres: postgres postgres [local] idle
15469 ?        Ss     0:32  \_ postgres: postgres metrics [local] idle
15470 ?        Ss     0:32  \_ postgres: postgres metrics [local] idle
15471 ?        Ss     0:32  \_ postgres: postgres metrics [local] idle
15472 ?        Ss     0:32  \_ postgres: postgres metrics [local] idle
15474 ?        Ss     0:32  \_ postgres: postgres metrics [local] idle
15478 ?        Ss     0:32  \_ postgres: postgres metrics [local] idle
15479 ?        Ss     0:31  \_ postgres: postgres metrics [local] idle
15480 ?        Ss     0:31  \_ postgres: postgres metrics [local] idle
15482 ?        Ss     0:31  \_ postgres: postgres metrics [local] idle
15483 ?        Ss     0:31  \_ postgres: postgres metrics [local] idle
 1749 ?        Ssl   12:15 /usr/bin/prometheus-postgresql-adapter -pg.host=/var/run/postgresql -pg.database=metrics
15463 ?        Ssl    0:03 /opt/monitoring/prometheus-postgresql-exporter/postgres_exporter

PostgreSQL was installed following TimescaleDB's guide, using the PGDG packages. I have not touched postgresql.conf, so shared_buffers is still at the default 128MB.

However, something is causing PostgreSQL to consume more and more memory until the kernel kills it...

[May10 22:51] postmaster invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[  +0.005899] postmaster cpuset=/ mems_allowed=0
[  +0.004581] CPU: 1 PID: 13631 Comm: postmaster Kdump: loaded Not tainted 3.10.0-862.el7.x86_64 #1
[  +0.006778] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[  +0.004775] Call Trace:
[  +0.003090]  [<ffffffffa0d0d768>] dump_stack+0x19/0x1b
[  +0.004531]  [<ffffffffa0d090ea>] dump_header+0x90/0x229
[  +0.004263]  [<ffffffffa08d7c1b>] ? cred_has_capability+0x6b/0x120
[  +0.004667]  [<ffffffffa0797904>] oom_kill_process+0x254/0x3d0

As expected, the PostgreSQL logs show a little confusion as it restarts...

2018-05-10 22:51:11.866 UTC [13635] FATAL:  the database system is in recovery mode
2018-05-10 22:51:11.867 UTC [13635] LOG:  could not send data to client: Broken pipe
2018-05-10 22:51:11.892 UTC [1444] LOG:  all server processes terminated; reinitializing
2018-05-10 22:51:11.951 UTC [13637] LOG:  database system was interrupted; last known up at 2018-05-10 22:50:53 UTC
2018-05-10 22:51:11.986 UTC [13637] LOG:  database system was not properly shut down; automatic recovery in progress
2018-05-10 22:51:11.993 UTC [13637] LOG:  redo starts at 2/13444650
2018-05-10 22:51:12.071 UTC [13637] LOG:  invalid record length at 2/147A8520: wanted 24, got 0
2018-05-10 22:51:12.071 UTC [13637] LOG:  redo done at 2/147A84F8
2018-05-10 22:51:12.071 UTC [13637] LOG:  last completed transaction was at log time 2018-05-10 22:51:11.55548+00
2018-05-10 22:51:12.160 UTC [1444] LOG:  database system is ready to accept connections

The machine is 90-95% idle with plenty of CPU credits, and there are literally only 6 machines sending node_exporter stats to the adapter every 15 seconds - super low load!

Help! :)

gdhgdhgdh commented 6 years ago

Closed in favour of https://github.com/timescale/timescaledb/issues/532 - it does seem to be a memory leak when the TimescaleDB extension is used.

gdhgdhgdh commented 6 years ago

Reopening since this is definitely related to the adapter. When I restart the adapter, RAM usage on the PostgreSQL server returns to normal levels. I'm currently running the adapter on a 2-hour restart cron :(

[screenshot: server RAM usage dropping back to normal after each adapter restart]
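
For reference, the workaround is just a crontab entry along these lines (the service name here is hypothetical; adjust it to however the adapter is started on your system):

0 */2 * * * /usr/bin/systemctl restart prometheus-postgresql-adapter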

nbluis commented 6 years ago

I'm having the same problem here.

y2sarakawa commented 6 years ago

timescale/pg_prometheus:0.2 refers to timescale/timescaledb:0.9.1, and timescale/timescaledb:0.9.1 has a memory leak that keeps growing for as long as a connection stays alive. That bug was silently fixed by https://github.com/timescale/timescaledb/pull/516.

What Docker users need is for timescale/pg_prometheus to refer to timescale/timescaledb:0.9.2, and for that image to be tagged and released.

Whether or not there is a memory leak bug in TimescaleDB, it would be a good idea for prometheus-postgresql-adapter to have an option that sets the maximum lifetime of a connection.
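
As a rough sketch of what such an option could look like, assuming the adapter reaches PostgreSQL through Go's database/sql connection pool (the driver and DSN below are illustrative assumptions, not the adapter's actual code):

// Sketch: bound how long any one PostgreSQL connection (and therefore
// any one backend process) can live, so per-session memory growth is
// capped even if the extension leaks.
package main

import (
    "database/sql"
    "log"
    "time"

    _ "github.com/lib/pq" // driver choice is an assumption for this sketch
)

func main() {
    // Same socket and database the adapter is started with in this issue.
    db, err := sql.Open("postgres", "host=/var/run/postgresql dbname=metrics sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Recycle each connection after at most one hour; the pool then opens
    // a fresh connection (new backend, fresh memory) to replace it.
    db.SetConnMaxLifetime(time.Hour)
    db.SetMaxOpenConns(10)
    db.SetMaxIdleConns(10)
}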

erimatnor commented 6 years ago

@gdhgdhgdh We've submitted a PR to fix a per-session memory issue in TimescaleDB: https://github.com/timescale/timescaledb/pull/575

Can you rerun your tests with this code and see if it fixes your issue?

svenklemm commented 6 years ago

I ran into this issue as well, and it seems to be fixed by https://github.com/timescale/timescaledb/pull/575

red: image from Docker Hub; yellow: timescaledb master; green: PR 575

[graph: memory usage over time for the three builds]

mfreed commented 6 years ago

TimescaleDB 0.10.1 has now been released (https://github.com/timescale/timescaledb/releases/tag/0.10.1). We believe, and our own testing shows, that it fixes this problem; we'll close out this issue in a few days unless we hear otherwise.
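
For anyone with an existing database, updating should just be a matter of installing the new package and then running the following as the first command in a fresh session:

ALTER EXTENSION timescaledb UPDATE;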

Thanks @gdhgdhgdh !

gdhgdhgdh commented 6 years ago

Awesome, thanks. I wasn't able to test it myself for dull reasons - all the best :)