prometheus-community / postgres_exporter

A PostgreSQL metric exporter for Prometheus
Apache License 2.0

postgresql_exporter 0.14.0 leaks connections when queried simultaneously #921

Closed btmc closed 1 year ago

btmc commented 1 year ago

What did you do?

I run postgresql-exporter in an environment with three vmagents scraping the exporter.

They happen to scrape almost simultaneously every time: all three HTTP requests arrive before the first response starts being returned (I can see that in tcpdump).

On the exporter side I see multiple 'collector failed' errors on every scrape round, on random collector modules.

On the Postgres side I see the following:

On the first round of scrapes there are 3 new connections in Postgres: two of them have 'select version()' as their last query and stay idle, and one is functional. On every subsequent round of scrapes, 2 additional connections appear (the previous idle ones remain), which are also idle, while the first functional connection continues to be used.

I tried running exporter version 0.13.2 in the same vmagent setup and it was fine: there are two connections on the Postgres side, and they are reused.

Also, there are no leaks when I make HTTP requests one by one against version 0.14.0.

I guess it might be related to the sql.Open call in the instance.setup method, which is called on every incoming request in 0.14.0 but only once at collector initialization in 0.13.2.

https://github.com/prometheus-community/postgres_exporter/blob/v0.14.0/collector/instance.go#L46
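
To make the suspected failure mode concrete, here is a minimal sketch (not the exporter's actual code; the handler, DSN, and choice of lib/pq driver are assumptions for illustration): calling sql.Open on every request and never closing the returned *sql.DB leaves each request's pool holding an idle connection, so connections accumulate with every scrape round.

// Minimal sketch of the suspected leak pattern, not the exporter's real handler.
package main

import (
    "database/sql"
    "log"
    "net/http"

    _ "github.com/lib/pq" // driver choice assumed for this sketch
)

func metricsHandler(dsn string) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        db, err := sql.Open("postgres", dsn) // a brand-new pool on every scrape
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        // Missing: defer db.Close()
        // The pool created for this request keeps its idle connection to
        // Postgres alive after the scrape, so connections pile up.
        var version string
        if err := db.QueryRow("SELECT version()").Scan(&version); err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        w.Write([]byte(version + "\n"))
    }
}

func main() {
    http.HandleFunc("/metrics", metricsHandler("postgres://localhost/postgres?sslmode=disable"))
    log.Fatal(http.ListenAndServe(":9187", nil))
}

With three scrapers hitting /metrics at the same interval, this pattern matches the observation above: a couple of extra idle connections appear on every round and are never released.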

What did you expect to see?

Postgres connections are correctly handled.

What did you see instead? Under which circumstances?

Postgres connections are used up to the limit.

Environment

System information:

Linux 5.10.0-25-amd64 x86_64

postgres_exporter version:

postgres_exporter, version 0.14.0 (branch: HEAD, revision: c06e57db4e502696ab4e8b8898bb2a59b7b33a59)
  build user:       root@f2337de13240
  build date:       20230920-01:43:49
  go version:       go1.20.8
  platform:         linux/amd64
  tags:             netgo static_build

PostgreSQL version:

PostgreSQL 16.0 (Debian 16.0-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit

Logs:

ts=2023-09-21T15:09:19.429Z caller=collector.go:199 level=error msg="collector failed" name=database duration_seconds=0.072067468 err="sql: database is closed"
ts=2023-09-21T15:09:19.429Z caller=collector.go:199 level=error msg="collector failed" name=wal duration_seconds=0.055904477 err="sql: database is closed"
ts=2023-09-21T15:09:19.431Z caller=collector.go:199 level=error msg="collector failed" name=database duration_seconds=0.057682597 err="sql: database is closed"
ts=2023-09-21T15:09:29.426Z caller=collector.go:199 level=error msg="collector failed" name=replication_slot duration_seconds=0.067115499 err="sql: database is closed"
ts=2023-09-21T15:09:29.426Z caller=collector.go:199 level=error msg="collector failed" name=locks duration_seconds=0.066662661 err="sql: database is closed"
ts=2023-09-21T15:09:29.429Z caller=collector.go:199 level=error msg="collector failed" name=database duration_seconds=0.069763998 err="sql: database is closed"
sysadmind commented 1 year ago

Are you using the multi target feature (/probe) or the traditional /metrics endpoint?

To clarify, is this multiple systems scraping the exporter, which is connected to a single postgres server? Or are there multiple postgres servers?

btmc commented 1 year ago

> Are you using the multi target feature (/probe) or the traditional /metrics endpoint?

I'm using /metrics endpoint.

> To clarify, is this multiple systems scraping the exporter, which is connected to a single postgres server? Or are there multiple postgres servers?

Multiple systems are scraping one exporter connected to one postgres server.

weastur commented 1 year ago

Got the same after upgrading to 0.14.0.

nicolaiarocci commented 1 year ago

Same here. We went back to 0.13.2 and the open connections are back to normal (we went from 200-ish to 1000-ish as soon as 0.14 went up; yes, we do have many DBs).

sysadmind commented 1 year ago

I think I see the problem now. The instance{} is shared when using /metrics and it's limited to a single connection. I'm working on a fix to clone the instance for each scrape with a separate connection, but it's a bit trickier to test, so it may take a bit of time to work through that.
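
A minimal sketch of that idea, under some simplifying assumptions (the instance struct, copy, and scrape names below are hypothetical, not the actual patch): each scrape works against its own copy of the instance with a fresh connection and closes it when the scrape finishes, so concurrent scrapes never share or tear down each other's connection.

// Sketch of a per-scrape instance copy; names and structure are hypothetical.
package collector

import (
    "database/sql"

    _ "github.com/lib/pq" // driver choice assumed for this sketch
)

type instance struct {
    dsn string
    db  *sql.DB
}

// copy opens a separate *sql.DB for one scrape so concurrent scrapes
// never share (or close) the same connection.
func (i *instance) copy() (*instance, error) {
    db, err := sql.Open("postgres", i.dsn)
    if err != nil {
        return nil, err
    }
    db.SetMaxOpenConns(1)
    db.SetMaxIdleConns(1)
    return &instance{dsn: i.dsn, db: db}, nil
}

// scrape runs against a private copy and always releases its connection,
// which is what prevents both the leak and the "database is closed" errors.
func scrape(base *instance) error {
    inst, err := base.copy()
    if err != nil {
        return err
    }
    defer inst.db.Close()
    // ... run the collectors against inst.db ...
    return inst.db.Ping()
}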

CarpathianUA commented 1 year ago

Experiencing the same

b-a-t commented 1 year ago

> I think I see the problem now. The instance{} is shared when using /metrics and it's limited to a single connection. I'm working on a fix to clone the instance for each scrape with a separate connection, but it's a bit trickier to test, so it may take a bit of time to work through that.

For a moment I thought @btmc was one of my colleagues, as we also have 3 vmagents scraping the same exporter 😄 But, to complicate the setup even more, we access postgres_exporter through the exporter_exporter, which is a dedicated reverse proxy for exporters.

So, in our case, I'm not certain it's easy to distinguish where the scraping connections are coming from. Hopefully the connections from the proxy are distinct enough:

tcp    ESTAB      0      0      127.0.0.1:9187                 127.0.0.1:45690
tcp    ESTAB      0      0      127.0.0.1:45688                127.0.0.1:9187
tcp    ESTAB      0      0      127.0.0.1:45690                127.0.0.1:9187
tcp    ESTAB      0      0      127.0.0.1:9187                 127.0.0.1:45688
GauntletWizard commented 1 year ago

This was pretty bad; it brought down one of our DB servers last night. Any chance you can roll a point release?

Monstrofil commented 1 year ago

@SuperQ this caused downtime on our servers as well. When can we expect a release with the fix?

stepanselyuk commented 9 months ago

I'm unsure whether this is fixed or not, or why it happened twice on our systems, but we are on version 0.15 and the exporter consumed 100 and then 500 connections (after we raised the connection limit).

[graph: postgres_connections_from_127_0_0_1]

and the scraping interval is 15 seconds ...

Docker images in use: docker.io/bitnami/postgres-exporter:0.15.0-debian-11-r7 and docker.io/bitnami/postgres-exporter:0.15.0-debian-12-r13 (with the bitnami/postgresql Helm chart).

[graph: betindex-pg-number-active-connections]

Both times the exporter stopped emitting metrics, which may be an important detail; it stopped at "round" times, 22:00 UTC and 23:00 UTC.

@sysadmind maybe open a new issue for this?