open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
3.02k stars 2.34k forks source link

receiver/postgresql: improve connection management #30831

Closed kevinnoel-be closed 7 months ago

kevinnoel-be commented 8 months ago

Component(s)

receiver/postgresql

Is your feature request related to a problem? Please describe.

We've been trying to enable the postgresql receiver on our fleet but did encounter a "spam log" issue due to how this receiver manages the connection i.e. connect/disconnect on each scrape per database. All of our PostgreSQL are configured, for security/compliance reasons, with log_connections=on & log_disconnections=on which makes this receiver very verbose, especially at this scale (2000+ PGs).

For example, here is an excerpt of PG logs generated during one basic test scrape:

PG logs ``` 2024-01-29 17:09:50.965 UTC [55] LOG: connection received: host=172.18.0.1 port=33724 2024-01-29T17:09:50.984870001Z 2024-01-29 17:09:50.984 UTC [55] LOG: connection authenticated: identity="otelu" method=scram-sha-256 (/var/lib/postgresql/data/pg_hba.conf:100) 2024-01-29T17:09:50.984992982Z 2024-01-29 17:09:50.984 UTC [55] LOG: connection authorized: user=otelu database=postgres 2024-01-29T17:09:50.993220875Z 2024-01-29 17:09:50.991 UTC [56] LOG: connection received: host=172.18.0.1 port=33734 2024-01-29T17:09:50.993247245Z 2024-01-29 17:09:50.992 UTC [57] LOG: connection received: host=172.18.0.1 port=33738 2024-01-29T17:09:50.999776857Z 2024-01-29 17:09:50.999 UTC [56] LOG: connection authenticated: identity="otelu" method=scram-sha-256 (/var/lib/postgresql/data/pg_hba.conf:100) 2024-01-29T17:09:50.999804090Z 2024-01-29 17:09:50.999 UTC [56] LOG: connection authorized: user=otelu database=postgres 2024-01-29T17:09:51.004496467Z 2024-01-29 17:09:51.004 UTC [57] LOG: connection authenticated: identity="otelu" method=scram-sha-256 (/var/lib/postgresql/data/pg_hba.conf:100) 2024-01-29T17:09:51.004515817Z 2024-01-29 17:09:51.004 UTC [57] LOG: connection authorized: user=otelu database=postgres 2024-01-29T17:09:51.008993235Z 2024-01-29 17:09:51.008 UTC [57] LOG: disconnection: session time: 0:00:00.016 user=otelu database=postgres host=172.18.0.1 port=33738 2024-01-29T17:09:51.010942781Z 2024-01-29 17:09:51.010 UTC [58] LOG: connection received: host=172.18.0.1 port=33740 2024-01-29T17:09:51.015009416Z 2024-01-29 17:09:51.014 UTC [58] LOG: connection authenticated: identity="otelu" method=scram-sha-256 (/var/lib/postgresql/data/pg_hba.conf:100) 2024-01-29T17:09:51.015032320Z 2024-01-29 17:09:51.014 UTC [58] LOG: connection authorized: user=otelu database=postgres 2024-01-29T17:09:51.027208786Z 2024-01-29 17:09:51.027 UTC [59] LOG: connection received: host=172.18.0.1 port=33750 2024-01-29T17:09:51.031809257Z 2024-01-29 17:09:51.031 UTC [59] LOG: connection authenticated: identity="otelu" method=scram-sha-256 (/var/lib/postgresql/data/pg_hba.conf:100) 2024-01-29T17:09:51.031827320Z 2024-01-29 17:09:51.031 UTC [59] LOG: connection authorized: user=otelu database=otel2 2024-01-29T17:09:51.057199872Z 2024-01-29 17:09:51.057 UTC [59] LOG: disconnection: session time: 0:00:00.030 user=otelu database=otel2 host=172.18.0.1 port=33750 2024-01-29T17:09:51.057529056Z 2024-01-29 17:09:51.057 UTC [56] LOG: disconnection: session time: 0:00:00.066 user=otelu database=postgres host=172.18.0.1 port=33734 2024-01-29T17:09:51.057537967Z 2024-01-29 17:09:51.057 UTC [58] LOG: disconnection: session time: 0:00:00.046 user=otelu database=postgres host=172.18.0.1 port=33740 2024-01-29T17:09:51.059634469Z 2024-01-29 17:09:51.057 UTC [55] LOG: disconnection: session time: 0:00:00.093 user=otelu database=postgres host=172.18.0.1 port=33724 2024-01-29T17:10:00.965055493Z 2024-01-29 17:10:00.964 UTC [61] LOG: connection received: host=172.18.0.1 port=60680 2024-01-29T17:10:00.969273549Z 2024-01-29 17:10:00.969 UTC [61] LOG: connection authenticated: identity="otelu" method=scram-sha-256 (/var/lib/postgresql/data/pg_hba.conf:100) 2024-01-29T17:10:00.969294007Z 2024-01-29 17:10:00.969 UTC [61] LOG: connection authorized: user=otelu database=postgres 2024-01-29T17:10:00.973458607Z 2024-01-29 17:10:00.973 UTC [62] LOG: connection received: host=172.18.0.1 port=60696 2024-01-29T17:10:00.974047716Z 2024-01-29 17:10:00.973 UTC [63] LOG: connection received: host=172.18.0.1 port=60700 2024-01-29T17:10:00.977926682Z 2024-01-29 17:10:00.977 UTC [62] LOG: connection authenticated: identity="otelu" method=scram-sha-256 (/var/lib/postgresql/data/pg_hba.conf:100) 2024-01-29T17:10:00.977940513Z 2024-01-29 17:10:00.977 UTC [62] LOG: connection authorized: user=otelu database=postgres 2024-01-29T17:10:00.980855639Z 2024-01-29 17:10:00.980 UTC [63] LOG: connection authenticated: identity="otelu" method=scram-sha-256 (/var/lib/postgresql/data/pg_hba.conf:100) 2024-01-29T17:10:00.980869061Z 2024-01-29 17:10:00.980 UTC [63] LOG: connection authorized: user=otelu database=postgres 2024-01-29T17:10:00.987216600Z 2024-01-29 17:10:00.985 UTC [63] LOG: disconnection: session time: 0:00:00.011 user=otelu database=postgres host=172.18.0.1 port=60700 2024-01-29T17:10:00.987237072Z 2024-01-29 17:10:00.986 UTC [64] LOG: connection received: host=172.18.0.1 port=60716 2024-01-29T17:10:00.990905486Z 2024-01-29 17:10:00.990 UTC [64] LOG: connection authenticated: identity="otelu" method=scram-sha-256 (/var/lib/postgresql/data/pg_hba.conf:100) 2024-01-29T17:10:00.990924449Z 2024-01-29 17:10:00.990 UTC [64] LOG: connection authorized: user=otelu database=postgres 2024-01-29T17:10:01.003217141Z 2024-01-29 17:10:01.002 UTC [65] LOG: connection received: host=172.18.0.1 port=60722 2024-01-29T17:10:01.007543161Z 2024-01-29 17:10:01.006 UTC [65] LOG: connection authenticated: identity="otelu" method=scram-sha-256 (/var/lib/postgresql/data/pg_hba.conf:100) 2024-01-29T17:10:01.007561344Z 2024-01-29 17:10:01.006 UTC [65] LOG: connection authorized: user=otelu database=otel2 2024-01-29T17:10:01.035387631Z 2024-01-29 17:10:01.033 UTC [65] LOG: disconnection: session time: 0:00:00.030 user=otelu database=otel2 host=172.18.0.1 port=60722 2024-01-29T17:10:01.035411643Z 2024-01-29 17:10:01.033 UTC [64] LOG: disconnection: session time: 0:00:00.046 user=otelu database=postgres host=172.18.0.1 port=60716 2024-01-29T17:10:01.035414531Z 2024-01-29 17:10:01.033 UTC [62] LOG: disconnection: session time: 0:00:00.060 user=otelu database=postgres host=172.18.0.1 port=60696 2024-01-29T17:10:01.035980287Z 2024-01-29 17:10:01.033 UTC [61] LOG: disconnection: session time: 0:00:00.069 user=otelu database=postgres host=172.18.0.1 port=60680 ```

This prevents us using/rolling out this receiver more widely as we cannot drop logs nor disable those log configurations on the PostgreSQL side either.

Describe the solution you'd like

Connection reuse/pooling to avoid having these spike of logs on each scrape interval across our fleet e.g. by using a connection pool (maybe?). Could be an opt-in configuration of course.

Describe alternatives you've considered

Forking the receiver in our internal custom build

Additional context

To get an idea, I've ran the integration test locally with both PG configurations log_connections & log_disconnections enabled:

github-actions[bot] commented 8 months ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

djaglowski commented 8 months ago

This seems like a reasonable improvement to me. @kevinnoel-be are you interested in submitting a PR for this?

kevinnoel-be commented 8 months ago

@djaglowski We would prefer to have this in upstream so yes I'd be willing to take a stab at this. I'd be curious how you see those changes happening:

djaglowski commented 8 months ago

Thanks @kevinnoel-be, I'll assign the issue to you and look forward to your PR.

  • keep current behavior (even lib/pq module?) by default and optionally enable pooling?
  • enable pooling by default and replace lib/pq altogether?

Since this is a substantial change to the internals of the receiver, I think we must manage this transition with a feature gate. Basically, we'll have to give the user a choice between the libraries. When the gate is alpha, lib/pq is used by default, but optionally you can use the other. Then once this has been validated by some users, we can bump the gate to beta, which makes the other library default, but still leaves an escape hatch to switch back to lib/pq. Once this has been well validated, we can mark the gate as stable and eventually remove it.

should we expose some/most/all of the pooling config

We can work out details here but I think the following guidelines should apply:

  1. Since all these settings are naturally related, we should group them under a subsection of the config. (e.g. connection_pool)
  2. We should start with minimal settings but can expose additional settings over time as necessary.
kevinnoel-be commented 8 months ago

Thanks @djaglowski I'll take those in account for creating the PR.

Another question that'll drive which version of pgx we can use for those changes here: I see that this receiver is marked as supporting PG 9.6+ in the README. Do we still want to keep this much backwards compatibility given that this is quite an old version now (list of PG versions here). It seems that the latest supported version is PG 12 with still ~9 months of support.

djaglowski commented 8 months ago

@kevinnoel-be, I think it's probably reasonable to drop support for older versions but this should be a separate issue so others have a chance to comment.

kevinnoel-be commented 8 months ago

@djaglowski I've created https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/30923 to open the discussion around the version support.