grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

Increasing number of goroutines when PostgreSQL cannot be accessed #1929

Open · KaczDev opened this issue 2 weeks ago

KaczDev commented 2 weeks ago

What's wrong?

I have noticed that one of our production servers reports far more goroutines than our other instances (screenshot of goroutine counts attached).

When I checked the logs, I found that there were issues connecting to the PostgreSQL database through prometheus.exporter.postgres.

I have reproduced the issue with an Alloy instance whose postgres exporter points at a database it cannot connect to (for example, rejected by pg_hba.conf, or the database simply does not exist); the goroutine count grows steadily (screenshot attached).

Steps to reproduce

Create a prometheus.exporter.postgres component, point data_source_names at a PostgreSQL URL that does not exist or cannot be reached, and scrape it.

System information

No response

Software version

Grafana Alloy v1.4.2

Configuration

logging {
    level  = "info"
    format = "logfmt"
}

prometheus.exporter.postgres "POSTGRESQL_METRICS" {
    data_source_names = ["postgresql://postgres:postgres@postgres:5432/postgres?sslmode=disable"]
    autodiscovery {
        enabled = true
    }
}

prometheus.scrape "SCRAPE_POSTGRESQL_METRICS" {
    targets    = prometheus.exporter.postgres.POSTGRESQL_METRICS.targets
    forward_to = [prometheus.remote_write.PROMETHEUS_REMOTE_WRITE.receiver]
    // Shortened the scrape interval to reproduce the issue faster.
    scrape_interval = "5s"
    scrape_timeout  = "4s"
}

prometheus.exporter.self "SELF_REPORT" {}

prometheus.scrape "SELF_SCRAPER" {
    targets    = prometheus.exporter.self.SELF_REPORT.targets
    forward_to = [prometheus.remote_write.PROMETHEUS_REMOTE_WRITE.receiver]
    scrape_interval = "15s"
}

prometheus.remote_write "PROMETHEUS_REMOTE_WRITE" {
    endpoint {
        url = format("http://%s:%s/api/v1/push", env("MIMIR_URL"), env("MIMIR_PORT")) 
    }
}

Logs

ts=2024-10-18T15:36:51.273854616Z level=info msg="Established new database connection" component_path=/ component_id=prometheus.exporter.postgres.POSTGRESQL_METRICS fingerprint=postgres:5432
ts=2024-10-18T15:36:51.348885966Z level=error msg="Error opening connection to database" component_path=/ component_id=prometheus.exporter.postgres.POSTGRESQL_METRICS err="error querying postgresql version: dial tcp: lookup postgres on 127.0.0.11:53: server misbehaving"
ts=2024-10-18T15:36:52.34972657Z level=error msg="Error opening connection to database" component_path=/ component_id=prometheus.exporter.postgres.POSTGRESQL_METRICS dsn="postgresql://postgres:PASSWORD_REMOVED@postgres:5432/postgres?sslmode=disable" err="dial tcp: lookup postgres on 127.0.0.11:53: server misbehaving"
dehaansa commented 1 week ago

This appears to be an issue that was resolved in newer releases of the postgres prometheus exporter. Will take a look at updating the exporter version that is bundled with Alloy. Thanks for the detailed bug report!

dehaansa commented 1 week ago

After some testing, it appears that while other connection leaks have been fixed, this is still an issue in v0.15.0 of the prometheus exporter. Looking into fixing the issue upstream.
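
For anyone wondering what this kind of connection leak looks like in practice, here is a minimal Go sketch. It is an illustration of the general pattern, not the actual postgres_exporter or Alloy code: it assumes a fresh database/sql handle is opened per scrape, and relies on the fact that database/sql starts a background connection-opener goroutine for every *sql.DB that only exits once Close is called. The host name unreachable-host and the helper scrapeOnce are made up for the example.

package main

// Hypothetical sketch of the leak pattern discussed above; not the exporter's code.
// database/sql starts a background connection-opener goroutine for every *sql.DB,
// and that goroutine only exits when Close() is called on the handle.

import (
    "database/sql"
    "fmt"
    "runtime"
    "time"

    _ "github.com/lib/pq" // registers the "postgres" driver
)

// scrapeOnce simulates a single scrape that opens a new database handle.
func scrapeOnce(dsn string) error {
    db, err := sql.Open("postgres", dsn) // does not dial yet, so this succeeds even for an unreachable host
    if err != nil {
        return err
    }
    if err := db.Ping(); err != nil {
        // Leak: returning without db.Close() leaves the handle's background
        // goroutine running forever. The fix is to defer db.Close() right
        // after sql.Open (or close explicitly in this error branch).
        return err
    }
    defer db.Close()
    // ... run the collector queries here ...
    return nil
}

func main() {
    dsn := "postgresql://postgres:postgres@unreachable-host:5432/postgres?sslmode=disable"
    for i := 0; i < 5; i++ {
        _ = scrapeOnce(dsn)
        // With the leak in place, this count grows by at least one per iteration.
        fmt.Println("goroutines:", runtime.NumGoroutine())
        time.Sleep(time.Second)
    }
}

The exporter's real code paths are more involved, but the general shape of the fix is the same idea: every opened handle must be closed even when the initial connection attempt fails, otherwise each failed scrape permanently adds at least one goroutine.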