
sql_exporter is losing metrics if compute is very busy #9960

Open Bodobolero opened 1 day ago

Bodobolero commented 1 day ago

Steps to reproduce

Run the ingest benchmark (see the ingest benchmark doc)

Expected result

We see metrics collected by sql_exporter for the complete run

Actual result

We are losing metrics, most likely because sql_exporter is exceeding its scrape_timeout.

We observe this especially when there is a large amount of backpressure from the pageserver (PS) to compute.
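
For reference, the per-scrape budget comes from the exporter's global config. A minimal sketch of the relevant knobs, assuming the upstream sql_exporter config format and using illustrative values rather than our actual settings:

```yaml
# sql_exporter.yml (illustrative values, not our actual config)
global:
  # Per-scrape budget: if the collector queries don't finish in time,
  # the scrape fails and the metrics are missing for that interval.
  scrape_timeout: 10s
  # Subtracted from the timeout advertised by the scraper, so the
  # exporter gives up cleanly before the scraper itself times out.
  scrape_timeout_offset: 500ms
```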

Environment

staging

Logs, links

https://neonprod.grafana.net/d/de3mupf4g68e8e/perf-test3a-ingest-benchmark?orgId=1&from[…]ge_tenant_endpoint_id=ep-misty-river-w2vdg495&viewPanel=19

first reported here

Another observation of this, probably related:

https://neondb.slack.com/archives/C04DGM6SMTM/p1731526874214679

ololobus commented 1 day ago

Previous thread re this problem https://neondb.slack.com/archives/C04DGM6SMTM/p1731526874214679

Ultimately, on each scrape sql_exporter runs all the SQL queries specified in the metrics config. So if the compute is loaded, the SQL becomes slower and we see these gaps.
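
For context, every metric in the collector config is backed by a SQL query that runs on each scrape, so a busy compute slows all of them down at once. A rough sketch of that mapping, assuming the upstream sql_exporter collector syntax and a hypothetical metric:

```yaml
# example.collector.yml (hypothetical metric, for illustration only)
collector_name: example_collector
metrics:
  - metric_name: db_total_size_bytes
    type: gauge
    help: 'Total size of all databases, in bytes.'
    values: [total]
    # Executed on every scrape; under heavy load this single query
    # can eat a large part of the scrape_timeout on its own.
    query: |
      SELECT sum(pg_database_size(datname)) AS total FROM pg_database;
```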

So what are the options we have?

  1. Try to identify the heaviest queries and optimize them
  2. Decouple the collection and reporting flows (there are tools that do that, e.g. Telegraf, iiuc). In this case, the scrape itself becomes very fast and can have a much longer timeout, but instead of gaps we may see stale metrics (up to the configured interval). See my comment about Telegraf (not 100% sure I understand it correctly); @mickael-carl was against this approach. A rough sketch of the same stale-instead-of-missing trade-off follows after this list.
  3. Switch as many metrics as possible from collection via SQL to maintaining counters/histograms online, and report them from a Prometheus endpoint in our extension, for example. This may also be combined with 1. I'm not sure this path is totally realistic, though: some Postgres statistics are maintained in the catalog or inside Postgres shared memory structures (the database size, for example), so we would need a lot of core patching, and it's still not clear it would work well. I think we need to discuss that, maybe there are some low-hanging fruits
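
As a rough illustration of the stale-instead-of-missing trade-off from 2, without pulling in Telegraf: if I read the upstream sql_exporter docs correctly, the global min_interval setting makes collectors re-run at most that often and serve cached results to scrapes in between. Values here are illustrative, not a tested config:

```yaml
# sql_exporter.yml (illustrative; min_interval behavior per upstream docs)
global:
  scrape_timeout: 10s
  # Collectors are re-run at most once per min_interval; scrapes in
  # between return the previously collected (possibly stale) values
  # instead of re-running all the SQL and risking a timeout.
  min_interval: 30s
```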
ololobus commented 3 hours ago

Moved to backlog because we don't have any good ideas on how to fix it, except exploring another tool like Telegraf

@tristan957 suggests that we can bump the sql_exporter version

ololobus commented 3 hours ago

Thread about timeout issues: it looks like we currently scrape every 10s, so we cannot bump the timeout significantly
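
To make the constraint concrete: with a Prometheus-style scraper the per-job scrape_timeout cannot exceed scrape_interval, so a 10s interval caps the queries at roughly 10s. A sketch, where the job name, target host, and port are assumptions:

```yaml
# Prometheus scrape config (illustrative)
scrape_configs:
  - job_name: sql_exporter
    scrape_interval: 10s
    # Must be <= scrape_interval, so ~10s is the ceiling here;
    # raising the timeout further means raising the interval too.
    scrape_timeout: 10s
    static_configs:
      - targets: ['compute-host:9399']   # assumed default sql_exporter port
```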

ololobus commented 3 hours ago

Another piece of info from Tristan: sql_exporter seems to have its own metrics

Only metrics defined by collectors are exported on the /metrics endpoint. SQL Exporter process metrics are exported at /sql_exporter_metrics.
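
If we want those process metrics as well, they would need their own scrape job pointed at that path. A sketch, with the target host and port as assumptions:

```yaml
scrape_configs:
  - job_name: sql_exporter_process
    # sql_exporter's own process metrics, separate from the collector metrics.
    metrics_path: /sql_exporter_metrics
    static_configs:
      - targets: ['compute-host:9399']   # assumed sql_exporter listen address
```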