neondatabase / autoscaling

Postgres vertical autoscaling in k8s
Apache License 2.0
150 stars 20 forks source link

Make metrics LFC number of hits, number of misses and working set size available to autoscaling algorithm #872

Closed stradig closed 3 weeks ago

stradig commented 5 months ago

We want to be able to experiment with the algorithm to see which of those values can improve performance for autoscaled computes.

- [ ] neondatabase/autoscaling#895
- [ ] neondatabase/neon#7514
- [ ] neondatabase/cloud#14245
- [ ] neondatabase/neon#7466
- [ ] neondatabase/neon#8298
- [ ] #1003
- [ ] neondatabase/cloud#14929
skyzh commented 5 months ago

Need to investigate how to export data using SQL statements. This does not seem to be supported by vector.dev.

sharnoff commented 5 months ago

IIRC the existing metrics are exposed by sql-exporter — I think vector could just pull from there, if we want to expose it via vector.

skyzh commented 5 months ago

yep, I found https://vector.dev/docs/reference/configuration/sources/prometheus_scrape/ that directly scrapes exporter data.

Omrigan commented 4 months ago

So we have 4 possible ways to go forwad:

  1. Fetch from vector (#878)
    • Disadvantage: adds an additional delay between sql-exporter and vector
  2. Fetch from sql-exporter (#895)
    • Disadvantage: sql-exporter fetches a number of things and it might overload the database if we fetch it every 5-15s
  3. Fetch from vm-monitor (https://github.com/neondatabase/neon/pull/7302#event-12376151432)
    • Disadvantage: one more place to implement working with metrics
  4. Fetch directly from postgres
    • Disadvantage: breaks abstraction layers, needs somehow to put credentials into the autoscaler-agent

@skyzh @sharnoff sounds correct? Which ones do you prefer?

sharnoff commented 4 months ago

My thoughts — I want to avoid adding tech debt by linking together components that weren't previously linked.

  1. Fetch from vector — modifies vector here to support sql-exporter in neondatabase/neon, adding a new link. Also has the downside of repeating metric values because the autoscaler-agent fetch frequency would be greater than vector's refresh frequency.
  2. Fetch from sql-exporter — mostly doesn't add a new link beyond what's required for this issue; the autoscaler-agent already fetches prometheus metrics from the VM. That's why I went with this approach.
  3. Fetch from vm-monitor — adds a new responsibility to vm-monitor, and would also require additional support in the autoscaler-agent. All work done on the autoscaler-agent <-> vm-monitor protocol should be approached with hazmat suits for now. It does what we need it to, but it needs a lot of work, and I'm hesitant to add more responsibilities to it until after some refactoring has taken place.
  4. Fetch directly from postgres — adds a new link between autoscaler-agent and postgres, like you said @Omrigan. And yeah, credentials would be quite tricky, requiring help from other components we don't currently rely on.

re:

sql-exporter fetches a number of things and it might overload the database if we fetch it every 5-15s

The current state of #895 is to have a configurable port and frequency — we can fetch as slow as we need to. For the ext-metrics datasources, we already do query every 15s (or maybe even more frequently?). Once a secondary sql-exporter is added with just the cheap metrics, we can e.g. add support for gradual rollout of fetching from a different port, faster, eventually switching everything over once old VMs restart.

Omrigan commented 4 months ago

@skyzh Can you share your opinion on options 2 vs 3?

skyzh commented 4 months ago

If we want to have a second sql-exporter, I'm fine with either option 2 or 3. Otherwise, there needs to be a place to fetch these metrics, and it is easier to happen in vm-monitor.

skyzh commented 4 months ago

...to be specific, I assume that autoscaling agent will at some point scrape these data at a high frequency, and I don't want these SQLs to be executed when we scrape sql-exporter:

https://github.com/neondatabase/neon/blob/2d5a8462c8093fb7db7e15cea68c6d740818c39c/vm-image-spec.yaml#L161-L188

Therefore, I'm proposing not go into the normal metrics sql-exporter for autoscaling metrics.

sharnoff commented 4 months ago

Discussed briefly with @skyzh — tl;dr:

sharnoff commented 2 months ago

Status:

sharnoff commented 3 weeks ago

Earlier this week, LFC-aware scaling was completely rolled out to all regions. Closing this :)