Make metrics LFC number of hits, number of misses and working set size available to autoscaling algorithm

stradig commented 5 months ago

We want to be able to experiment with the algorithm to see which of those values can improve performance for autoscaled computes.

- [ ] neondatabase/autoscaling#895
- [ ] neondatabase/neon#7514
- [ ] neondatabase/cloud#14245
- [ ] neondatabase/neon#7466
- [ ] neondatabase/neon#8298
- [ ] #1003
- [ ] neondatabase/cloud#14929

skyzh commented 5 months ago

Need to investigate how to export data using SQL statements. This does not seem to be supported by vector.dev.

sharnoff commented 5 months ago

IIRC the existing metrics are exposed by sql-exporter — I think vector could just pull from there, if we want to expose it via vector.

skyzh commented 5 months ago

yep, I found https://vector.dev/docs/reference/configuration/sources/prometheus_scrape/ that directly scrapes exporter data.

Omrigan commented 4 months ago

So we have 4 possible ways to go forwad:

Fetch from vector (#878)
- Disadvantage: adds an additional delay between sql-exporter and vector
Fetch from sql-exporter (#895)
- Disadvantage: sql-exporter fetches a number of things and it might overload the database if we fetch it every 5-15s
Fetch from vm-monitor (https://github.com/neondatabase/neon/pull/7302#event-12376151432)
- Disadvantage: one more place to implement working with metrics
Fetch directly from postgres
- Disadvantage: breaks abstraction layers, needs somehow to put credentials into the autoscaler-agent

@skyzh @sharnoff sounds correct? Which ones do you prefer?

sharnoff commented 4 months ago

My thoughts — I want to avoid adding tech debt by linking together components that weren't previously linked.

Fetch from vector — modifies vector here to support sql-exporter in neondatabase/neon, adding a new link. Also has the downside of repeating metric values because the autoscaler-agent fetch frequency would be greater than vector's refresh frequency.
Fetch from sql-exporter — mostly doesn't add a new link beyond what's required for this issue; the autoscaler-agent already fetches prometheus metrics from the VM. That's why I went with this approach.
Fetch from vm-monitor — adds a new responsibility to vm-monitor, and would also require additional support in the autoscaler-agent. All work done on the autoscaler-agent <-> vm-monitor protocol should be approached with hazmat suits for now. It does what we need it to, but it needs a lot of work, and I'm hesitant to add more responsibilities to it until after some refactoring has taken place.
Fetch directly from postgres — adds a new link between autoscaler-agent and postgres, like you said @Omrigan. And yeah, credentials would be quite tricky, requiring help from other components we don't currently rely on.

re:

sql-exporter fetches a number of things and it might overload the database if we fetch it every 5-15s

The current state of #895 is to have a configurable port and frequency — we can fetch as slow as we need to. For the ext-metrics datasources, we already do query every 15s (or maybe even more frequently?). Once a secondary sql-exporter is added with just the cheap metrics, we can e.g. add support for gradual rollout of fetching from a different port, faster, eventually switching everything over once old VMs restart.

Omrigan commented 4 months ago

@skyzh Can you share your opinion on options 2 vs 3?

skyzh commented 4 months ago

If we want to have a second sql-exporter, I'm fine with either option 2 or 3. Otherwise, there needs to be a place to fetch these metrics, and it is easier to happen in vm-monitor.

skyzh commented 4 months ago

...to be specific, I assume that autoscaling agent will at some point scrape these data at a high frequency, and I don't want these SQLs to be executed when we scrape sql-exporter:

https://github.com/neondatabase/neon/blob/2d5a8462c8093fb7db7e15cea68c6d740818c39c/vm-image-spec.yaml#L161-L188

Therefore, I'm proposing not go into the normal metrics sql-exporter for autoscaling metrics.

sharnoff commented 4 months ago

Discussed briefly with @skyzh — tl;dr:

Medium-term, we want to avoid having the autoscaler-agent pull LFC metrics from the main sql-exporter
Short-term:
1. We can have the autoscaler-agent pull metrics from the existing sql-exporter, just with a low frequency so we don't overload postgres
2. We can set up a second sql-exporter to just report LFC metrics
Then, we can have control plane set some annotation on new VMs to tell the autoscaler-agent to fetch LFC metrics with a higher frequency from the new port — giving the desired end state while retaining support for older VMs.

sharnoff commented 2 months ago

Status:

895 is ready to merge, just was waiting to avoid interfering with patch release
We found out the new metrics weren't exposed. PR to fix is neondatabase/cloud#14245
Remaining work after that is actually using the metrics (design + implementation of new scaling algorithm, maybe?)

sharnoff commented 3 weeks ago

Earlier this week, LFC-aware scaling was completely rolled out to all regions. Closing this :)

neondatabase / autoscaling

Make metrics LFC number of hits, number of misses and working set size available to autoscaling algorithm #872

895 is ready to merge, just was waiting to avoid interfering with patch release