neondatabase / autoscaling

Postgres vertical autoscaling in k8s
Apache License 2.0
150 stars 20 forks source link

agent: Support fetching LFC metrics (but don't use them yet) #895

Closed sharnoff closed 2 months ago

sharnoff commented 5 months ago

Probably best to review each commit separately.

Most of the commits here are just repositioning to make the final commit, which adds LFC metrics collection, as simple as possible.

Builds on #892, must not be merged before then.

skyzh commented 4 months ago

Discussion from the 1-1 meeting: scraping sql_exporter from the autoscaling-agent is the right way to go. I'll add a separate sql_exporter solely for autoscaling metrics. For now, we can just scrape the current one with a reasonable interval.

sharnoff commented 3 months ago

Discussion with @Omrigan : It's fine to just hard-code the metric names, and that makes it a lot simpler. Some discussion about whether to encapsulate the callbacks for metrics collection into a single struct with func(...) fields or an interface.

sharnoff commented 3 months ago

New changes:

  1. Move metric names into the code (rather than config) and remove the generics it required. Diff here: 4b06d09..0bb2b59.
  2. Fixed a bug where we'd always use the "system" metrics port, even for LFC metrics: da48a4cd3f89ef8eaceaf620e7e6faff781aef77 (NOTE: this was from reading the code; I haven't yet tested it)
  3. Switch from unnamed callbacks to a new metricsMgr[M] struct to contain them and the kind string: 5dc98a95675fcb6573278834a688b123eda89149
  4. Switch from the emptyMetrics() constructor callback to a generics-based approach: 031cc2d13a74e4c232122053404b6119cae04f47
    • cc @Omrigan particularly want your feedback on this one. I could go either way on this -- left it as a separate commit so it's easy to revert.
  5. Deduplicate metrics config parsing: 86cb9f4a277a5357f95920a82d25b074eef3e60a + c1ae8d3b1ed5c2badcf0d3873b939b6be70137a9

edit: And the first commit has been spun off into a separate PR: #948.

sharnoff commented 3 months ago

Continuing with the changelog approach, if only because it helps me manage the rebasing:

  1. Reverted 031cc2d13a74e4c232122053404b6119cae04f47 ("tmp: Switch from emptyMetrics() callback to generics"): 60f835b5fc21bc867d8b3a466f1896a22d44f6dc
  2. Squashed everything back down: 60f835b..666e11f
  3. Rebased onto main: 666e11f02bae311c0ac57dbe129e42f1487262db -> 34afd60dc06f5db3551ae1893054716a65cf5b2e
  4. Renamed newMetrics -> updateMetrics: cb14dafe0dcefdb61bec02e365c9626af81b2cfa+b9e3ebd1ea51a7d3730b965d3ecd77c0318776f4
  5. Fix outdated comment on FromPrometheus: 662613d170363fa9cf240e5a0225e0c2f1c09674
  6. Renamed the c/l context/logger variables to ctx2/logger2: efdda7c819bbfe1a5178b884d9803438b5cebba1
  7. Remove 5-minute load average from the system metrics names: 589f6b7485a97732d1297610338f7e5b6e1a0833
  8. Rework system metrics string constants: 66b92893a804fc0f17b84f7c67358d9610573cb1

Plus a couple more changes to #948 based on review here.

sharnoff commented 2 months ago

Final changes:

  1. Squash previous commits. Diff here (empty): 66b9289..20bd2ba
  2. Rebase onto main (dropping #948 from this branch). Diff here: 20bd2ba..fb2f910
  3. Fix compile error from rebase: 689e18d0d66b935f9c0ffba6cae5df2e4f8f8306
  4. Change metricsMgr[M any] -> metricsMgr[M core.FromPrometheus]: d0991a9d2447ff1c2a37e0cf2241422df51dea52

Waiting on some issues with latest release. Will hold off on merging for now to avoid accidentally adding much larger changes alongside a small fix.