vm-builder: add SQL exporter to vector

skyzh commented 5 months ago

ref https://github.com/neondatabase/neon/pull/5949 close https://github.com/neondatabase/autoscaling/issues/872

This pull request makes the LFC metrics available at the vector metrics endpoint. We use the default scrape interval to avoid overloading the database.

skyzh commented 5 months ago

Welcome to Alpine!
 ~ This is the VM :) ~
neonvm:~# curl 10.0.2.18:9100/metrics | grep lfc
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 26016  100 26016    0     0  12.1M      0 --:--:-- --:--:-- --:--:-- 24.8M
# HELP lfc_cache_size_limit lfc_cache_size_limit
# TYPE lfc_cache_size_limit gauge
lfc_cache_size_limit 12700352512 1711465391821
# HELP lfc_hits lfc_hits
# TYPE lfc_hits gauge
lfc_hits 0 1711465391821
# HELP lfc_misses lfc_misses
# TYPE lfc_misses gauge
lfc_misses 374 1711465391821
# HELP lfc_used lfc_used
# TYPE lfc_used gauge
lfc_used 111 1711465391821
# HELP lfc_writes lfc_writes
# TYPE lfc_writes gauge
lfc_writes 374 1711465391821
neonvm:~#
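For readers who want to consume this output programmatically, here is a minimal sketch of parsing the label-free samples shown above and deriving a cache hit ratio. The function names (`parse_metrics`, `hit_ratio`) are made up for illustration and are not part of vector, sql-exporter, or the autoscaler-agent; the parser assumes samples without labels, as in the output above.

```python
def parse_metrics(text, prefix="lfc_"):
    """Parse Prometheus text-format samples whose name starts with `prefix`.

    Returns {name: (value, timestamp_ms or None)}. Labels are not handled,
    because the samples in the output above carry none (an assumption).
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) < 2 or not parts[0].startswith(prefix):
            continue
        name, value = parts[0], float(parts[1])
        timestamp_ms = int(parts[2]) if len(parts) > 2 else None
        samples[name] = (value, timestamp_ms)
    return samples


def hit_ratio(samples):
    """Return hits / (hits + misses), or None when there were no lookups."""
    hits = samples["lfc_hits"][0]
    misses = samples["lfc_misses"][0]
    total = hits + misses
    return hits / total if total else None


# Sample lines copied from the curl output above.
SAMPLE = """\
# HELP lfc_hits lfc_hits
# TYPE lfc_hits gauge
lfc_hits 0 1711465391821
# HELP lfc_misses lfc_misses
# TYPE lfc_misses gauge
lfc_misses 374 1711465391821
"""

if __name__ == "__main__":
    parsed = parse_metrics(SAMPLE)
    print(parsed)
    print(hit_ratio(parsed))
```

With the values above (0 hits, 374 misses) the ratio is 0.0, which is what one would expect from a freshly started VM.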

skyzh commented 5 months ago

mark ready for review

Bodobolero commented 5 months ago

@skyzh pls also add the working set size in pages to the sql_exporter as a metric https://github.com/neondatabase/neon/blob/8dfe3a070cd04dd2310ed07e1f38f4257dd43cd8/vm-image-spec.yaml#L188

select approximate_working_set_size(false);

skyzh commented 5 months ago

I'm using the default scrape interval for sql exporters, which is usually 15 seconds (the 1-second interval is for host information).

skyzh commented 5 months ago

select approximate_working_set_size(false);

I will open a separate pull request on the compute side for this so that it makes it into the next release.

sharnoff commented 5 months ago

It is not just using resources in compute but also in our victoriametrics database if we scrape so frequently?

These metrics are exclusively consumed by the autoscaler-agent and not persisted anywhere.

The autoscaler-agent fetches metrics (and potentially makes a scaling decision) every 5 seconds.

IMO it's worthwhile to not need to worry about whether those metrics are stale; I don't immediately see a reason not to fetch sql-exporter metrics every 5 seconds (or even every 1 second) - presumably it's pretty cheap?
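To make the staleness concern concrete, here is a rough sketch of how old the newest sample is at each decision tick, given a scrape interval and a decision interval. It assumes both schedules start at t = 0 and never drift, which is a simplification; the function name `sample_ages` is made up for illustration.

```python
def sample_ages(scrape_interval, decision_interval, horizon):
    """Age in seconds of the newest scrape at each decision tick.

    Assumes scrapes happen at t = 0, scrape_interval, 2 * scrape_interval, ...
    and decisions at t = 0, decision_interval, ... up to `horizon`.
    Real schedules drift and scrapes take time, so this is only a model.
    """
    ages = []
    t = 0
    while t <= horizon:
        # Time of the most recent scrape at or before t.
        last_scrape = (t // scrape_interval) * scrape_interval
        ages.append(t - last_scrape)
        t += decision_interval
    return ages


if __name__ == "__main__":
    # 15-second scrapes read by a 5-second decision loop.
    print(sample_ages(15, 5, 30))
    # Scrapes at the same 5-second cadence as the decision loop.
    print(sample_ages(5, 5, 30))
```

In this aligned model, a 15-second scrape interval means two out of every three decisions see data that is 5 or 10 seconds old. If the two schedules are not aligned, the age can approach the full scrape interval.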

skyzh commented 5 months ago

There are some aggregations and joins on the system catalog, and therefore, if there are a lot of tables, it might be slow to execute.

https://github.com/neondatabase/neon/blob/47d2b3a4830f6d5ecb84086e785ec0f913390176/vm-image-spec.yaml#L162-L172

A better approach is to split this into two sql exporters: one for the metrics that are cheap to scrape and one for the expensive ones. All LFC metrics are basically O(1) operations when being retrieved from the database.
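The cheap/expensive split can be sketched as two collectors with different minimum intervals. This is only an illustration of the scheduling idea, not sql-exporter's configuration format; the `Collector` class, the `catalog_stats` metric name, and the 5-second and 15-second intervals are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Collector:
    """A group of metrics that is collected together on one interval."""
    name: str
    interval: float  # minimum seconds between runs
    metrics: list
    last_run: float = field(default=float("-inf"))

    def due(self, now):
        return now - self.last_run >= self.interval


def run_due(collectors, now):
    """Mark every collector that is due at `now` as run; return their names."""
    ran = []
    for collector in collectors:
        if collector.due(now):
            collector.last_run = now
            ran.append(collector.name)
    return ran


def make_collectors():
    # Hypothetical grouping: the O(1) LFC metrics go in the cheap group, and
    # the catalog aggregation/join queries go in the expensive group.
    return [
        Collector("cheap", 5, ["lfc_hits", "lfc_misses", "lfc_used", "lfc_writes"]),
        Collector("expensive", 15, ["catalog_stats"]),
    ]


if __name__ == "__main__":
    collectors = make_collectors()
    for now in (0, 5, 10, 15):
        print(now, run_due(collectors, now))
```

The point of the split is that the cheap group can be scraped as often as the consumer needs without dragging the catalog queries along at the same rate.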

skyzh commented 5 months ago

This pull request will be merged once the compute + console release is done next week. There is still some leftover work from upgrading the compute, and I will make sure that by the time this pull request is merged, the metrics in this vector config are available from all Neon user projects.

skyzh commented 5 months ago

I think this can also be implemented by telling the autoscaler-agent another port to fetch metrics from (per each VM).

Do you mean that the autoscaler-agent can query LFC data directly by logging in to the database using the Postgres protocol + cloud_admin? In that case, do we need to store that data somewhere inside cplane? I thought the reason we have vector.dev is to keep some history of the data and avoid storing it in the autoscaler-agent itself, but I could be wrong...

skyzh commented 5 months ago

Okay I just re-read https://github.com/neondatabase/autoscaling/blob/main/ARCHITECTURE.md and I think I had some misunderstanding there. So I feel the best approach is:

Scrape the data directly using SQL. The autoscaler-agent will connect to the database using the Postgres protocol and run SQL queries to collect the data.
Find a way to access this data. Because the agent runs outside of the compute pod, it cannot log in as cloud_admin directly. There are three options: assign a temporary password for cloud_admin; modify the IP list so that the agent can connect without a password; or create a proxy inside the pod on some port to delegate SQL queries.

skyzh commented 5 months ago

create a proxy inside the pod on some port to delegate SQL queries

Can we build this into vm-monitor?

sharnoff commented 5 months ago

Having the autoscaler-agent fetch the data via SQL is an interesting idea — hadn't thought of that.

I was thinking more like having it fetch metrics from sql-exporter via prometheus (http). It's already exposed from every VM, so the change would be entirely within the autoscaler-agent.
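A rough sketch of what fetching over HTTP could look like from the consumer's side follows. It is written in Python for brevity and is not the autoscaler-agent's actual code; the function names and the URL placeholder are hypothetical, and the sql-exporter port is deliberately left unspecified because the thread does not state it.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=2.0):
    """Fetch the body of a Prometheus metrics endpoint as text."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def fetch_metric_lines(url, prefix, fetch=fetch_text):
    """Return the sample lines whose metric name starts with `prefix`.

    Returns None if the endpoint cannot be reached, so the caller can tell
    "no data" apart from "no matching metrics". `fetch` is injectable so the
    function can be exercised without a network.
    """
    try:
        body = fetch(url)
    except OSError:  # urllib's URLError and socket timeouts derive from OSError
        return None
    return [
        line
        for line in body.splitlines()
        if line.startswith(prefix) and not line.startswith("#")
    ]


# Usage (hypothetical address; substitute the VM's IP and sql-exporter port):
#   lines = fetch_metric_lines("http://<vm-ip>:<sql-exporter-port>/metrics", "lfc_")
```

Returning None on a failed fetch keeps the decision loop from treating an unreachable exporter as a VM with zero cache activity.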

Omrigan commented 5 months ago

Having the autoscaler-agent fetch the data via SQL is an interesting idea — hadn't thought of that.

That feels like smashing abstraction layers. Does it yield any practical advantage over below?

I was thinking more like having it fetch metrics from sql-exporter via prometheus (http). It's already exposed from every VM, so the change would be entirely within the autoscaler-agent.

This does too, but less so than the above. I now see a practical advantage of bypassing vector (no buffering). Also, @neondatabase/billing folks say vector is unreliable :slightly_smiling_face: so I guess this is my preference.

skyzh commented 5 months ago

I can open a pull request on the compute side to have vm-monitor expose the metrics directly.

skyzh commented 5 months ago

I have a demo pull request ready, but it is not tested yet. Feel free to leave comments before we finalize the code and the method of exposing the metrics. https://github.com/neondatabase/neon/pull/7302

neondatabase / autoscaling

vm-builder: add SQL exporter to vector #878