add pageserver SLO for startup performance: tenant load & time-to-active

neondatabase / neon

Open problame opened 1 year ago

problame commented 1 year ago

We should have a pageserver-level SLO for the time it takes until all tenants of the pageserver have reached state "Active" or "Broken".

This can be broken down into two metrics:

- one gauge metric that is 1 exactly while the tenant loads initiated by tenant::mgr::init are in progress
- a global histogram that tracks time-to-active
  - we already log this in https://github.com/neondatabase/neon/pull/4080 , just need to add the histogram
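
A minimal sketch of the two metrics, to make the proposal concrete. It uses only the Rust standard library rather than the pageserver's actual metrics setup; the type names `InitialLoadGauge` and `TimeToActiveHistogram` and the bucket bounds are hypothetical, not taken from the neon codebase.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

/// Hypothetical stand-in for a gauge metric: 1 while the initial tenant
/// loads are in progress, 0 otherwise.
#[derive(Default)]
pub struct InitialLoadGauge(AtomicU64);

impl InitialLoadGauge {
    pub fn set_in_progress(&self, in_progress: bool) {
        // `true as u64` is 1, `false as u64` is 0.
        self.0.store(in_progress as u64, Ordering::Relaxed);
    }

    pub fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Hypothetical stand-in for a global time-to-active histogram with
/// cumulative buckets: each bucket counts observations <= its upper bound.
pub struct TimeToActiveHistogram {
    upper_bounds_secs: Vec<f64>,
    bucket_counts: Vec<AtomicU64>,
    total: AtomicU64,
}

impl TimeToActiveHistogram {
    pub fn new(upper_bounds_secs: Vec<f64>) -> Self {
        let bucket_counts = upper_bounds_secs.iter().map(|_| AtomicU64::new(0)).collect();
        Self { upper_bounds_secs, bucket_counts, total: AtomicU64::new(0) }
    }

    /// Record how long one tenant took to become active.
    pub fn observe(&self, time_to_active: Duration) {
        let secs = time_to_active.as_secs_f64();
        for (bound, count) in self.upper_bounds_secs.iter().zip(&self.bucket_counts) {
            if secs <= *bound {
                count.fetch_add(1, Ordering::Relaxed);
            }
        }
        self.total.fetch_add(1, Ordering::Relaxed);
    }

    /// Cumulative count for an exact bucket bound, or None if no such bucket exists.
    pub fn count_le(&self, bound_secs: f64) -> Option<u64> {
        self.upper_bounds_secs
            .iter()
            .position(|b| *b == bound_secs)
            .map(|i| self.bucket_counts[i].load(Ordering::Relaxed))
    }

    pub fn total(&self) -> u64 {
        self.total.load(Ordering::Relaxed)
    }
}
```

In this sketch, startup code would call `set_in_progress(true)` before the initial loads begin, call `observe` once per tenant as it becomes active, and call `set_in_progress(false)` once the last initial load has finished.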

What to do with the metrics:

- We can then multiply the histogram with the gauge and alert on outliers.
- We can also alert when the gauge stays at 1 contiguously for longer than a threshold.
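
The second rule (alerting on how long the gauge stays at 1) can be sketched as a function over scraped samples. This is only an illustration of the logic, assuming samples arrive sorted by timestamp; in practice the rule would more likely live in the alerting system. The names `Sample`, `longest_contiguous_one_secs`, and `should_alert` are hypothetical.

```rust
/// One scrape of the gauge: (timestamp in seconds, gauge value).
pub type Sample = (u64, u64);

/// Longest contiguous stretch, in seconds, during which the gauge read 1.
/// Assumes `samples` is sorted by timestamp.
pub fn longest_contiguous_one_secs(samples: &[Sample]) -> u64 {
    let mut longest: u64 = 0;
    // Timestamp of the first sample in the current run of 1s, if any.
    let mut run_start: Option<u64> = None;
    for &(ts, value) in samples {
        if value == 1 {
            let start = *run_start.get_or_insert(ts);
            longest = longest.max(ts - start);
        } else {
            run_start = None;
        }
    }
    longest
}

/// Alert if the initial load phase lasted longer than the threshold.
pub fn should_alert(samples: &[Sample], threshold_secs: u64) -> bool {
    longest_contiguous_one_secs(samples) > threshold_secs
}
```

Note that the measured run length is bounded by the scrape interval: a run observed at only one sample counts as 0 seconds here.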

vadim2404 commented 1 year ago

Does this metric depend on tenant size or any other thing? Because for SLO, it makes sense to remove the "noise" first.

problame commented 1 year ago

> Does this metric depend on tenant size or any other thing?

Suspected bottlenecks right now:

- get remote index_part.json's
  - network latency is dominant here
  - concurrency limiter says hello as well
- building the layer maps
  - this is CPU-bound

Regardless, I think we should aspire to something like: 5 seconds after restart, all tenants are "Active" or "Broken".
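
The target above can be written as a simple predicate. This is a sketch under assumptions: `TenantState` here is a simplified, hypothetical enum (the real pageserver state type may have more variants), and `slo_violated` is an illustrative helper, not an existing function.

```rust
use std::time::Duration;

/// Simplified, hypothetical tenant state for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TenantState {
    Loading,
    Active,
    Broken,
}

/// True once every tenant has reached "Active" or "Broken".
pub fn all_settled(states: &[TenantState]) -> bool {
    states
        .iter()
        .all(|s| matches!(s, TenantState::Active | TenantState::Broken))
}

/// The SLO is violated if the target time since restart has passed
/// and at least one tenant is still neither "Active" nor "Broken".
pub fn slo_violated(states: &[TenantState], since_restart: Duration, target: Duration) -> bool {
    since_restart > target && !all_settled(states)
}
```

With a 5 second target, a pageserver that still has a tenant loading 6 seconds after restart would count as a violation, while the same state at 3 seconds would not.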

I think this is achievable.

> Because for SLO, it makes sense to remove the "noise" first.

Obviously, we won't add alerts that we know will fire. First we'll add the metric, create a dashboard, measure, understand, and fix.

vadim2404 commented 1 year ago

> Regardless, I think we should aspire to something like 5 seconds after restart, all tenants are "Active" or "Broken".

It sounds relevant

problame commented 1 year ago

Relevant to what? To your

> Does this metric depend on tenant size or any other thing? Because for SLO, it makes sense to remove the "noise" first.

or generally relevant?

vadim2404 commented 1 year ago

Generally relevant, as a starting point for the SLO.