neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

add pageserver SLO for startup performance: tenant load & time-to-active #4083

Open problame opened 1 year ago

problame commented 1 year ago

We should have a pageserver-level SLO for the time it takes until all tenants of the pageserver have reached state "Active" or "Broken".

This can be broken down into two metrics:

  1. one gauge metric that is 1 exactly while the tenant loads initiated by tenant::mgr::init are going on
  2. a global histogram that tracks time-to-active
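
As a rough illustration of the two metrics above, here is a minimal in-process sketch in Python. It is not the pageserver's implementation: the class name, method names, and bucket bounds are all hypothetical, and a real version would register the gauge and histogram with the metrics library already in use.

```python
import bisect
import time


class InitialLoadMetrics:
    """Toy stand-in for the two proposed metrics (all names are hypothetical)."""

    # Upper bucket bounds in seconds; chosen for illustration only.
    BUCKETS = (0.1, 0.5, 1.0, 5.0, 30.0, 120.0)

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        # Gauge: 1 exactly while the initial tenant loads are going on.
        self.initial_load_in_progress = 0
        # Histogram: per-bucket (non-cumulative) counts; the last slot is +Inf.
        self.bucket_counts = [0] * (len(self.BUCKETS) + 1)
        self._pending = {}  # tenant id -> time the load was started

    def start_initial_load(self, tenant_ids):
        now = self._clock()
        self._pending = {tenant_id: now for tenant_id in tenant_ids}
        self.initial_load_in_progress = 1 if self._pending else 0

    def tenant_settled(self, tenant_id):
        """Record that a tenant reached "Active" or "Broken"."""
        started = self._pending.pop(tenant_id)
        elapsed = self._clock() - started
        # bisect_left finds the first bucket whose upper bound is >= elapsed.
        self.bucket_counts[bisect.bisect_left(self.BUCKETS, elapsed)] += 1
        if not self._pending:
            self.initial_load_in_progress = 0
        return elapsed
```

The gauge only drops back to 0 once the last pending tenant has settled, so the length of its contiguous 1-period is the pageserver-level time until all tenants are "Active" or "Broken".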

What to do with the metrics:

Related:

vadim2404 commented 1 year ago

Does this metric depend on tenant size or any other thing? Because for SLO, it makes sense to remove the "noise" first.

problame commented 1 year ago

> Does this metric depend on tenant size or any other thing?

Suspected bottlenecks right now:

Regardless, I think we should aspire to something like 5 seconds after restart, all tenants are "Active" or "Broken".

I think this is achievable.

> Because for SLO, it makes sense to remove the "noise" first.

Obviously, we won't add alerts which we know we'll break. We'll add the metric, create a dashboard, measure, understand, fix first.

vadim2404 commented 1 year ago

> Regardless, I think we should aspire to something like 5 seconds after restart, all tenants are "Active" or "Broken".

That sounds relevant.

problame commented 1 year ago

Relevant to what? To your

> Does this metric depend on tenant size or any other thing? Because for SLO, it makes sense to remove the "noise" first.

or generally relevant?

vadim2404 commented 1 year ago

Generally relevant, as a starting point for the SLO.

koivunej commented 1 year ago

This is related to #4025.

koivunej commented 1 year ago

I am eager to see the distribution of these activations; then I can comment more on whether that makes sense as an SLO.

problame commented 1 year ago

Edited the description to include alerting on contiguous 1-time of the gauge.

koivunej commented 1 year ago

It's a really slow CI day and I am eager to test unrelated code in staging. ~Might as well hack these two because~ I already added tracking of the initial load time in e879d6c.

> Also, we can alert on the contiguous 1-time of the gauge not exceeding a threshold

Can this be implemented in promql?
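
On the Prometheus side, alerting rules accept a `for` duration that requires the expression to hold continuously before the alert fires, which is probably the closest built-in match for "contiguous 1-time". To make the intended check concrete, here is a small Python sketch that computes the same thing from scraped samples. The function names and the sample layout are assumptions made for illustration only.

```python
def longest_contiguous_one_time(samples):
    """Return the longest span, in seconds, over which consecutive samples were 1.

    `samples` is an iterable of (timestamp_seconds, value) pairs sorted by time.
    """
    longest = 0.0
    run_start = None
    for ts, value in samples:
        if value == 1:
            if run_start is None:
                run_start = ts  # a new run of 1s begins here
            longest = max(longest, ts - run_start)
        else:
            run_start = None  # the run is broken by a non-1 sample
    return longest


def should_alert(samples, threshold_seconds):
    """Alert when the gauge stayed at 1 for longer than the threshold."""
    return longest_contiguous_one_time(samples) > threshold_seconds
```

The measured span is limited by the scrape interval: a run can only be observed between the first and last sample that saw the gauge at 1.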

Later remembered: tenant activations that happen as a result of creation will be instant, because there is no other load. I, at least, wouldn't want them on the same histogram, because it would then report "some large percentile is very fast" even if init activations on restart took 130s.
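
A quick Python sketch of that skew, using made-up numbers: 990 near-instant creation activations and 10 restart activations at 130s. The sample counts and the nearest-rank percentile helper are assumptions for illustration, not measurements.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * p / 100))
    return ordered[rank - 1]


# Hypothetical samples, in seconds.
creation = [0.01] * 990  # activations caused by tenant creation
restart = [130.0] * 10   # activations caused by the initial load after restart

mixed_p99 = percentile(creation + restart, 99)  # dominated by creation samples
restart_p99 = percentile(restart, 99)           # what the restart path really costs
```

With these inputs the shared histogram's p99 is 0.01s while the restart-only p99 is 130s, so the two kinds of activation would need separate histograms (or a label) for the SLO to be meaningful.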

koivunej commented 1 year ago

There's a fresher duplicate, #4183, and maybe an epic as well.

Fixes are in place (#4399) and are working: at best they give us 1ms per timeline, but at worst much more. Added #4892 so that we can understand how long things take.

Closing this to focus on #4183.

Moved to https://github.com/neondatabase/neon/issues/4025#issuecomment-1665253048.

koivunej commented 1 year ago

Closed wrong issue.

jcsp commented 1 year ago

I went ahead and added the metrics for this in https://github.com/neondatabase/neon/pull/4893