Open problame opened 1 year ago
Does this metric depend on tenant size or any other thing? Because for SLO, it makes sense to remove the "noise" first.
> Does this metric depend on tenant size or any other thing?

Suspected bottlenecks right now:
Regardless, I think we should aspire to something like 5 seconds after restart, all tenants are "Active" or "Broken".
I think this is achievable.
> Because for SLO, it makes sense to remove the "noise" first.

Obviously, we won't add alerts which we know we'll break. We'll add the metric, create a dashboard, measure, understand, fix first.
> Regardless, I think we should aspire to something like 5 seconds after restart, all tenants are "Active" or "Broken".

It sounds relevant
Relevant to what? To your

> Does this metric depend on tenant size or any other thing? Because for SLO, it makes sense to remove the "noise" first.

or generally relevant?
generally, to start with it (about the SLO)
This is related to #4025.
I am eager to see the distribution of these activations; then I can comment more on whether that makes sense as an SLO.
Edited the description to include alerting on the contiguous `1`-time of the gauge.
It's a really slow CI day and I am eager to test unrelated code in staging. ~Might as well hack these two because~ I created the initial load time watching already in e879d6c.
Also, we can alert on the contiguous 1-time of the gauge not exceeding a threshold
Can this be implemented in promql?
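To make the "contiguous `1`-time" idea concrete, here is a minimal Python sketch of the computation over scraped gauge samples. It is not PromQL and does not answer whether the query is expressible there; it only illustrates what would be measured. The function names, the sample layout `(timestamp_seconds, value)`, and the 5-second threshold in the usage lines are assumptions for illustration, not part of the pageserver code.

```python
from typing import Iterable, Tuple


def longest_contiguous_one_time(samples: Iterable[Tuple[float, float]]) -> float:
    """Return the longest span (in seconds) over which the gauge stayed at 1.

    `samples` is assumed to be sorted by timestamp. A span is measured from
    the first sample that reads 1 to the last consecutive sample that reads 1.
    """
    longest = 0.0
    run_start = None  # timestamp at which the current run of 1s began
    for ts, value in samples:
        if value == 1:
            if run_start is None:
                run_start = ts
            longest = max(longest, ts - run_start)
        else:
            run_start = None  # the run is broken by any non-1 sample
    return longest


def should_alert(samples: Iterable[Tuple[float, float]], threshold_seconds: float) -> bool:
    """Hypothetical alert condition: the gauge sat at 1 for longer than the threshold."""
    return longest_contiguous_one_time(samples) > threshold_seconds


# Hypothetical scrape every 15 seconds.
example = [(0, 1), (15, 1), (30, 1), (45, 0), (60, 1), (75, 0)]
longest = longest_contiguous_one_time(example)
alert = should_alert(example, threshold_seconds=5)
```

In Prometheus itself, the `for` clause of an alerting rule expresses a similar "condition held continuously for a duration" check, which may be enough if only the alert is needed and not the measured duration.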
Later remembered: tenant activations that happen as a result of creation will be instant, because there is no other load. I at least wouldn't want them on the same histogram, because then it will say "some large percentile is very fast" even if initial activations on restart take 130s.
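The concern about sharing one histogram can be checked with a little arithmetic. The sketch below uses made-up counts (10 restart activations at the 130s mentioned above, 990 creation activations at 10ms) and a simple nearest-rank percentile; all names and sample counts are assumptions for illustration only.

```python
import math
from typing import List


def percentile(values: List[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical data: a few slow activations after restart, many instant ones from creation.
restart_activations = [130.0] * 10
creation_activations = [0.01] * 990

mixed = restart_activations + creation_activations

p95_mixed = percentile(mixed, 95)                  # dominated by the creation activations
p99_mixed = percentile(mixed, 99)                  # still dominated by them
p95_restart = percentile(restart_activations, 95)  # what restart actually looks like

print(f"mixed p95={p95_mixed}s p99={p99_mixed}s, restart-only p95={p95_restart}s")
```

With these numbers even the 99th percentile of the mixed data reads 10ms, while the restart-only data sits at 130s, which is why keeping the two kinds of activation on separate histograms (or separate labels) seems preferable.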
There's a fresher duplicate, #4183, and maybe an epic as well.
Fixes are in place (#4399) and are working, at best giving us 1ms per timeline, but at worst much more. Added #4892 so we can understand how long things take.
Closing this to focus on #4183.
Moved to https://github.com/neondatabase/neon/issues/4025#issuecomment-1665253048.
Closed wrong issue.
I went ahead and added the metrics for this in https://github.com/neondatabase/neon/pull/4893
We should have a pageserver-level SLO for the time it takes until all tenants of the pageserver have reached state "Active" or "Broken".
This can be broken down into two metrics:
What to do with the metrics:
Also, we can alert on the contiguous `1`-time of the gauge not exceeding a threshold.

Related: