Alerts for Loki and Mimir rightsizing

Rotfuks commented 4 months ago

Motivation

We already have VPA for Mimir and Loki, but with HPA it's a bit tricky. We can rightsize Mimir and Loki manually, but we need an alert to tell us when we can do it.

Todo

For both Mimir and Loki

[x] We need one alert that gets triggered when Loki (or Mimir) has a lot of resources assigned but is not really consuming them, so we can downsize
[x] We need one alert that gets triggered when we need to roll back our downsizing
- [ ] (But Hervé doesn't really know yet how to do that :)
- [ ] Hervé finally has an idea: if the HPA scales up, downsizing doesn't make sense
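
The two conditions in this list can be sketched as plain predicates. This is only an illustration of the idea, not the alert rules that were eventually released: the names `ComponentState`, `can_downsize` and `should_roll_back`, the 50% under-use threshold, and the sample values are all hypothetical.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ComponentState:
    """Snapshot of one component (e.g. loki-write); every field is illustrative."""
    usage_bytes: float      # average memory actually used per pod
    request_bytes: float    # memory request per pod
    current_replicas: int   # replicas currently run by the HPA
    min_replicas: int       # the HPA's minimum replica count


def can_downsize(state: ComponentState, max_usage_ratio: float = 0.5) -> bool:
    """True when the HPA is at its floor and pods still use little of their request."""
    at_floor = state.current_replicas <= state.min_replicas
    under_used = state.usage_bytes < max_usage_ratio * state.request_bytes
    return at_floor and under_used


def should_roll_back(state: ComponentState) -> bool:
    """True when the HPA has scaled above its floor, so a downsizing needs review."""
    return state.current_replicas > state.min_replicas


# Hypothetical small installation: 2 pods at the HPA floor, using 1 GiB of a 4 GiB request.
idle = ComponentState(usage_bytes=1 * GIB, request_bytes=4 * GIB,
                      current_replicas=2, min_replicas=2)
# Hypothetical grown installation: the HPA has scaled up to 5 pods.
grown = ComponentState(usage_bytes=3.5 * GIB, request_bytes=4 * GIB,
                       current_replicas=5, min_replicas=2)

print(can_downsize(idle), should_roll_back(idle))
print(can_downsize(grown), should_roll_back(grown))
```

Gating the downsize check on the replica count keeps the two alerts from firing at the same time: as soon as the HPA scales up, the component stops being a downsizing candidate.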

Outcome

We get alerted when we can save some resources (or have to roll that back).

QuantumEnigmaa commented 4 months ago

@hervenicol What's your opinion on this? Since all Loki components have an assigned HPA, does it make sense to be alerted when the assigned resources are not actually used, the same way we do it for Mimir? I mean, the HPAs should scale the components down at some point, right (after having flushed all the data to object storage)?

hervenicol commented 4 months ago

HPAs will only scale down to minpods. At that point, taking loki-write RAM as an example, the minimal size is 2 pods with 4GB requests and an 8GB limit. On some installations, that may still be more than needed.

So, maybe we want an alert that tells us when those pods are underused.

The part I'm not confident about is how to know when we should revert these changes. Say we reduced requests/limits to 1GB/2GB. Then the installation grows and the HPA adds new pods: that's expected. But we need to review the requests/limits at some point. If we do it when usage is over 90% (i.e. the HPA's scale-up threshold), it will only happen once the HPA is maxed out at maxpods (25 for loki-write, as far as I can see on golem :astonished: ). Or can we have a better alert condition?
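
To make the 90% concern concrete, here is a rough sketch of where the HPA settles for a given total memory usage. It assumes the standard Kubernetes rule `desired = ceil(current * utilization / target)`, with utilization measured against requests, and ignores the HPA's tolerance and stabilization windows. The function name and the usage figures are hypothetical; the 1GB requests, the floor of 2 pods and the maxpods of 25 come from the discussion above.

```python
import math


def settled_state(total_usage_gb: float, request_gb: float, target: float = 0.9,
                  min_replicas: int = 2, max_replicas: int = 25) -> tuple[int, float]:
    """Return (replicas, utilization) once the HPA stops scaling.

    Since utilization = total_usage / (replicas * request), the HPA rule
    ceil(replicas * utilization / target) reduces to ceil(total_usage / (request * target)).
    """
    wanted = math.ceil(total_usage_gb / (request_gb * target))
    replicas = max(min_replicas, min(max_replicas, wanted))
    utilization = total_usage_gb / (replicas * request_gb)
    return replicas, utilization


# Hypothetical total memory usage (GB) across all loki-write pods, with 1GB requests.
for total in (1, 5, 20, 24):
    replicas, utilization = settled_state(total, request_gb=1)
    print(f"total={total}GB -> {replicas} pods at {utilization:.0%} of requests")
```

Under these assumptions, average utilization only stays above the 90% target in the last case, where the replica count is clamped at 25. An alert keyed on the replica count rising above minpods would fire much earlier.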

I'm not very comfortable with this issue, as HPAs are supposed to do the job. This issue tries to improve a case where we have small installations that don't use all of Loki's reserved resources. Maybe we should start by checking the current situation: do we have installations where that's the case, and how critical is it?

QuantumEnigmaa commented 4 months ago

It could be interesting to discuss this during refinement :)

QuantumEnigmaa commented 3 months ago

Both alerts and the corresponding ops recipe have been created and released.
