giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273
Apache License 2.0

Alert customer when customer-managed volume is full #1915

Open brinker211 opened 1 year ago

brinker211 commented 1 year ago

Is your feature request related to a problem? Please describe.

As a Giant Swarm user, I'd like to be alerted when volumes are full, including volumes other than the core ones (root, kubelet, docker).

Describe the solution you'd like

In the best case, filling up a customer-managed volume should send a Slack message to the customer's channel but not page the Giant Swarm on-call team.
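As a rough illustration of the requested behaviour, here is a minimal Python sketch. It assumes usage figures come from the kubelet's per-PVC metrics `kubelet_volume_stats_used_bytes` and `kubelet_volume_stats_capacity_bytes`; the 90% threshold, the `owner` field, and the receiver names `customer-slack` and `giantswarm-oncall` are hypothetical placeholders, not an existing Giant Swarm configuration.

```python
from dataclasses import dataclass

# Hypothetical threshold: report a volume once it is at least 90% full.
FULL_THRESHOLD = 0.90


@dataclass
class VolumeStats:
    namespace: str
    pvc: str
    used_bytes: int      # e.g. from kubelet_volume_stats_used_bytes
    capacity_bytes: int  # e.g. from kubelet_volume_stats_capacity_bytes
    owner: str           # hypothetical label: "customer" or "giantswarm"


def is_full(v: VolumeStats, threshold: float = FULL_THRESHOLD) -> bool:
    """Return True when the used fraction of the volume reaches the threshold."""
    if v.capacity_bytes <= 0:
        return False  # unknown or zero capacity: nothing meaningful to report
    return v.used_bytes / v.capacity_bytes >= threshold


def route(v: VolumeStats) -> str:
    """Pick a (hypothetical) receiver: customer-managed volumes go to the
    customer's Slack channel and never page the Giant Swarm on-call team."""
    if v.owner == "customer":
        return "customer-slack"
    return "giantswarm-oncall"


def alerts(volumes: list[VolumeStats]) -> list[tuple[str, str]]:
    """Return (receiver, message) pairs for every volume that is full."""
    out = []
    for v in volumes:
        if is_full(v):
            pct = 100 * v.used_bytes / v.capacity_bytes
            out.append((route(v), f"{v.namespace}/{v.pvc} is {pct:.0f}% full"))
    return out


if __name__ == "__main__":
    sample = [
        VolumeStats("shop", "db-data", 95, 100, "customer"),
        VolumeStats("kube-system", "etcd", 50, 100, "giantswarm"),
    ]
    for receiver, message in alerts(sample):
        print(receiver, message)  # prints: customer-slack shop/db-data is 95% full
```

Keeping detection (`is_full`) separate from routing (`route`) means the same check could serve both Giant Swarm and customer volumes, with only the destination depending on ownership.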

Describe alternatives you've considered

Run a garbage-collection cleanup job on all volumes, similar to what we already do for Docker. However, this would only relieve the pressure; it would not fix the root cause, i.e. whatever is filling up the disk.

Additional context

Internal Request Reference

TheoBrigitte commented 1 year ago

I am unsure how we should handle this topic. I see that customers need to detect when the volumes they manage are full; I assume we are talking about workload cluster (WC) disks here.

So far we have not been monitoring customer workloads, and maybe we should.

Also, when we start sending customer-specific alerts only to the customer's Slack channel, who is in charge of maintaining those alerts? Or can we reuse existing alerts and dynamically route them either to the Giant Swarm on-call team or to the customer?

Also, would this be enabled by default or be an opt-in feature?

Where do we send alerts? Can we automatically create a dedicated Slack alert channel for each customer?

How many customers are we talking about here? Could we have a simpler solution that forwards some alerts to their Slack?

cc/ @giantswarm/team-horizon

teemow commented 1 year ago

I'd also say that a solution specific to volumes doesn't make sense. We should rather think about a general set of alerts for our customers' platform or developer teams. I am not sure how we can achieve this, but it would be good if we provided a list of alerting rules that customers can adapt and extend without losing the ability to upgrade and benefit from further improvements by us. The alert routing also needs to be configurable through our Management API (MAPI).
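One way to read this proposal is as a layered rule set: Giant Swarm ships default alerting rules keyed by name, customers supply overrides or additions, and the effective set is the merge of the two. An upgrade that changes the defaults then still reaches every rule, and every field, the customer did not override. The minimal Python sketch below illustrates that merge plus a configurable severity-to-receiver map; the rule names, expressions, the `page`/`notify` severities, and the receiver names are all hypothetical and do not describe an existing Management API resource.

```python
from copy import deepcopy

# Hypothetical defaults shipped (and upgraded) by Giant Swarm, keyed by rule name.
DEFAULT_RULES = {
    "VolumeAlmostFull": {"expr": "used_ratio > 0.90", "severity": "notify"},
    "NodeNotReady": {"expr": "node_ready == 0", "severity": "page"},
}


def effective_rules(defaults: dict, overrides: dict) -> dict:
    """Merge customer overrides onto the defaults, field by field.

    - A rule present only in `defaults` is kept as-is, so upgrades reach it.
    - A rule present in both keeps any default fields the customer did not set.
    - A rule present only in `overrides` is a customer addition.
    - An override value of None disables the default rule.
    """
    merged = deepcopy(defaults)  # never mutate the shipped defaults
    for name, fields in overrides.items():
        if fields is None:
            merged.pop(name, None)
        else:
            merged.setdefault(name, {}).update(fields)
    return merged


def receiver_for(severity: str, routing: dict) -> str:
    """Look up the receiver for a severity, falling back to routing["default"]."""
    return routing.get(severity, routing["default"])


if __name__ == "__main__":
    customer_overrides = {
        "VolumeAlmostFull": {"expr": "used_ratio > 0.80"},  # stricter threshold
        "QueueBacklog": {"expr": "queue_depth > 1000", "severity": "notify"},
    }
    routing = {
        "page": "giantswarm-oncall",
        "notify": "customer-slack",
        "default": "customer-slack",
    }
    rules = effective_rules(DEFAULT_RULES, customer_overrides)
    for name, rule in sorted(rules.items()):
        print(name, rule["expr"], "->", receiver_for(rule["severity"], routing))
```

Because overrides are applied per field rather than per rule, a customer who only tightens a threshold keeps receiving later changes to the rule's other fields.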

pipo02mix commented 1 year ago

Yeah, as I envision it, we serve a monitoring solution (Prometheus in our case) and customers can deploy it with the right configuration to enable volume monitoring and alerts. In that case, we can enable such a use case, or document how it can be done using our solution. WDYT?

JosephSalisbury commented 1 year ago

@pipo02mix I'd go a bit further with the developer platform idea. I wouldn't want customers to have to deploy some chart; I feel it should be more "out of the box" / integrated.

teemow commented 1 year ago

@pipo02mix this was not what I was talking about. I was talking about the management cluster (MC) Prometheus setup. In general, we want to open up the central monitoring solution to our customers.

@brinker211 is it urgent for a customer to have monitoring of a specific volume, or is this a general feature request? (It's on the roadmap and not on the customer board, so it looks like more of a general idea, but I am not sure.)

teemow commented 1 year ago

@pipo02mix what @JosephSalisbury said :grin:

brinker211 commented 1 year ago

No, this is not urgent; it is more of a general request, as you guessed. It has also been improved by some of the other monitoring stories, such as opening up the queries to customers. Treating this as a general use case for giving customers more details or a heads-up is a good way forward.

teemow commented 1 year ago

@TheoBrigitte can you rewrite the story so that we provide customers with a way to configure alert routing via the Management API? Or do you already have a story like this?

TheoBrigitte commented 1 year ago

Tracking this in https://github.com/giantswarm/roadmap/issues/2213