giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273
Apache License 2.0
3 stars 0 forks source link

Make silences expire by default and limit their validity period #739

Closed alex-dabija closed 1 year ago

alex-dabija commented 2 years ago

Make silences expire by default and limit their validity period.

Background

We recently discovered that gorilla / 4muwu was silenced for more than one year and we no longer know why the cluster was silenced.

### Tasks
- [ ] https://github.com/giantswarm/silence-operator/pull/285
- [ ] https://github.com/giantswarm/roadmap/issues/2496
- [ ] https://github.com/giantswarm/roadmap/issues/2497
AverageMarcus commented 2 years ago

Some thoughts during planning:

TheoBrigitte commented 1 year ago

Obviously our GitOps approach to try and delete expired silence via automated PRs is not working, as I count 19 PRs open for expired silences today and this is a more or less constant number over time.

image

I have now crafted a new PR to add support silence expiry in silence-operator, so valid-until date will actually be used as real expiry date in alertmanager, see https://github.com/giantswarm/silence-operator/pull/285

AverageMarcus commented 1 year ago

@TheoBrigitte Do you think it would help if we added automation that added the PRs to the teams project boards so they discuss them during standup?

PR: https://github.com/giantswarm/silences/pull/885

TheoBrigitte commented 1 year ago

@TheoBrigitte Do you think it would help if we added automation that added the PRs to the teams project boards so they discuss them during standup?

Sure that's an improvement, let's see how this works out.

AverageMarcus commented 1 year ago

Turns out that's not possible (at least how I implemented it) because it's a bot opening / assigning the PR which doesn't trigger actions. 😞 I'm going to revert the PR adding the action.

TheoBrigitte commented 1 year ago

I did start this PR 2 weeks ago, in order to actually have the expiry date supported by silence-operator meaning silences will actually expired after a given time. I think having expiry date from silences being actual expiry date would make people react and update their silences. Because the problem now is that you set a fake expiry date, which is mainly informational but never respected as silences are valid forever. Not respecting silences expiry date also has the drawback that some silences could be left abandoned and hide real problems from us.

AverageMarcus commented 1 year ago

Do you think it'd be possible to have the delete only performed during working hours? I wouldn't want the silence (incorrectly) expiring during the night and waking someone up when there's nothing for them to do.

TheoBrigitte commented 1 year ago

Do you think it'd be possible to have the delete only performed during working hours? I wouldn't want the silence (incorrectly) expiring during the night and waking someone up when there's nothing for them to do.

Yes that's possible, that was one of the suggestion there. We just have to choose what is the time we expire silences then.

nprokopic commented 1 year ago

Do you think it'd be possible to have the delete only performed during working hours?

During whose working hours? :grin: It's gonna be tricky, given that we already have folks from American continents, via Europe, via Nepal, etc, all the way to South Korea.

AverageMarcus commented 1 year ago

We still have company business hours as central European because of customer obligations. At least last I heard that was the case.

nprokopic commented 1 year ago

We still have company business hours as central European because of customer obligations. At least last I heard that was the case.

Yeah, I believe that is the case currently. I meant more how we can do this without leaving out non-European folks that are also doing on-call, regardless of any official business hours.

I guess silences expiring during European business hours would be fine, as long as there are European folks doing business hours on-call, so they can react if needed, and we do not wake up by design on-call people in South/North America or Asia.

TheoBrigitte commented 1 year ago

We can set the default expiry hour to be 9:00 and default timezone to be Berlin, and allow to customize it via the valid-until label. Something like:

WDYT ?

nprokopic commented 1 year ago

Sounds good, yeah, can't think of a better way TBH.

One detail though, since we own silence-operator and Silence CRD, can we add ValidUntil field in the Silence CRD, instead of adding annotation? Is there some specific reason to use the annotation, or that was just an ad-hoc solution that we started using over time as we didn't have ValidUntil field?

TheoBrigitte commented 1 year ago

That's an ad-hoc solution we are just living with. But I agree we should make this a concrete field.

QuentinBisson commented 1 year ago

I would rather we use a standard for dates (i.e. https://en.wikipedia.org/wiki/ISO_8601) that some custom ones though. I'll get to that

QuentinBisson commented 1 year ago

PR is here if someone wants to give it a look https://github.com/giantswarm/silence-operator/pull/285

QuentinBisson commented 1 year ago

Released in 0.9.0

TheoBrigitte commented 1 year ago

Re-opening as I added 2 tasks for validation and alerting.

QuentinBisson commented 1 year ago

Both tasks are done, closing