Closed alex-dabija closed 1 year ago
Some thoughts during planning:
Obviously our GitOps approach to try and delete expired silence via automated PRs is not working, as I count 19 PRs open for expired silences today and this is a more or less constant number over time.
I have now crafted a new PR to add support silence expiry in silence-operator, so valid-until date will actually be used as real expiry date in alertmanager, see https://github.com/giantswarm/silence-operator/pull/285
@TheoBrigitte Do you think it would help if we added automation that added the PRs to the teams project boards so they discuss them during standup?
@TheoBrigitte Do you think it would help if we added automation that added the PRs to the teams project boards so they discuss them during standup?
Sure that's an improvement, let's see how this works out.
Turns out that's not possible (at least how I implemented it) because it's a bot opening / assigning the PR which doesn't trigger actions. 😞 I'm going to revert the PR adding the action.
I did start this PR 2 weeks ago, in order to actually have the expiry date supported by silence-operator meaning silences will actually expired after a given time. I think having expiry date from silences being actual expiry date would make people react and update their silences. Because the problem now is that you set a fake expiry date, which is mainly informational but never respected as silences are valid forever. Not respecting silences expiry date also has the drawback that some silences could be left abandoned and hide real problems from us.
Do you think it'd be possible to have the delete only performed during working hours? I wouldn't want the silence (incorrectly) expiring during the night and waking someone up when there's nothing for them to do.
Do you think it'd be possible to have the delete only performed during working hours? I wouldn't want the silence (incorrectly) expiring during the night and waking someone up when there's nothing for them to do.
Yes that's possible, that was one of the suggestion there. We just have to choose what is the time we expire silences then.
Do you think it'd be possible to have the delete only performed during working hours?
During whose working hours? :grin: It's gonna be tricky, given that we already have folks from American continents, via Europe, via Nepal, etc, all the way to South Korea.
We still have company business hours as central European because of customer obligations. At least last I heard that was the case.
We still have company business hours as central European because of customer obligations. At least last I heard that was the case.
Yeah, I believe that is the case currently. I meant more how we can do this without leaving out non-European folks that are also doing on-call, regardless of any official business hours.
I guess silences expiring during European business hours would be fine, as long as there are European folks doing business hours on-call, so they can react if needed, and we do not wake up by design on-call people in South/North America or Asia.
We can set the default expiry hour to be 9:00 and default timezone to be Berlin, and allow to customize it via the valid-until
label. Something like:
valid-until: "2023-03-21"
: expires on 2023-03-21 at 09:00 CET > 08:00 UTCvalid-until: "2023-03-21" 09:00
: expires on 2023-03-21 at 09:00 CET > 08:00 UTCvalid-until: "2023-03-21 09:00 MST"
expires on 2023-03-21 at 09:00 MST > 15:00 UTCWDYT ?
Sounds good, yeah, can't think of a better way TBH.
One detail though, since we own silence-operator and Silence
CRD, can we add ValidUntil
field in the Silence
CRD, instead of adding annotation? Is there some specific reason to use the annotation, or that was just an ad-hoc solution that we started using over time as we didn't have ValidUntil
field?
That's an ad-hoc solution we are just living with. But I agree we should make this a concrete field.
I would rather we use a standard for dates (i.e. https://en.wikipedia.org/wiki/ISO_8601) that some custom ones though. I'll get to that
PR is here if someone wants to give it a look https://github.com/giantswarm/silence-operator/pull/285
Released in 0.9.0
Re-opening as I added 2 tasks for validation and alerting.
Both tasks are done, closing
Make silences expire by default and limit their validity period.
Background
We recently discovered that
gorilla / 4muwu
was silenced for more than one year and we no longer know why the cluster was silenced.