What does this change?
These alarms (which were first introduced via https://github.com/guardian/mobile-n10n/pull/855) are going to be provisioned via a different repository.
There will be a few small changes to the alarms as part of this migration, most notably:
- We no longer provision these alarms for `CODE`, since alarms for `CODE` generally create noise that doesn't require an action (i.e. we expect to break `CODE` more frequently than `PROD`, since 'real' users do not rely on it!). We could add a `CODE` alarm via `slo-alerts` if the team think it's really useful, but I don't think it's necessary now that we're no longer testing this new approach.
- We now use a calculation for `valid events`, rather than relying on the `RequestCount` metric, as we found that the latter produced some unexpected/undesirable results (most notably, it wasn't incremented in certain 5XX error scenarios!).

See https://github.com/guardian/slo-alerts/pull/19 for more details.
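For anyone unfamiliar with the idea, here is a rough TypeScript sketch of deriving valid events from per-status counts instead of reading a single request counter. The type, field names and the decision to count 4XX responses as valid are assumptions made for illustration only; the actual calculation is defined in the slo-alerts PR linked above and may differ.

```typescript
// Hypothetical per-period response counts, grouped by HTTP status class.
// These names are illustrative and are not taken from slo-alerts.
interface StatusCounts {
  ok2xx: number;
  redirect3xx: number;
  client4xx: number;
  server5xx: number;
}

// Derive valid events by summing every status class, so that 5XX responses
// always contribute to the denominator (assumption: 4XX also counts as valid).
function validEvents(c: StatusCounts): number {
  return c.ok2xx + c.redirect3xx + c.client4xx + c.server5xx;
}

// Fraction of valid events that failed with a 5XX; 0 when there was no traffic.
function errorRate(c: StatusCounts): number {
  const total = validEvents(c);
  return total === 0 ? 0 : c.server5xx / total;
}
```

With a separate request counter that misses some 5XX responses, the denominator would be too small in exactly the periods where errors occur, which is the kind of skew a summed denominator avoids.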
How to test
I've deployed this to `CODE` and confirmed that the only deletions are related to alarms (which is desirable in this case).
N.B. The launch config deletion at 10:51:49 relates to a previous deployment (AMI update).
How can we measure success?
Our (`PROD`) alarm coverage will be the same once this PR and https://github.com/guardian/slo-alerts/pull/19 have been merged, but future updates/improvements (implemented by @guardian/devx-reliability) will be picked up automatically rather than requiring a GuCDK update.
Have we considered potential risks?
There is a small risk that we've made a mistake during the migration, but I have double-checked that the config provisioned via https://github.com/guardian/slo-alerts/pull/19 is correct now that it has been deployed.