data-sync-user closed 2 years ago
➤ John Whitlock commented:
Changes so far:
Alan Alexander lengthened the alert period for the Acoustic backlog, so that SREs will not get alerted for the periodic slowdown in Acoustic processing.
https://github.com/mozilla-it/ctms-api/pull/307 included several changes:
https://github.com/mozilla-it/ctms-api/pull/308 (in review) adds code that can write the current time to a file, and a script that reads the file and checks how long ago it was written. This can be used as a liveness check ( https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/ ) in Kubernetes, so that the service is restarted automatically if it pauses. This was suggested ( https://mozilla.slack.com/archives/G01M9NUSUG1/p1638295330202900 ) by Alan Alexander.
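For illustration, the heartbeat-file idea could look roughly like the sketch below. This is not the code from the PR; the function names, the file format (a single Unix timestamp), and the exit-code convention are assumptions made for the example.

```python
import time
from pathlib import Path
from typing import Optional


def write_heartbeat(path: Path, now: Optional[float] = None) -> None:
    """Record the current Unix time in the heartbeat file (hypothetical format)."""
    timestamp = time.time() if now is None else now
    path.write_text(f"{timestamp:.3f}\n")


def heartbeat_age(path: Path, now: Optional[float] = None) -> float:
    """Return seconds since the last heartbeat, or infinity if it cannot be read."""
    current = time.time() if now is None else now
    try:
        return current - float(path.read_text().strip())
    except (OSError, ValueError):
        # A missing or corrupt file is treated the same as a very old heartbeat.
        return float("inf")


def probe_exit_code(path: Path, max_age_seconds: float, now: Optional[float] = None) -> int:
    """Return 0 when the heartbeat is fresh enough, 1 otherwise."""
    return 0 if heartbeat_age(path, now) <= max_age_seconds else 1
```

The sync loop would call `write_heartbeat` once per iteration, and a small script run by a Kubernetes `exec` liveness probe would exit with `probe_exit_code(...)`, since a non-zero exit marks the probe as failed. The maximum age would need to be longer than the slowest healthy loop iteration.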
Once that is merged, we’ll need to update the deployed configuration:
Alan Alexander is trying different ways to freeze the sync service, to find one that matches the November incident, so that we can test these metrics and mitigations against it.
➤ John Whitlock commented:
Benson Wong also suggested ( https://mozilla.slack.com/archives/G01M9NUSUG1/p1638380946222800 ) that there may be timeouts that can be applied to the database connection. Research is needed to see what client-side timeouts are available for database connections. There are also external API calls to the metrics push gateway and to Acoustic that might have an optional timeout.
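As a starting point for that research, here is a minimal sketch of two kinds of timeouts. It assumes the database is PostgreSQL reached through a libpq-based driver such as psycopg2, which this thread does not confirm. The helper names are hypothetical. `connect_timeout` is a libpq connection parameter, and `statement_timeout` is a server setting requested per session through `options`, so it is not strictly client-side.

```python
import urllib.request


def postgres_timeout_kwargs(connect_seconds: int, statement_ms: int) -> dict:
    """Build libpq connection keywords that bound connect and statement time.

    Assumption: PostgreSQL via a libpq-based driver, for example
    psycopg2.connect(dsn, **postgres_timeout_kwargs(5, 30000)).
    """
    return {
        # Seconds to wait while establishing the connection.
        "connect_timeout": connect_seconds,
        # Ask the server to cancel any statement that runs longer than this.
        "options": f"-c statement_timeout={statement_ms}",
    }


def http_get_with_timeout(url: str, timeout_seconds: float) -> bytes:
    """Fetch a URL, raising an exception instead of blocking indefinitely."""
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        return response.read()
```

The HTTP helper uses only the standard library to show the pattern. The actual clients for the metrics push gateway and Acoustic may expose their own timeout options, which would need to be checked separately.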
➤ John Whitlock commented:
Additional stuff:
I’m focusing on the code changes at the moment, and may delay dashboard changes until after the Dec 6 launch.
There was an incident in production in November where the Acoustic Sync service stopped processing contacts. This was noticed by [~accountid:5b1b2a8427631840ea300f91] when looking at the size of the database table.
The first attempt at an alert fired several times (11/17, 11/23, 11/24, 11/26, and 11/29). These firings appear to be due to a slowdown in Acoustic processing, which Mozilla SREs could do nothing to speed up. It was also the wrong kind of alert for the November incident, where the issue was that the backlog metric was no longer being updated, rather than that it was too high.
The goal is:
Monitor the backlog without alerting on-call SREs due to operational issues outside of Mozilla’s control
Add monitoring as suggested by SREs and managers
Add monitoring that would more reliably detect the November incident and can be used to automate mitigation.
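The difference between the two failure modes, a backlog that is too high versus a backlog metric that has stopped updating, can be sketched as follows. The type, the function, and the thresholds are hypothetical and only illustrate why the conditions need separate checks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BacklogSample:
    size: int          # number of contacts waiting to be synced
    updated_at: float  # Unix time the metric was last written


def classify_backlog(
    sample: Optional[BacklogSample],
    now: float,
    max_staleness_seconds: float,
    max_backlog: int,
) -> str:
    """Return "stale", "backlog_high", or "ok" for one metric sample."""
    # Check staleness first: a frozen service can report a small, old backlog.
    if sample is None or now - sample.updated_at > max_staleness_seconds:
        return "stale"
    if sample.size > max_backlog:
        return "backlog_high"
    return "ok"
```

Under these assumptions, a `stale` result matches the November incident and is a candidate for paging or an automated restart. A `backlog_high` result may only reflect a slowdown on the Acoustic side and could go to a dashboard or a longer alert window.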
┆Issue is synchronized with this Jira Story