Update CTMS Alerts - Githubissues

data-sync-user commented 2 years ago

There was an incident in production in November where the Acoustic Sync service was no longer processing contacts. This was noticed by [~accountid:5b1b2a8427631840ea300f91] when looking at the size of the database table.

The first attempt at an alert fired several times (11/17, 11/23, 11/24, 11/26, and 11/29). These appear to be due to a slowdown in Acoustic processing, which Mozilla SREs could take no action to speed up. It was also the wrong kind of alert for the November incident, where the issue was that the backlog metric was not being updated, rather than that it was too high.

The goals are:

Monitor the backlog without alerting on-call SREs due to operational issues outside of Mozilla’s control.
Add monitoring as suggested by SREs and managers.
Add monitoring that would more reliably detect the November incident and can be used to automate mitigation.

┆Issue is synchronized with this Jira Story

data-sync-user commented 2 years ago

➤ John Whitlock commented:

Changes so far:

Alan Alexander lengthened the alert period for the Acoustic backlog, so that SREs will not get alerted for the periodic slowdown in Acoustic processing.

https://github.com/mozilla-it/ctms-api/pull/307 included several changes:

New counter ctms_pending_acoustic_sync_total for Benson Wong's request to see the fill rate into the backlog ( https://mozilla.slack.com/archives/C02EZ8U0FMF/p1637685553051000?thread_ts=1637680025.019400&cid=C02EZ8U0FMF )
New counter ctms_background_acoustic_sync_loops to check if the sync process is still running
New gauge ctms_background_acoustic_sync_age_s for Rowan Green's request to see the age of items in the queue ( https://mozilla.slack.com/archives/C02EZ8U0FMF/p1637773312091900?thread_ts=1637762867.068600&cid=C02EZ8U0FMF )
The sync process doesn’t sleep if there is a backlog greater than the batch size, also requested by Rowan Green.
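The no-sleep rule in the last item can be sketched roughly as below. This is only an illustration of the idea, not the code from the pull request: sleep_seconds, run_sync_loop, and their parameters are hypothetical names, and max_loops exists only so the sketch terminates.

```python
import time


def sleep_seconds(backlog_size: int, batch_size: int, loop_delay_s: float) -> float:
    """Return how long the loop should pause before the next batch.

    If more than one batch of work is waiting, skip the pause entirely.
    """
    if backlog_size > batch_size:
        return 0.0
    return loop_delay_s


def run_sync_loop(get_backlog_size, sync_batch, batch_size, loop_delay_s, max_loops):
    """Run a bounded number of sync passes and return how many were completed.

    get_backlog_size and sync_batch are stand-ins (assumptions) for whatever
    the real service uses to count and process pending contacts.
    """
    loops = 0
    while loops < max_loops:
        sync_batch(batch_size)
        loops += 1  # a loop counter like this is what a "loops" metric would report
        delay = sleep_seconds(get_backlog_size(), batch_size, loop_delay_s)
        if delay > 0:
            time.sleep(delay)
    return loops
```

With a backlog of 250 contacts and a batch size of 100, the first pass leaves 150 pending, so the loop goes straight into the next batch instead of sleeping.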

https://github.com/mozilla-it/ctms-api/pull/308 (in review) adds code that can write the current time to a file, and a script that reads the file and checks how long ago that time was. This can be used as a liveness check ( https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/ ) in Kubernetes, which automatically restarts the service if it has stalled. This was suggested ( https://mozilla.slack.com/archives/G01M9NUSUG1/p1638295330202900 ) by Alan Alexander.
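For readers unfamiliar with the pattern, a file-based heartbeat check might look something like the sketch below. It is not the code in the pull request. The function names and the file format (a single Unix timestamp) are assumptions made for illustration; only the two environment variable names come from this issue.

```python
import os
import time
from pathlib import Path


def write_heartbeat(path, now=None):
    """Record a Unix timestamp; the sync loop would call this on every pass."""
    ts = time.time() if now is None else now
    Path(path).write_text(f"{ts}\n")


def heartbeat_age_s(path, now=None):
    """Return the number of seconds since the heartbeat file was last written."""
    ts = float(Path(path).read_text().strip())
    current = time.time() if now is None else now
    return current - ts


def is_healthy(path, max_age_s, now=None):
    """A missing, unreadable, or stale heartbeat all count as unhealthy."""
    try:
        return heartbeat_age_s(path, now) <= max_age_s
    except (OSError, ValueError):
        return False


def main(environ=os.environ):
    """Return a process exit code: 0 if healthy, 1 otherwise."""
    path = environ.get("BACKGROUND_HEALTHCHECK_PATH")
    max_age = environ.get("BACKGROUND_HEALTHCHECK_AGE_S")
    if not path or not max_age:
        return 1
    try:
        return 0 if is_healthy(path, float(max_age)) else 1
    except ValueError:
        return 1
```

A script entry point would pass the result of main() to sys.exit(), since a Kubernetes exec probe treats a non-zero exit code as a failed check. A check like this catches a loop that has stopped making progress, even when the process is still alive.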

Once that is merged, we’ll need to update the deployed configuration:

Add BACKGROUND_HEALTHCHECK_PATH and BACKGROUND_HEALTHCHECK_AGE_S
Add a liveness check that calls python3 /app/ctms/bin/healthcheck_sync.py
(Maybe) Add a startup check to do the same

Alan Alexander is trying different ways to freeze the sync service, to find one that matches the November incident, so we can test these metrics and mitigations.

data-sync-user commented 2 years ago

➤ John Whitlock commented:

Benson Wong also suggested ( https://mozilla.slack.com/archives/G01M9NUSUG1/p1638380946222800 ) that there may be timeouts that can be applied to the database connection. Research is needed to see what client-side timeouts are available for database connections. There are also external API calls to the metrics push gateway and to Acoustic that might have an optional timeout.
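As a starting point for that research, and assuming the database is PostgreSQL accessed through a libpq-based driver such as psycopg2 (an assumption, not something confirmed in this thread), client-side timeouts can be passed as connection parameters. The helper name and the default values below are hypothetical; the parameter names are standard libpq keywords.

```python
def postgres_timeout_args(connect_timeout_s: int, statement_timeout_ms: int) -> dict:
    """Build libpq connection keywords that bound how long the client waits.

    connect_timeout limits connection setup (seconds); statement_timeout is a
    server setting applied per session through the "options" keyword
    (milliseconds); the keepalive settings help detect a dead TCP connection.
    """
    return {
        "connect_timeout": connect_timeout_s,
        "options": f"-c statement_timeout={statement_timeout_ms}",
        "keepalives": 1,
        "keepalives_idle": 30,      # seconds idle before the first probe (illustrative value)
        "keepalives_interval": 10,  # seconds between probes (illustrative value)
        "keepalives_count": 3,      # failed probes before the connection is dropped
    }
```

A dictionary like this could be passed as keyword arguments to psycopg2.connect, or as connect_args to SQLAlchemy's create_engine. Whether these settings would have caught the November incident is something the freeze experiments would need to show.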

data-sync-user commented 2 years ago

➤ John Whitlock commented:

Additional stuff:

Brett Kochendorfer was concerned about repeated log messages like “Skipping CTMS field (email, update_timestamp) because no match in Acoustic”. These were refactored away in https://github.com/mozilla-it/ctms-api/pull/298 and later commits, and will be deployed along with other changes on Dec 6.
Rowan Green would like to see Acoustic and Cinchy response times on the dashboard ( https://mozilla.slack.com/archives/C02EZ8U0FMF/p1637680877026500?thread_ts=1637680025.019400&cid=C02EZ8U0FMF ). I’ve spent some time trying to get our timing data displayed, but I’m still fighting Grafana.
Rowan Green thinks a “max age” threshold, based on how long it takes to sync a contact, would be more useful than the size of the backlog in contacts, and thinks it would make a better alert metric ( https://mozilla.slack.com/archives/C02EZ8U0FMF/p1637780190102600 ).
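To make the “max age” idea concrete, here is a rough sketch of how the age of the oldest pending contact could be computed and compared to a threshold. The function names and the input (a list of timestamps at which contacts entered the queue) are assumptions for illustration, not the service's actual data model.

```python
from datetime import datetime, timedelta, timezone


def oldest_pending_age_s(pending_since, now):
    """Return the age in seconds of the oldest pending item, or 0.0 if none."""
    if not pending_since:
        return 0.0
    return (now - min(pending_since)).total_seconds()


def should_alert(pending_since, max_age_s, now):
    """Alert when the oldest item has waited longer than the threshold."""
    return oldest_pending_age_s(pending_since, now) > max_age_s


# Example: two contacts queued 5 and 90 minutes ago, with a one-hour threshold.
now = datetime(2021, 12, 1, 12, 0, tzinfo=timezone.utc)
queued = [now - timedelta(minutes=5), now - timedelta(minutes=90)]
alert = should_alert(queued, max_age_s=3600, now=now)
```

Unlike a backlog-size threshold, this stays quiet during a large but fast-draining burst, and fires when even a small backlog stops moving.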

I’m focusing on the code changes at the moment, and may delay dashboard changes until after the Dec 6 launch.

mozilla-it / ctms-api

Update CTMS Alerts #316