Add ability to get alerts from CloudWatch, especially for Redis memory and threads

Service name

Track-a-query (may affect other services too)

Service environment

[ ] Dev / Development
[ ] Staging
[x] Prod / Production
[ ] Other

Impact on the service

Currently we have alerts for pod/job failures, but we have a recurring issue with Redis: it doesn't flush out jobs, so they build up. Past a certain threshold Redis runs out of memory and we start getting failures. If we are alerted before this happens, we can go and clear out the jobs and be better prepared for any issues, so the live service isn't affected.

Problem description

Currently the problems are:

Redis is not flushing out jobs after completion, or at least not all of them, so the queue size increases. This can be seen here: https://grafana.live.cloud-platform.service.justice.gov.uk/d/nK7rpiQZk/aws-elasticache-redis?orgId=1&var-datasource=Cloudwatch&var-region=eu-west-2&var-cacheclusterId=cp-0198cf4c888875ac-002&var-cachenodeid=0001&from=now-1y&to=now

We can't set thresholds with automatic alerting on CloudWatch metrics without some configuration of the AWS exporter, and possibly some other moving parts.
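To make the request concrete, here is a minimal sketch of the kind of threshold we have in mind, written as the parameters for a native CloudWatch alarm rather than an exporter-based rule. This is only an illustration, not a proposal for how the platform should implement it. The cluster and node IDs are taken from the Grafana link above, and `AWS/ElastiCache` / `DatabaseMemoryUsagePercentage` is the standard ElastiCache memory metric. Everything else is an assumption: the function names, the alarm name, the 80% threshold, the 5-minute period and the 3 evaluation periods are all hypothetical.

```python
from typing import Any, Dict, List, Optional


def redis_memory_alarm_params(
    cluster_id: str,
    node_id: str,
    threshold_percent: float = 80.0,  # hypothetical threshold, to be agreed
    alarm_actions: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch's PutMetricAlarm call.

    The alarm fires when average Redis memory usage stays at or above
    threshold_percent for three consecutive 5-minute periods.
    """
    if not 0 < threshold_percent <= 100:
        raise ValueError("threshold_percent must be in (0, 100]")
    return {
        "AlarmName": f"redis-memory-high-{cluster_id}-{node_id}",  # hypothetical name
        "Namespace": "AWS/ElastiCache",
        "MetricName": "DatabaseMemoryUsagePercentage",
        "Dimensions": [
            {"Name": "CacheClusterId", "Value": cluster_id},
            {"Name": "CacheNodeId", "Value": node_id},
        ],
        "Statistic": "Average",
        "Period": 300,             # seconds; assumed
        "EvaluationPeriods": 3,    # assumed
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "missing",
        # e.g. SNS topic ARNs to notify; empty means no notification
        "AlarmActions": alarm_actions or [],
    }


def create_alarm(params: Dict[str, Any], region: str = "eu-west-2") -> None:
    """Create the alarm. Needs boto3 and AWS credentials, so it is not run here."""
    import boto3  # third-party dependency, imported lazily

    boto3.client("cloudwatch", region_name=region).put_metric_alarm(**params)


if __name__ == "__main__":
    # IDs from the Grafana dashboard URL in this issue
    print(redis_memory_alarm_params("cp-0198cf4c888875ac-002", "0001"))
```

Whether this would be done as a CloudWatch alarm like this or as a Prometheus alert on metrics scraped by the exporter is for the platform team to decide; either way the threshold and the notification target would need to be configurable per service.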

Contact person

javid.ali@digital.justice.gov.uk
