ministryofjustice / cloud-platform

Documentation on the MoJ cloud platform
MIT License

Add ability to get alerts from CloudWatch, especially for Redis memory and threads #4341

Closed: digitalali-moj closed this 7 months ago

digitalali-moj commented 1 year ago

Service name

Track-a-query (May affect other services too)

Service environment

Impact on the service

Currently we have alerts for pod/job failures, but we have a recurring issue with Redis: it doesn't flush out jobs, so they build up. Past a certain threshold we run out of memory on Redis and then we get failures. If we were alerted before this happens, we could go and clear out the jobs and be better prepared for any issues, so the live service isn't affected.

Problem description

Currently the problems are:

Redis is not flushing out jobs after completion, or at least not all of them, so the queue size increases. This can be seen here: https://grafana.live.cloud-platform.service.justice.gov.uk/d/nK7rpiQZk/aws-elasticache-redis?orgId=1&var-datasource=Cloudwatch&var-region=eu-west-2&var-cacheclusterId=cp-0198cf4c888875ac-002&var-cachenodeid=0001&from=now-1y&to=now

We can't set monitoring thresholds with automatic alerting on CloudWatch metrics without some configuration of the AWS exporter and possibly some other moving parts.

Contact person

javid.ali@digital.justice.gov.uk

poornima-krishnasamy commented 7 months ago

Hi @digitalali-moj, you can achieve this by configuring CloudWatch alarms. One of the teams recently implemented CloudWatch alarms, which can be found here: https://github.com/ministryofjustice/cloud-platform-environments/blob/main/namespaces/live.cloud-platform.service.justice.gov.uk/hmpps-integration-api-dev/resources/api_gateway.tf#L275-L357

We haven't completed the user guide on how to configure them yet, but this code reference should give you an idea of how to configure them for your service's namespaces.
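For the Redis memory case in this issue, a CloudWatch alarm could look roughly like the Terraform sketch below. It is not taken from the linked file. The alarm name, the 80% threshold, the evaluation window and the `alarm_sns_topic_arn` variable are illustrative assumptions, and the cluster ID is the one from the Grafana link above. It also assumes the `DatabaseMemoryUsagePercentage` metric is published for the cluster with only the `CacheClusterId` dimension.

```hcl
# Hypothetical input: ARN of an SNS topic that forwards alarm notifications
# (for example to Slack or email). Assumed to be created elsewhere.
variable "alarm_sns_topic_arn" {
  type        = string
  description = "SNS topic that receives alarm notifications"
}

# Fires when average Redis memory usage stays above the threshold
# for three consecutive 5-minute periods.
resource "aws_cloudwatch_metric_alarm" "redis_memory_high" {
  alarm_name          = "track-a-query-redis-memory-high" # illustrative name
  alarm_description   = "Redis memory usage above 80% for 15 minutes"
  namespace           = "AWS/ElastiCache"
  metric_name         = "DatabaseMemoryUsagePercentage"
  statistic           = "Average"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 80  # percent; assumed value, tune per service
  period              = 300 # seconds
  evaluation_periods  = 3

  dimensions = {
    CacheClusterId = "cp-0198cf4c888875ac-002" # cluster from the Grafana link
  }

  alarm_actions = [var.alarm_sns_topic_arn]
}
```

Alarming on a percentage rather than on absolute bytes keeps the threshold valid if the node type changes. Setting the threshold well below 100% leaves time to clear the job queue before Redis runs out of memory.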