Closed digitalali-moj closed 7 months ago
Hi @digitalali-moj, you can achieve this by configuring CloudWatch alarms. One of the teams recently implemented CloudWatch alarms, which can be found here: https://github.com/ministryofjustice/cloud-platform-environments/blob/main/namespaces/live.cloud-platform.service.justice.gov.uk/hmpps-integration-api-dev/resources/api_gateway.tf#L275-L357. We haven't completed the user guide on how to configure them yet, but this code reference should give you an idea of how to configure alarms for the namespaces in your services.
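For illustration only, here is a rough Python sketch of what a memory alarm for the Redis cluster in this issue might look like. It is not the linked Terraform and not an official Cloud Platform pattern. It assumes the `DatabaseMemoryUsagePercentage` metric in the `AWS/ElastiCache` namespace is the right signal, that an SNS topic already exists to receive notifications (the ARN below is a placeholder), and that `boto3` is available. The helper names, the 80% threshold, and the evaluation window are all hypothetical choices.

```python
from typing import Any, Dict


def build_redis_memory_alarm(
    cache_cluster_id: str,
    threshold_percent: float,
    sns_topic_arn: str,
    cache_node_id: str = "0001",
) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch's put_metric_alarm call.

    Hypothetical helper: alarms when average Redis memory usage stays at
    or above threshold_percent for three consecutive 5-minute periods.
    """
    if not 0 < threshold_percent <= 100:
        raise ValueError("threshold_percent must be in the range (0, 100]")
    return {
        "AlarmName": f"{cache_cluster_id}-redis-memory-high",
        "AlarmDescription": "Redis memory usage is approaching its limit",
        "Namespace": "AWS/ElastiCache",
        # Assumption: this metric is the right proxy for the job build-up.
        "MetricName": "DatabaseMemoryUsagePercentage",
        "Dimensions": [
            {"Name": "CacheClusterId", "Value": cache_cluster_id},
            {"Name": "CacheNodeId", "Value": cache_node_id},
        ],
        "Statistic": "Average",
        "Period": 300,  # seconds per evaluation period
        "EvaluationPeriods": 3,
        "Threshold": float(threshold_percent),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(alarm: Dict[str, Any], region: str = "eu-west-2") -> None:
    """Create or update the alarm in AWS (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("cloudwatch", region_name=region).put_metric_alarm(**alarm)


if __name__ == "__main__":
    # Cluster ID taken from the Grafana link in this issue; the SNS topic
    # ARN is a placeholder, not a real resource.
    alarm = build_redis_memory_alarm(
        "cp-0198cf4c888875ac-002",
        80,
        "arn:aws:sns:eu-west-2:000000000000:example-alerts",
    )
    print(alarm["AlarmName"])
```

Calling `create_alarm(alarm)` would then register the alarm. In practice the same settings would more likely live in the namespace's Terraform alongside the ElastiCache module, as in the linked example.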
Service name
Track-a-query (May affect other services too)
Service environment
Impact on the service
Currently we have alerts for pod/job failures, but we seem to have a common issue with Redis: it doesn't flush out jobs, so they build up. Once the build-up passes a certain threshold, Redis runs out of memory and we get failures. If we can be alerted before this happens, we can go and clear out the jobs and be better prepared for any issues, so the live service isn't affected.
Problem description
Currently the problems are:
Redis is not flushing out jobs after completion, or at least not all of them, so the queue size increases. This can be seen here: https://grafana.live.cloud-platform.service.justice.gov.uk/d/nK7rpiQZk/aws-elasticache-redis?orgId=1&var-datasource=Cloudwatch&var-region=eu-west-2&var-cacheclusterId=cp-0198cf4c888875ac-002&var-cachenodeid=0001&from=now-1y&to=now
We can't set monitoring thresholds with automatic alerting on CloudWatch metrics without some configuration of the AWS exporter and possibly some other moving parts.
Contact person
javid.ali@digital.justice.gov.uk