ministryofjustice / nvvs-devops

Documentation for the NVVS DevOps Team
https://ministryofjustice.github.io/nvvs-devops
MIT License

Pager Duty Alert #416

Closed: tomwells98 closed this issue 1 year ago

tomwells98 commented 1 year ago

A PagerDuty alert was sent to KY. The issue is resolved, but we need to work out what caused the alert. The alert was as follows:

tomwells98 commented 1 year ago

KY followed these troubleshooting steps. Clicked on the Grafana link and the page was not loading, then worked through the following:
1) Checked the EKS cluster: services running.
2) Checked the Load Balancer in AWS: exists and active.
3) Checked Route 53: the CNAME entry for monitoring-alerting.staff.service.justice.gov.uk was missing.
4) Ran the GitHub Actions pipeline hoping it would reinstate the record, but then remembered the Route 53 hosted zone is manually managed. Manually (re)created the CNAME record for monitoring-alerting.staff.service.justice.gov.uk pointing to the load balancer.
5) Can now access Grafana. Graphs show metrics are actually coming through.
6) AlertManager shows the alert in question has stopped firing.
7) @Aaron looked at CloudTrail logs to see if we could find what caused the deletion of the CNAME from Route 53. Could not ascertain the cause, so one for the team to investigate in the morning.
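For next time, the Route 53 check in step 3 could be done quickly from a terminal before opening the AWS console. The snippet below is only a rough sketch: `record_resolves` is a hypothetical helper (not something in our repos), it uses the standard library resolver, and it only tells us whether the name resolves from the machine running it, not what the hosted zone actually contains.

```python
import socket


def record_resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves to at least one address.

    `resolver` is injectable so the check can be exercised without
    network access; by default it is the system resolver.
    """
    try:
        return len(resolver(hostname, 443)) > 0
    except socket.gaierror:
        # Raised when the name cannot be resolved, e.g. the record is missing.
        return False
```

Calling `record_resolves("monitoring-alerting.staff.service.justice.gov.uk")` should return `False` while the record is missing and `True` once it has been recreated and any negative caching has expired.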

juddin927 commented 1 year ago

{
  "eventVersion": "1.08",
  "userIdentity": {
    "type": "AssumedRole",
    "principalId": "AROAQRJYEWZGJNZKXUMLI:1691092247607365746",
    "arn": "arn:aws:sts::037161842252:assumed-role/nvvs-devops-monitor-eks-dare-ExternalDNSRole/1691092247607365746",
    "accountId": "037161842252",
    "accessKeyId": "ASIAQRJYEWZGCJNJXZBT",
    "sessionContext": {
      "sessionIssuer": {
        "type": "Role",
        "principalId": "AROAQRJYEWZGJNZKXUMLI",
        "arn": "arn:aws:iam::037161842252:role/nvvs-devops-monitor-eks-dare-ExternalDNSRole",
        "accountId": "037161842252",
        "userName": "nvvs-devops-monitor-eks-dare-ExternalDNSRole"
      },
      "webIdFederationData": {
        "federatedProvider": "arn:aws:iam::037161842252:oidc-provider/oidc.eks.eu-west-2.amazonaws.com/id/FDE5F1EFE6A12ABB2B54284131731AB3",
        "attributes": {}
      },
      "attributes": {
        "creationDate": "2023-08-03T19:50:47Z",
        "mfaAuthenticated": "false"
      }
    }
  },
  "eventTime": "2023-08-03T19:54:50Z",
  "eventSource": "route53.amazonaws.com",
  "eventName": "ChangeResourceRecordSets",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "51.149.250.100",
  "userAgent": "aws-sdk-go/1.44.136 (go1.19.4; linux; amd64)",
  "requestParameters": {
    "hostedZoneId": "Z06002415PZ7SU3SAQ3E",
    "changeBatch": {
      "changes": [
        {
          "action": "DELETE",
          "resourceRecordSet": {
            "name": "mojo-ima-ext-dns-cname-monitoring-alerting.staff.service.justice.gov.uk",
            "type": "TXT",
            "tTL": 300,
            "resourceRecords": [
              { "value": "\"heritage=external-dns,external-dns/owner=monitoring-alerting,external-dns/resource=service/ingress-nginx/ingress-nginx-controller\"" }
            ]
          }
        },
        {
          "action": "DELETE",
          "resourceRecordSet": {
            "name": "mojo-ima-ext-dns-monitoring-alerting.staff.service.justice.gov.uk",
            "type": "TXT",
            "tTL": 300,
            "resourceRecords": [
              { "value": "\"heritage=external-dns,external-dns/owner=monitoring-alerting,external-dns/resource=service/ingress-nginx/ingress-nginx-controller\"" }
            ]
          }
        },
        {
          "action": "DELETE",
          "resourceRecordSet": {
            "name": "monitoring-alerting.staff.service.justice.gov.uk",
            "type": "A",
            "aliasTarget": {
              "hostedZoneId": "ZD4D7Y8KGAS4G",
              "dNSName": "a5cc0ba12fd404673bf0069413ac177b-b550f1e6e6272c6e.elb.eu-west-2.amazonaws.com",
              "evaluateTargetHealth": true
            }
          }
        }
      ]
    }
  },
  "responseElements": {
    "changeInfo": {
      "id": "/change/C06131388EVYYCD4UTU",
      "status": "PENDING",
      "submittedAt": "Aug 3, 2023 7:54:49 PM"
    }
  },
  "additionalEventData": {
    "Note": "Do not use to reconstruct hosted zone"
  },
  "requestID": "770f3b0c-a12e-48f2-97b7-d3454b5d78e3",
  "eventID": "9070146d-1650-45f7-96e7-0faecb3754a4",
  "readOnly": false,
  "eventType": "AwsApiCall",
  "apiVersion": "2013-04-01",
  "managementEvent": true,
  "recipientAccountId": "037161842252",
  "eventCategory": "Management",
  "tlsDetails": {
    "tlsVersion": "TLSv1.2",
    "cipherSuite": "ECDHE-RSA-AES128-GCM-SHA256",
    "clientProvidedHostHeader": "route53.amazonaws.com"
  }
}
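The event is hard to read in raw form, so here is a small sketch that pulls out who made the change and which records were touched. `summarise_changes` is a hypothetical helper written for this thread, and `sample` is a trimmed copy of the event above (only the fields the function reads). It is a reading aid, not a finished tool.

```python
def summarise_changes(event):
    """Return (role_arn, [(action, record_type, record_name), ...]) for one
    ChangeResourceRecordSets CloudTrail event."""
    identity = event["userIdentity"]
    issuer = identity.get("sessionContext", {}).get("sessionIssuer", {})
    # Prefer the underlying IAM role; fall back to the session ARN.
    role = issuer.get("arn", identity.get("arn"))
    changes = event["requestParameters"]["changeBatch"]["changes"]
    return role, [
        (c["action"], c["resourceRecordSet"]["type"], c["resourceRecordSet"]["name"])
        for c in changes
    ]


# Trimmed copy of the event posted above.
sample = {
    "userIdentity": {
        "arn": "arn:aws:sts::037161842252:assumed-role/nvvs-devops-monitor-eks-dare-ExternalDNSRole/1691092247607365746",
        "sessionContext": {
            "sessionIssuer": {
                "arn": "arn:aws:iam::037161842252:role/nvvs-devops-monitor-eks-dare-ExternalDNSRole"
            }
        },
    },
    "requestParameters": {
        "changeBatch": {
            "changes": [
                {"action": "DELETE", "resourceRecordSet": {
                    "name": "mojo-ima-ext-dns-cname-monitoring-alerting.staff.service.justice.gov.uk",
                    "type": "TXT"}},
                {"action": "DELETE", "resourceRecordSet": {
                    "name": "mojo-ima-ext-dns-monitoring-alerting.staff.service.justice.gov.uk",
                    "type": "TXT"}},
                {"action": "DELETE", "resourceRecordSet": {
                    "name": "monitoring-alerting.staff.service.justice.gov.uk",
                    "type": "A"}},
            ]
        }
    },
}

role, changes = summarise_changes(sample)
for action, rtype, name in changes:
    print(f"{action:7} {rtype:4} {name}")
print("changed by:", role)
```

Read this way, the event shows the ExternalDNS role deleting an A alias record for monitoring-alerting.staff.service.justice.gov.uk together with two TXT records whose values mark them as owned by external-dns.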

juddin927 commented 1 year ago

I see multiple create and delete operations using the role above.

juddin927 commented 1 year ago

aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=ChangeResourceRecordSets --region us-east-1 | grep "DELETE" | grep "monitoring-alerting.staff.service.justice.gov.uk"