Closed tomwells98 closed 1 year ago
KY followed these troubleshooting steps:
Clicked on the Grafana link and the page was not loading. Then went through the following troubleshooting steps:
1) Checked EKS cluster - services running.
2) Checked Load Balancer in AWS - Exists and Active.
3) Checked Route 53 - CNAME entry for monitoring-alerting.staff.service.justice.gov.uk was missing.
Ran the GitHub Actions pipeline hoping it would reinstate the record, but then remembered that the Route 53 hosted zone is manually managed.
Manually (re)created the CNAME record for monitoring-alerting.staff.service.justice.gov.uk, pointing to the load balancer.
5) Can now access Grafana. Graphs show metrics are actually coming through.
6) AlertManager showing the concerning alert has stopped firing.
4) @Aaron looked at the CloudTrail logs to see if we could find what caused the deletion of the CNAME from Route 53. Could not ascertain the cause, so this is one for the team to investigate in the morning.
{
  "eventVersion": "1.08",
  "userIdentity": {
    "type": "AssumedRole",
    "principalId": "AROAQRJYEWZGJNZKXUMLI:1691092247607365746",
    "arn": "arn:aws:sts::037161842252:assumed-role/nvvs-devops-monitor-eks-dare-ExternalDNSRole/1691092247607365746",
    "accountId": "037161842252",
    "accessKeyId": "ASIAQRJYEWZGCJNJXZBT",
    "sessionContext": {
      "sessionIssuer": {
        "type": "Role",
        "principalId": "AROAQRJYEWZGJNZKXUMLI",
        "arn": "arn:aws:iam::037161842252:role/nvvs-devops-monitor-eks-dare-ExternalDNSRole",
        "accountId": "037161842252",
        "userName": "nvvs-devops-monitor-eks-dare-ExternalDNSRole"
      },
      "webIdFederationData": {
        "federatedProvider": "arn:aws:iam::037161842252:oidc-provider/oidc.eks.eu-west-2.amazonaws.com/id/FDE5F1EFE6A12ABB2B54284131731AB3",
        "attributes": {}
      },
      "attributes": {
        "creationDate": "2023-08-03T19:50:47Z",
        "mfaAuthenticated": "false"
      }
    }
  },
  "eventTime": "2023-08-03T19:54:50Z",
  "eventSource": "route53.amazonaws.com",
  "eventName": "ChangeResourceRecordSets",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "51.149.250.100",
  "userAgent": "aws-sdk-go/1.44.136 (go1.19.4; linux; amd64)",
  "requestParameters": {
    "hostedZoneId": "Z06002415PZ7SU3SAQ3E",
    "changeBatch": {
      "changes": [
        {
          "action": "DELETE",
          "resourceRecordSet": {
            "name": "mojo-ima-ext-dns-cname-monitoring-alerting.staff.service.justice.gov.uk",
            "type": "TXT",
            "tTL": 300,
            "resourceRecords": [
              { "value": "\"heritage=external-dns,external-dns/owner=monitoring-alerting,external-dns/resource=service/ingress-nginx/ingress-nginx-controller\"" }
            ]
          }
        },
        {
          "action": "DELETE",
          "resourceRecordSet": {
            "name": "mojo-ima-ext-dns-monitoring-alerting.staff.service.justice.gov.uk",
            "type": "TXT",
            "tTL": 300,
            "resourceRecords": [
              { "value": "\"heritage=external-dns,external-dns/owner=monitoring-alerting,external-dns/resource=service/ingress-nginx/ingress-nginx-controller\"" }
            ]
          }
        },
        {
          "action": "DELETE",
          "resourceRecordSet": {
            "name": "monitoring-alerting.staff.service.justice.gov.uk",
            "type": "A",
            "aliasTarget": {
              "hostedZoneId": "ZD4D7Y8KGAS4G",
              "dNSName": "a5cc0ba12fd404673bf0069413ac177b-b550f1e6e6272c6e.elb.eu-west-2.amazonaws.com",
              "evaluateTargetHealth": true
            }
          }
        }
      ]
    }
  },
  "responseElements": {
    "changeInfo": {
      "id": "/change/C06131388EVYYCD4UTU",
      "status": "PENDING",
      "submittedAt": "Aug 3, 2023 7:54:49 PM"
    }
  },
  "additionalEventData": {
    "Note": "Do not use to reconstruct hosted zone"
  },
  "requestID": "770f3b0c-a12e-48f2-97b7-d3454b5d78e3",
  "eventID": "9070146d-1650-45f7-96e7-0faecb3754a4",
  "readOnly": false,
  "eventType": "AwsApiCall",
  "apiVersion": "2013-04-01",
  "managementEvent": true,
  "recipientAccountId": "037161842252",
  "eventCategory": "Management",
  "tlsDetails": {
    "tlsVersion": "TLSv1.2",
    "cipherSuite": "ECDHE-RSA-AES128-GCM-SHA256",
    "clientProvidedHostHeader": "route53.amazonaws.com"
  }
}
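To make the event above easier to scan, here is a minimal Python sketch that pulls out the calling role and the per-record changes. `EVENT_JSON` is a trimmed copy of the CloudTrail event (only the fields the code reads are kept), and `summarise_event` is a hypothetical helper name, not part of any AWS tooling.

```python
import json

# Trimmed copy of the CloudTrail event above; only the fields read below are kept.
EVENT_JSON = """
{
  "userIdentity": {"sessionContext": {"sessionIssuer": {
    "userName": "nvvs-devops-monitor-eks-dare-ExternalDNSRole"}}},
  "eventTime": "2023-08-03T19:54:50Z",
  "eventName": "ChangeResourceRecordSets",
  "requestParameters": {"hostedZoneId": "Z06002415PZ7SU3SAQ3E", "changeBatch": {"changes": [
    {"action": "DELETE", "resourceRecordSet": {
      "name": "mojo-ima-ext-dns-cname-monitoring-alerting.staff.service.justice.gov.uk", "type": "TXT"}},
    {"action": "DELETE", "resourceRecordSet": {
      "name": "mojo-ima-ext-dns-monitoring-alerting.staff.service.justice.gov.uk", "type": "TXT"}},
    {"action": "DELETE", "resourceRecordSet": {
      "name": "monitoring-alerting.staff.service.justice.gov.uk", "type": "A"}}
  ]}}
}
"""


def summarise_event(event):
    """Return (event_time, role_name, [(action, record_type, record_name), ...])."""
    issuer = event["userIdentity"]["sessionContext"]["sessionIssuer"]
    changes = event["requestParameters"]["changeBatch"]["changes"]
    rows = [
        (c["action"], c["resourceRecordSet"]["type"], c["resourceRecordSet"]["name"])
        for c in changes
    ]
    return event["eventTime"], issuer["userName"], rows


when, role, rows = summarise_event(json.loads(EVENT_JSON))
print(when, role)
for action, rtype, name in rows:
    print(f"  {action:6} {rtype:4} {name}")
```

For this event the summary shows the ExternalDNS role deleting two TXT records and the record for monitoring-alerting.staff.service.justice.gov.uk in a single batch. Note that the hostname's record in this event has type A (an alias to the load balancer), not CNAME.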
I see multiple create and delete operations using the role above:
aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=ChangeResourceRecordSets --region us-east-1 | grep "DELETE"| grep "monitoring-alerting.staff.service.justice.gov.uk"
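The `grep` pipeline matches on substrings, so it also picks up the `mojo-ima-ext-dns-...` TXT records and any event that merely mentions both strings. As a hedged alternative, the Python sketch below filters on the exact record name and tallies actions. It assumes the JSON output shape of `lookup-events` (an `Events` list whose `CloudTrailEvent` field is a JSON string); `record_changes` is a hypothetical helper, and the sample holds a single event trimmed from the log above.

```python
import json
from collections import Counter


def record_changes(lookup_output, record_name):
    """Yield (event_time, caller_arn, action, record_type) for changes to record_name."""
    wanted = record_name.rstrip(".").lower()
    for item in lookup_output.get("Events", []):
        # Each entry carries the full event as a JSON-encoded string.
        event = json.loads(item["CloudTrailEvent"])
        params = event.get("requestParameters") or {}
        for change in params.get("changeBatch", {}).get("changes", []):
            rrset = change["resourceRecordSet"]
            # Exact match on the record name, ignoring case and a trailing dot.
            if rrset["name"].rstrip(".").lower() == wanted:
                yield (
                    event["eventTime"],
                    event["userIdentity"].get("arn"),
                    change["action"],
                    rrset["type"],
                )


# Sample shaped like `lookup-events` output, holding one event trimmed from the log above.
sample = {"Events": [{"CloudTrailEvent": json.dumps({
    "eventTime": "2023-08-03T19:54:50Z",
    "userIdentity": {"arn": "arn:aws:sts::037161842252:assumed-role/"
                            "nvvs-devops-monitor-eks-dare-ExternalDNSRole/1691092247607365746"},
    "requestParameters": {"changeBatch": {"changes": [
        {"action": "DELETE", "resourceRecordSet": {
            "name": "mojo-ima-ext-dns-monitoring-alerting.staff.service.justice.gov.uk",
            "type": "TXT"}},
        {"action": "DELETE", "resourceRecordSet": {
            "name": "monitoring-alerting.staff.service.justice.gov.uk",
            "type": "A"}},
    ]}},
})}]}

rows = list(record_changes(sample, "monitoring-alerting.staff.service.justice.gov.uk"))
counts = Counter(action for _, _, action, _ in rows)
for row in rows:
    print(row)
print(dict(counts))
```

To run it against real data, the output of the CLI command with `--output json` could be loaded with `json.load(sys.stdin)` and passed in place of `sample`.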
A PagerDuty alert was sent to KY. The issue is resolved, but we need to work out what caused the alert. The alert was as follows:
Labels: