ministryofjustice / find-moj-data

Find MOJ data service • This repository is defined and managed in Terraform
MIT License

Get false positive alerts under control #832

Closed YvanMOJdigital closed 2 weeks ago

YvanMOJdigital commented 1 month ago

Our alerts channel is very busy right now because some Prometheus alerts are being triggered a lot: GMSCpuUsageHigh, RDSLowStorage, ElasticSearchClusterFreeStorageSpace.

These alerts might signify a problem, or they might be overly conservative and need loosening.

teeceeas commented 4 weeks ago

Find out what is causing the problem and raise bugs to resolve any issues that could be triggering the alerts

MatMoore commented 3 weeks ago

Latency alerts

Elasticsearch cluster free storage space

RDS low storage (datahub)

GMS CPU usage

MatMoore commented 3 weeks ago

This one is still awaiting review https://github.com/ministryofjustice/cloud-platform-environments/pull/26417

The others I haven't been able to investigate any further yet.

Also, in the process of following https://runbooks.cloud-platform.service.justice.gov.uk/debugging-aws-console-access.html I removed the data-platform-labs team, thinking it wasn't in use anymore. It is still used by the Find MoJ data Cloud Platform namespaces, so access is broken at the moment. Cloud Platform have said they should be able to fix this tomorrow (https://mojdt.slack.com/archives/C57UPMZLY/p1727794576675889).

MatMoore commented 2 weeks ago

Another issue here is that the runbook link on all the alerts doesn't work, and some of the dashboard links don't seem to populate the environment filter.

For every alert, we should make sure that

MatMoore commented 2 weeks ago

KubeDeploymentReplicasMismatch

Fired today with

  - alertname: KubeDeploymentReplicasMismatch
  - clusterName: live
  - container: kube-state-metrics
  - deployment: datahub-acryl-datahub-actions
  - endpoint: http
  - instance: 172.20.175.1:8080
  - job: kube-state-metrics
  - namespace: data-platform-datahub-catalogue-test
  - pod: prometheus-operator-kube-state-metrics-6dd868784f-kdqdd
  - prometheus: monitoring/prometheus-operator-kube-p-prometheus
  - service: prometheus-operator-kube-state-metrics
  - severity: datahub_test

Expression

kube_deployment_spec_replicas{job="kube-state-metrics",namespace="data-platform-datahub-catalogue-test"} != kube_deployment_status_replicas_available{job="kube-state-metrics",namespace="data-platform-datahub-catalogue-test"}

This compares

How do we debug this?

describe the deployment

describe the replicaset

Events:
  Type     Reason        Age                     From                   Message
  ----     ------        ----                    ----                   -------
  Warning  FailedCreate  3m46s (x297 over 3d3h)  replicaset-controller  Error creating: pods "datahub-acryl-datahub-actions-887f87dff-" is forbidden: error looking up service account data-platform-datahub-catalogue-test/data-platform-test: serviceaccount "data-platform-test" not found

This seems to be caused by retiring the data-platform GitHub group for data catalogue resources. The service account is now named data-catalogue-test.

MatMoore commented 2 weeks ago

Configuration for datahub actions service accounts

https://github.com/acryldata/datahub-helm/blob/58648eb8479ccbe789f7aa391b7bc633276d0c2e/charts/datahub/subcharts/acryl-datahub-actions/templates/deployment.yaml#L37

This seems to evaluate to default: https://github.com/acryldata/datahub-helm/blob/58648eb8479ccbe789f7aa391b7bc633276d0c2e/charts/datahub/subcharts/acryl-datahub-actions/templates/_helpers.tpl#L57

serviceAccount.create defaults to false

acryl-datahub-actions.serviceAccount.name is set to ""

but it's overridden on the command line: https://github.com/ministryofjustice/data-catalogue/pull/11/files#diff-7723d0f09c18e7709388ae420eaffcbe1986195fc2f9bce3f300b7bbb4fcb4aaR163

So in this case we probably need to rename the value of IRSA_SA.
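
To illustrate what that rename amounts to, here is the same override written as a Helm values fragment instead of a command-line flag. The keys are the chart paths mentioned above, and data-catalogue-test is the renamed service account from the earlier comment. It is an assumption that IRSA_SA is what ends up in serviceAccount.name, and that the service account itself is created outside the chart.

```yaml
# Sketch only: the command-line override expressed as values.
# Assumes IRSA_SA is the value passed into serviceAccount.name.
acryl-datahub-actions:
  serviceAccount:
    create: false              # chart default; the account is assumed to exist already
    name: data-catalogue-test  # was data-platform-test, which is no longer found
```

Whichever way the value is passed, the replicaset should stop failing with "serviceaccount not found" once the name matches an account that exists in the namespace.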

MatMoore commented 2 weeks ago

https://github.com/ministryofjustice/cloud-platform-environments/pull/26657

MatMoore commented 2 weeks ago

https://github.com/ministryofjustice/data-catalogue/actions/runs/11218554264 this one needs rolling out to prod still

mitchdawson1982 commented 2 weeks ago

I have increased the evaluation period from 1 to 5 minutes for the Elasticsearch cluster free storage space alert

https://github.com/ministryofjustice/cloud-platform-environments/pull/26677
https://github.com/ministryofjustice/cloud-platform-environments/pull/26678
https://github.com/ministryofjustice/cloud-platform-environments/pull/26679
https://github.com/ministryofjustice/cloud-platform-environments/pull/26680
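
For anyone reading without opening the PRs, a minimal sketch of what a change like this might look like in a Prometheus rule, assuming the "evaluation period" is the rule's for clause. The expression is a placeholder, not the real one; the linked PRs are the source of truth.

```yaml
# Sketch only: expr is a placeholder, see the linked PRs for the real rule.
- alert: ElasticSearchClusterFreeStorageSpace
  expr: PLACEHOLDER_EXISTING_EXPRESSION
  for: 5m   # previously 1m: the condition must hold for 5 minutes before the alert fires
```

A longer for duration means short dips below the threshold no longer reach the alerts channel, at the cost of a few minutes' delay on a genuine problem.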

MatMoore commented 2 weeks ago

We should maybe also include the environment in the alert name, to make clear which alerts affect real users.

At the moment it's buried in the details, so you can't see which environment an alert is for unless you click "show more" in the Slack alert.
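
One way this could look, as a sketch: the environment is appended to the alert name and repeated in an annotation. The "-test" suffix and the annotation wording are made up for illustration; the expression and severity label are the ones from the KubeDeploymentReplicasMismatch alert quoted earlier in this thread. Other fields, such as for, are omitted.

```yaml
# Sketch only: the "-test" suffix and the annotation text are hypothetical.
- alert: KubeDeploymentReplicasMismatch-test
  expr: |
    kube_deployment_spec_replicas{job="kube-state-metrics",namespace="data-platform-datahub-catalogue-test"}
      != kube_deployment_status_replicas_available{job="kube-state-metrics",namespace="data-platform-datahub-catalogue-test"}
  labels:
    severity: datahub_test
  annotations:
    message: "[test] Deployment {{ $labels.deployment }} in {{ $labels.namespace }} does not have its expected number of replicas."
```

With the environment in the name, the Slack notification title alone shows whether an alert is for test or for an environment with real users.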

MatMoore commented 2 weeks ago

Follow up tickets: