YvanMOJdigital closed 2 weeks ago
Find out what is causing the problem and raise bugs to resolve any issues that could be triggering the alerts
This one is still awaiting review https://github.com/ministryofjustice/cloud-platform-environments/pull/26417
The others I haven't been able to investigate any further yet.
Also, in the process of following https://runbooks.cloud-platform.service.justice.gov.uk/debugging-aws-console-access.html I removed the data-platform-labs team, thinking it wasn't in use anymore, but it is still in use by the Find MoJ data cloud platform namespaces, so access is broken at the moment. Cloud Platform have said they should be able to fix this tomorrow (https://mojdt.slack.com/archives/C57UPMZLY/p1727794576675889).
Another issue here is that the runbook link on all the alerts doesn't work, and some of the dashboard links don't seem to populate the environment filter.
For every alert, we should make sure that
KubeDeploymentReplicasMismatch
Fired today with
- alertname: KubeDeploymentReplicasMismatch
- clusterName: live
- container: kube-state-metrics
- deployment: datahub-acryl-datahub-actions
- endpoint: http
- instance: 172.20.175.1:8080
- job: kube-state-metrics
- namespace: data-platform-datahub-catalogue-test
- pod: prometheus-operator-kube-state-metrics-6dd868784f-kdqdd
- prometheus: monitoring/prometheus-operator-kube-p-prometheus
- service: prometheus-operator-kube-state-metrics
- severity: datahub_test
Expression
kube_deployment_spec_replicas{job="kube-state-metrics",namespace="data-platform-datahub-catalogue-test"} != kube_deployment_status_replicas_available{job="kube-state-metrics",namespace="data-platform-datahub-catalogue-test"}
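This compares the number of replicas requested in each deployment's spec with the number of replicas currently available, and matches any deployment in the namespace where the two differ.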
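To make the comparison concrete, here is a minimal Python sketch of what the expression above checks. The function name and the sample data are hypothetical (the second deployment name is invented for contrast); the real evaluation happens in Prometheus, not in code like this.

```python
def mismatched_deployments(spec_replicas, available_replicas):
    """Return deployments whose desired replica count differs from the available count."""
    mismatches = {}
    for deployment, desired in spec_replicas.items():
        # Like PromQL's != between two vectors, only compare deployments
        # that appear on both sides.
        if deployment not in available_replicas:
            continue
        available = available_replicas[deployment]
        if desired != available:
            mismatches[deployment] = (desired, available)
    return mismatches


# Hypothetical sample values: one deployment cannot start its pod, one is healthy.
spec = {"datahub-acryl-datahub-actions": 1, "example-healthy-deployment": 1}
available = {"datahub-acryl-datahub-actions": 0, "example-healthy-deployment": 1}

print(mismatched_deployments(spec, available))
```

Only the deployment whose available count differs from its spec is returned, which is the condition that makes the alert fire.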
How do we debug this?
describe the deployment
describe the replicaset
Events:
  Type     Reason        Age                     From                   Message
  ----     ------        ----                    ----                   -------
  Warning  FailedCreate  3m46s (x297 over 3d3h)  replicaset-controller  Error creating: pods "datahub-acryl-datahub-actions-887f87dff-" is forbidden: error looking up service account data-platform-datahub-catalogue-test/data-platform-test: serviceaccount "data-platform-test" not found
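If this kind of event needs to be spotted across several replicasets, the missing service account name can be pulled out of the message text. This is only a sketch: the helper name and the regular expression are assumptions based on the single message shown above, not a documented Kubernetes message format.

```python
import re

# Pattern derived from the one FailedCreate message in this issue (assumption).
MISSING_SA = re.compile(r'serviceaccount "(?P<name>[^"]+)" not found')


def missing_service_account(message):
    """Return the service account name the event says is missing, or None."""
    match = MISSING_SA.search(message)
    return match.group("name") if match else None


event_message = (
    'Error creating: pods "datahub-acryl-datahub-actions-887f87dff-" is forbidden: '
    "error looking up service account "
    "data-platform-datahub-catalogue-test/data-platform-test: "
    'serviceaccount "data-platform-test" not found'
)

print(missing_service_account(event_message))
```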
This seems to be caused by the retiring of the data-platform GitHub group for data catalogue resources. The service account is now named data-catalogue-test.
Configuration for datahub actions service accounts
The service account name seems to evaluate to default (https://github.com/acryldata/datahub-helm/blob/58648eb8479ccbe789f7aa391b7bc633276d0c2e/charts/datahub/subcharts/acryl-datahub-actions/templates/_helpers.tpl#L57): serviceAccount.create defaults to false and acryl-datahub-actions.serviceAccount.name is set to "", but it's overridden on the command line (https://github.com/ministryofjustice/data-catalogue/pull/11/files#diff-7723d0f09c18e7709388ae420eaffcbe1986195fc2f9bce3f300b7bbb4fcb4aaR163).
So in this case we probably need to change the value of IRSA_SA to the new service account name.
https://github.com/ministryofjustice/data-catalogue/actions/runs/11218554264 this one needs rolling out to prod still
I have increased the evaluation period from 1 to 5 minutes for the Elasticsearch cluster free storage space alert
https://github.com/ministryofjustice/cloud-platform-environments/pull/26677 https://github.com/ministryofjustice/cloud-platform-environments/pull/26678 https://github.com/ministryofjustice/cloud-platform-environments/pull/26679 https://github.com/ministryofjustice/cloud-platform-environments/pull/26680
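To illustrate why a longer period cuts down on noise, here is a simplified Python model. It assumes the "evaluation period" behaves like a Prometheus rule's for: duration (the condition must hold continuously before the alert fires) and that the rule is evaluated once a minute; both are assumptions, and the function name and sample data are hypothetical.

```python
def firing_minutes(condition_by_minute, for_minutes):
    """Return the minutes at which the alert fires.

    The alert fires once the condition has held continuously for for_minutes.
    """
    firing = []
    pending_since = None
    for minute, breached in enumerate(condition_by_minute):
        if not breached:
            # Any recovery resets the pending timer.
            pending_since = None
            continue
        if pending_since is None:
            pending_since = minute
        if minute - pending_since >= for_minutes:
            firing.append(minute)
    return firing


# Hypothetical samples: a 2-minute blip, then a sustained 6-minute breach.
samples = [False, True, True, False, True, True, True, True, True, True, False]

print(firing_minutes(samples, 1))
print(firing_minutes(samples, 5))
```

With a 1-minute period both the short blip and the sustained breach fire; with 5 minutes only the sustained breach does.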
We should maybe also include the environment in the alert name, to make it clear which alerts affect real users.
At the moment it's buried in the details, so you can't see which environment an alert is for unless you click "show more" in the Slack alert.
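As a rough sketch of the idea, the title could be built from the alert's labels. This assumes the environment is the last hyphen-separated segment of the namespace name, which holds for data-platform-datahub-catalogue-test but may not hold everywhere; the function name is hypothetical, and in practice this would more likely live in the alert rule or Slack template than in Python.

```python
def alert_title(labels):
    """Prefix the alert name with an environment guessed from the namespace."""
    namespace = labels.get("namespace", "")
    # Assumption: namespaces end in "-<environment>", e.g. "...-test".
    environment = namespace.rsplit("-", 1)[-1] if "-" in namespace else "unknown"
    return f"[{environment}] {labels['alertname']}"


labels = {
    "alertname": "KubeDeploymentReplicasMismatch",
    "namespace": "data-platform-datahub-catalogue-test",
}

print(alert_title(labels))
```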
Our alerts channel is very busy right now due to some Prometheus alerts getting triggered a lot: GMSCpuUsageHigh, RDSLowStorage, ElasticSearchClusterFreeStorageSpace.
These alerts might signify a problem, or they might be overly conservative and need loosening.