ministryofjustice / data-catalogue

Data catalogue • This repository is defined and managed in Terraform
MIT License

Set up Prometheus alerts and Pingdom for prod datahub/fmd #187

Closed LavMatt closed 1 month ago

LavMatt commented 1 month ago

Once the prod namespaces exist, we need to set up alerting.

See: Creating your own custom alerts

MatMoore commented 1 month ago

Prometheus rules are already defined for Find MOJ data prod: https://github.com/ministryofjustice/cloud-platform-environments/tree/main/namespaces/live.cloud-platform.service.justice.gov.uk/data-platform-find-moj-data-prod

But they're missing for datahub prod: https://github.com/ministryofjustice/cloud-platform-environments/tree/main/namespaces/live.cloud-platform.service.justice.gov.uk/data-platform-datahub-catalogue-prod

MatMoore commented 1 month ago

Draft PR https://github.com/ministryofjustice/cloud-platform-environments/pull/24587

But first we need Cloud Platform to create the prod severity level.

We also need RDS alerting for FMD prod; see the datahub examples here https://github.com/ministryofjustice/cloud-platform-environments/blob/fdf75d3ca1123d91113861726d67d057f398230b/namespaces/live.cloud-platform.service.justice.gov.uk/data-platform-datahub-catalogue-dev/05-prometheusrule.yaml#L12
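
For orientation, the rules discussed above are `PrometheusRule` objects from the Prometheus Operator. The following is a minimal sketch of generating one from the shell, not the content of the draft PR. The alert name, group name, threshold and message are hypothetical, the severity value is a placeholder for whatever level Cloud Platform creates, and any extra metadata labels the cluster's Prometheus may require are omitted. The expression uses the kube-state-metrics restart counter rather than an RDS metric.

```shell
# Write a PrometheusRule manifest to a file.
# $1 = namespace, $2 = rule name, $3 = severity label value, $4 = output file
# The alert itself (ExamplePodRestarts) is an illustrative assumption,
# not one of the rules defined in the real namespaces.
write_prometheus_rule() {
  cat > "$4" <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: $2
  namespace: $1
spec:
  groups:
    - name: example-rules
      rules:
        - alert: ExamplePodRestarts
          expr: increase(kube_pod_container_status_restarts_total{namespace="$1"}[15m]) > 3
          for: 5m
          labels:
            severity: $3
          annotations:
            message: Pod {{ \$labels.pod }} restarted more than 3 times in 15 minutes
EOF
}
```

For example, `write_prometheus_rule data-platform-datahub-catalogue-prod prometheus-custom-rules-datahub-prod example-severity 05-prometheusrule.yaml` would produce a file that could then be adapted and committed to the namespace directory. The `severity` label is what alert routing matches on, which is why the severity level has to exist before the alerts can reach a channel.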

MatMoore commented 1 month ago

I will wait until https://github.com/ministryofjustice/find-moj-data/issues/546 is done before setting up Pingdom, so that we can point it at the domain people will use rather than the Cloud Platform one.

MatMoore commented 1 month ago

https://github.com/ministryofjustice/cloud-platform/issues/5900 is the issue to request alert routes. It hasn't been picked up yet, so I will come back to the Alertmanager PR when that's ready.

Route 53 PRs are:

dev: https://github.com/ministryofjustice/cloud-platform-environments/pull/24649
test: https://github.com/ministryofjustice/cloud-platform-environments/pull/24654
preprod: https://github.com/ministryofjustice/cloud-platform-environments/pull/24655
prod: https://github.com/ministryofjustice/cloud-platform-environments/pull/24646

The non-prod ones are merged, so the next step is to check the new secret and get in touch with operations engineering.

MatMoore commented 1 month ago

Pingdom is set up to alert to Slack: https://my.pingdom.com/reports/uptime#check=13492357&daterange=7days&tab=uptime_tab&checkName=Finnd%20MOJ%20Data%20-%20Homepage

MatMoore commented 1 month ago

Prometheus alerts are now defined:

kubectl get prometheusrule -n data-platform-datahub-catalogue-prod
NAME                                   AGE
prometheus-custom-rules-datahub-prod   3h55m

kubectl get prometheusrule -n data-platform-find-moj-data-prod
NAME                                         AGE
prometheus-custom-rules-find-moj-data-prod   20d
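
The two checks above could be wrapped in a small helper so that a missing rule fails loudly. This is a sketch only: the function name and output wording are made up for illustration, and it assumes `kubectl` is already authenticated against the cluster. A failed `kubectl` call is treated the same as "no rules found".

```shell
# Return non-zero if any of the given namespaces has no PrometheusRule objects.
# Usage: check_prometheus_rules <namespace> [<namespace> ...]
check_prometheus_rules() {
  cpr_missing=0
  for cpr_ns in "$@"; do
    # -o name prints one "resource/name" line per object and nothing when
    # there are none; errors are swallowed and count as missing.
    cpr_rules="$(kubectl get prometheusrule -n "$cpr_ns" -o name 2>/dev/null || true)"
    if [ -z "$cpr_rules" ]; then
      echo "MISSING: no PrometheusRule in $cpr_ns"
      cpr_missing=1
    else
      echo "OK: $cpr_ns has $cpr_rules"
    fi
  done
  return "$cpr_missing"
}
```

It would be called as `check_prometheus_rules data-platform-datahub-catalogue-prod data-platform-find-moj-data-prod`, using the two namespaces from the commands above.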

MatMoore commented 1 month ago

It seems the alert routes are working now, as we have prod messages in Slack.

(The naming conventions are a bit weird, but I decided to keep them consistent with the other environments.)