ministryofjustice / data-catalogue

Data catalogue • This repository is defined and managed in Terraform
MIT License

Set up Cloud Platform Alerting for DataHub #29

Closed seanprivett closed 4 months ago

seanprivett commented 6 months ago

Follows on from ministryofjustice/data-catalogue#28

Set up monitoring-based alerting for:

Cloud Platform guide to observability

PagerDuty

Consider (if time) adding PagerDuty as a middleware broker for what/where/when, even if we're not actually implementing an on-call rota.

Requirements

tom-webber commented 4 months ago

Add PrometheusRule alerts for the DataHub namespaces in Cloud Platform, covering the AWS resources (RDS, OpenSearch), resource usage metrics (for the datahub-gms pod), deployment metrics, pod status metrics (out of memory, CrashLoopBackOff, frequent restarts), and ingress metrics (ModSecurity blocking events, error responses being served).
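As a rough illustration of the pod status part of this, here is a minimal Python sketch that builds a PrometheusRule manifest as a dict and prints it as JSON (which `kubectl apply -f -` accepts). The namespace `datahub-dev`, the `severity` value, the alert names and the thresholds are all assumptions for illustration, not the rules that were actually merged. The metrics are the standard kube-state-metrics pod series. Any metadata labels the cluster's Prometheus needs to select the rule are omitted.

```python
import json


def pod_alert_rules(namespace: str, severity: str) -> dict:
    """Build a PrometheusRule manifest with three pod status alerts.

    `namespace` and `severity` are caller-supplied; the values used in the
    example below are hypothetical.
    """
    selector = f'namespace="{namespace}"'
    rules = [
        {
            # Container stuck waiting in CrashLoopBackOff.
            "alert": "DataHubPodCrashLooping",
            "expr": f'kube_pod_container_status_waiting_reason{{{selector},reason="CrashLoopBackOff"}} > 0',
            "for": "5m",
            "labels": {"severity": severity},
            "annotations": {"message": "Pod {{ $labels.pod }} is in CrashLoopBackOff."},
        },
        {
            # Last container termination was an out-of-memory kill.
            "alert": "DataHubPodOOMKilled",
            "expr": f'kube_pod_container_status_last_terminated_reason{{{selector},reason="OOMKilled"}} > 0',
            "for": "1m",
            "labels": {"severity": severity},
            "annotations": {"message": "Pod {{ $labels.pod }} was OOM killed."},
        },
        {
            # More than 3 restarts in the last hour (threshold is an assumption).
            "alert": "DataHubPodFrequentRestarts",
            "expr": f"increase(kube_pod_container_status_restarts_total{{{selector}}}[1h]) > 3",
            "for": "5m",
            "labels": {"severity": severity},
            "annotations": {"message": "Pod {{ $labels.pod }} is restarting frequently."},
        },
    ]
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": "datahub-pod-status", "namespace": namespace},
        "spec": {"groups": [{"name": "datahub-pod-status", "rules": rules}]},
    }


if __name__ == "__main__":
    # Hypothetical namespace and severity label.
    print(json.dumps(pod_alert_rules("datahub-dev", "data-catalogue"), indent=2))
```

The `severity` label is what an Alertmanager route would typically match on, so it should be whatever value the route for this team uses.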

OpenSearch metrics may not be sufficient: during a recent bottleneck event, OpenSearch was unresponsive, but Prometheus metrics were absent for the period of downtime.
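One possible mitigation for the missing-metrics case is to alert on the absence of a series using PromQL's `absent()` function, so that a gap in metrics itself raises an alert. A minimal Python sketch that builds such a rule entry follows; the metric name, label selector and `for` duration are hypothetical placeholders, since the actual OpenSearch metric names exposed in Cloud Platform are not stated here.

```python
def absent_metric_rule(metric: str, selector: str, severity: str, hold: str = "10m") -> dict:
    """Build one PrometheusRule rule entry that fires when `metric` has no samples.

    `absent(v)` returns a 1-element vector when the selector matches nothing,
    so the alert fires once the series has been missing for `hold`.
    """
    return {
        "alert": "DataHubMetricsAbsent",
        "expr": f"absent({metric}{{{selector}}})",
        "for": hold,
        "labels": {"severity": severity},
        "annotations": {
            "message": f"No samples for {metric} for {hold}; the exporter or the service may be down."
        },
    }


if __name__ == "__main__":
    # Placeholder metric and selector, not real OpenSearch series names.
    print(absent_metric_rule("example_opensearch_metric", 'domain="example"', "data-catalogue"))
```

This only tells you that metrics stopped arriving, not why, so it would complement rather than replace alarms raised on the AWS side.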

CP have recently put together a Terraform OpenSearch CloudWatch alarm module, but it is not yet suitable for use in user namespaces, as there’s no mechanism to instantiate infrastructure for exporting events to

tom-webber commented 4 months ago

CP issue for adding routes for alertmanager (resolved)

CP PR for PrometheusRules once alertmanager route is added (merged)

CP issue for adding Pingdom ID for Slack integration