Set up Cloud Platform Alerting for DataHub

seanprivett commented 6 months ago

Follows on from ministryofjustice/data-catalogue#28

Set up monitoring-based alerting for:

- DataHub application
- Kafka

PagerDuty

Consider (if time permits) adding PagerDuty as a middleware broker for what/where/when, even if we're not actually implementing an on-call rota.

Requirements

- [x] Define and deploy custom alerts with PrometheusRule
  - [ ] high latency
  - [x] errors
  - [x] resource depletion
  - [ ] stalled data ingestion
  - [ ] security anomalies
- [x] Set up Pingdom to alert if either of our main services (or important endpoints) goes down
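
As a rough illustration of the PrometheusRule requirement, a manifest can be generated as JSON, which `kubectl apply -f` accepts. This is a minimal sketch rather than the rules actually deployed: the namespace `datahub-dev`, the rule names, the `severity` label value, and the 90% threshold are all hypothetical, and the expression assumes cAdvisor and kube-state-metrics series are available in the cluster Prometheus.

```python
import json


def prometheus_rule(name, namespace, rules):
    """Wrap alert rules in a PrometheusRule custom resource (Prometheus Operator CRD)."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"groups": [{"name": name, "rules": rules}]},
    }


def memory_depletion_rule(namespace, severity, threshold=0.9):
    """Alert when a container's working set exceeds `threshold` of its memory limit."""
    usage = (
        "sum by (pod, container) (container_memory_working_set_bytes"
        f'{{namespace="{namespace}", container!=""}})'
    )
    limit = (
        "sum by (pod, container) (kube_pod_container_resource_limits"
        f'{{namespace="{namespace}", resource="memory"}})'
    )
    return {
        "alert": "ContainerMemoryNearLimit",  # hypothetical alert name
        "expr": f"{usage} / {limit} > {threshold}",
        "for": "10m",
        # Assumption: the severity label is what the Alertmanager route matches on.
        "labels": {"severity": severity},
        "annotations": {
            "message": "{{ $labels.pod }}/{{ $labels.container }} is close to its memory limit"
        },
    }


if __name__ == "__main__":
    rule = memory_depletion_rule("datahub-dev", "example-route")
    print(json.dumps(prometheus_rule("datahub-alerts", "datahub-dev", [rule]), indent=2))
```

Keeping the wrapper separate from the individual rules means the unticked categories (latency, stalled ingestion, security anomalies) could later be added as further rule functions without touching the manifest structure.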

tom-webber commented 4 months ago

Add PrometheusRule alerts for the DataHub namespaces in Cloud Platform, covering the AWS resources (RDS, OpenSearch), resource usage metrics (for the datahub-gms pod), deployment metrics, pod status metrics (out of memory, CrashLoopBackOff, frequent restarts), and ingress metrics (ModSecurity blocking events, error responses served).
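
The pod status and ingress cases listed above might translate into alert rules along these lines. This is a hedged sketch only: the alert names, thresholds, durations, and the `severity` value are hypothetical, the pod metrics are the standard kube-state-metrics series, and the `exported_namespace` label on the ingress-nginx metric is an assumption about how the cluster relabels it.

```python
def pod_status_rules(namespace, severity):
    """Return alert rule dicts for OOM kills, crash loops, restarts and ingress 5xx."""
    sel = f'namespace="{namespace}"'
    labels = {"severity": severity}
    return [
        {
            "alert": "PodOOMKilled",
            "expr": (
                "kube_pod_container_status_last_terminated_reason"
                f'{{{sel}, reason="OOMKilled"}} == 1'
            ),
            "for": "1m",
            "labels": labels,
        },
        {
            "alert": "PodCrashLoopBackOff",
            "expr": (
                "kube_pod_container_status_waiting_reason"
                f'{{{sel}, reason="CrashLoopBackOff"}} == 1'
            ),
            "for": "5m",
            "labels": labels,
        },
        {
            "alert": "PodFrequentRestarts",
            # More than 3 restarts in the last hour (threshold is illustrative).
            "expr": f"increase(kube_pod_container_status_restarts_total{{{sel}}}[1h]) > 3",
            "for": "1m",
            "labels": labels,
        },
        {
            "alert": "IngressServing5xx",
            # Assumption: ingress-nginx requests carry an exported_namespace label.
            "expr": (
                "sum(rate(nginx_ingress_controller_requests"
                f'{{exported_namespace="{namespace}", status=~"5.."}}[5m])) > 0'
            ),
            "for": "5m",
            "labels": labels,
        },
    ]


if __name__ == "__main__":
    for rule in pod_status_rules("datahub-dev", "example-route"):
        print(rule["alert"], "->", rule["expr"])
```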

OpenSearch metrics may not be sufficient: during a recent bottleneck event, OpenSearch was unresponsive, but Prometheus metrics were absent for the period of downtime.
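
One possible way to catch the "metrics go missing" case is an alert on the absence of a series, using PromQL's `absent()` function. The sketch below is an assumption about how that could look, not something the team has deployed: the metric name `example_opensearch_metric`, the `domain` label, and the `severity` value are placeholders, since the real exporter metric names are not given here.

```python
def absent_metric_rule(metric, selector, severity, for_duration="10m"):
    """Build an alert rule that fires when no series matches metric{selector}."""
    return {
        "alert": "MetricsAbsent",  # hypothetical alert name
        # absent() returns 1 when the selector matches no series at all.
        "expr": f"absent({metric}{{{selector}}})",
        "for": for_duration,
        "labels": {"severity": severity},
        "annotations": {
            "message": f"No {metric} samples received for {for_duration}"
        },
    }


if __name__ == "__main__":
    rule = absent_metric_rule(
        "example_opensearch_metric",  # placeholder metric name
        'domain="example-domain"',    # placeholder label selector
        "example-route",
    )
    print(rule["expr"])
```

Such a rule only signals that the metrics stopped arriving; it does not say why, so it would complement rather than replace a direct availability check.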

CP have recently put together a Terraform OpenSearch CloudWatch alarm module, but it is not yet suitable for use in user namespaces, as there’s no mechanism to instantiate infrastructure for exporting events to

- Create an SNS topic
- Deliver the events to SNS to be picked up by the reporting infrastructure

tom-webber commented 4 months ago

- CP issue for adding routes for Alertmanager (resolved)
- CP PR for PrometheusRules once Alertmanager route is added (merged)
- CP issue for adding Pingdom ID for Slack integration

ministryofjustice/data-catalogue#29
