Closed seanprivett closed 4 months ago
Add PrometheusRule alerts for the DataHub namespaces in Cloud Platform, covering the AWS resources (RDS, OpenSearch), resource usage metrics (for the datahub-gms pod),
deployment metrics, pod status metrics (out of memory, CrashLoopBackOff, frequent restarts), and ingress metrics (ModSecurity blocking events, error responses served).
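As a rough illustration of the pod status and ingress alerts described above, a PrometheusRule could look something like the sketch below. This is not an agreed design: the rule name, the `datahub-example` namespace, the `datahub-team` severity value, the `role: alert-rules` label, the `exported_namespace` label on ingress metrics, and all thresholds are assumptions to be checked against the Cloud Platform guidance and the actual namespaces. The metric names are the standard kube-state-metrics and ingress-nginx ones.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: datahub-alerts              # hypothetical rule name
  namespace: datahub-example        # hypothetical namespace, replace with the real one
  labels:
    role: alert-rules               # assumed label for rule discovery; confirm with Cloud Platform docs
spec:
  groups:
    - name: datahub-pod-status
      rules:
        # Fires when any container has been waiting in CrashLoopBackOff for 5 minutes.
        - alert: DatahubPodCrashLooping
          expr: kube_pod_container_status_waiting_reason{namespace="datahub-example", reason="CrashLoopBackOff"} == 1
          for: 5m
          labels:
            severity: datahub-team  # hypothetical routing label
          annotations:
            message: "Pod {{ $labels.pod }} is in CrashLoopBackOff."
        # Fires when a container's most recent termination was an OOM kill.
        - alert: DatahubPodOOMKilled
          expr: kube_pod_container_status_last_terminated_reason{namespace="datahub-example", reason="OOMKilled"} == 1
          labels:
            severity: datahub-team
          annotations:
            message: "Container {{ $labels.container }} in pod {{ $labels.pod }} was OOM killed."
        # Fires on more than 3 restarts within an hour (threshold is illustrative).
        - alert: DatahubPodFrequentRestarts
          expr: increase(kube_pod_container_status_restarts_total{namespace="datahub-example"}[1h]) > 3
          labels:
            severity: datahub-team
          annotations:
            message: "Pod {{ $labels.pod }} has restarted more than 3 times in the last hour."
    - name: datahub-ingress
      rules:
        # Fires on a sustained rate of 5xx responses from the ingress (threshold is illustrative).
        - alert: DatahubIngress5xxResponses
          expr: sum(rate(nginx_ingress_controller_requests{exported_namespace="datahub-example", status=~"5.."}[5m])) > 0.1
          for: 5m
          labels:
            severity: datahub-team
          annotations:
            message: "DataHub ingress is serving an elevated rate of 5xx responses."
```

Alerts for RDS, OpenSearch and ModSecurity are left out of this sketch because the metric names depend on which exporters are available in the cluster.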
OpenSearch metrics may not be sufficient: during a recent bottleneck event, OpenSearch was unresponsive, but its Prometheus metrics were absent for the period of downtime.
CP have recently put together a Terraform OpenSearch CloudWatch alarm module, but it is not yet suitable for use in user namespaces, as there’s no mechanism to instantiate infrastructure for exporting events to
Follows on from ministryofjustice/data-catalogue#28
Set up monitoring-based alerting for:
Cloud Platform guide to observability
PagerDuty
Consider (if time permits) adding PagerDuty as a middleware broker for deciding what is alerted, where, and when, even if we're not actually implementing an on-call rota.
Requirements