Open bnewbold opened 5 years ago
Partial progress has been made:
Still to be done:
Status update:
More updates:
The last remaining step is adding hot-standby (but non-synchronous) binary-mode (WAL) PostgreSQL replication, with the replica read-only by default and failover performed manually.
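Because the replication is asynchronous, the replica can fall behind the primary, so it is probably worth measuring that lag before any manual failover. A minimal sketch, assuming the two LSN strings are fetched separately (for example with `pg_current_wal_lsn()` on the primary and `pg_last_wal_replay_lsn()` on the replica, PostgreSQL 10+); the function names and the 16 MiB threshold are hypothetical, not part of the existing setup:

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position.

    An LSN is printed as two hex numbers: the high 32 bits, a slash,
    then the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def replication_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica has not yet replayed."""
    return lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn)


def lag_is_acceptable(primary_lsn: str, replica_lsn: str,
                      max_lag_bytes: int = 16 * 1024 * 1024) -> bool:
    """True if the replica is within max_lag_bytes (hypothetical threshold)."""
    return replication_lag_bytes(primary_lsn, replica_lsn) <= max_lag_bytes
```

The same number could be emitted as a gauge for graphing, or checked by hand as one step of a documented failover procedure.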
The PostgreSQL database setup should be configured for warm failover to a secondary machine in a different data center. This will mostly be a matter of testing and documenting operational procedures; we already have the hardware resources allocated.
HAProxy (or equivalent) should be deployed in front of the API, webface, and search API, with appropriate rate-limits and monitoring. Health-checking should enable relatively seamless failover between servers.
Nagios and other alerts should be in place to notify about disk space, SSL certificate expiration, and other basic system monitoring.
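As a rough illustration of what the disk-space and certificate-expiry checks compute, here is a standard-library-only sketch. It is not a Nagios plugin; the function names are hypothetical, and any warning thresholds would need to be chosen for the actual hosts:

```python
import shutil
import socket
import ssl
import time
from typing import Optional


def disk_used_fraction(path: str = "/") -> float:
    """Fraction of the filesystem containing `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def days_until_cert_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days until a certificate's notAfter timestamp.

    `not_after` uses the format returned by getpeercert(),
    e.g. 'Jan  5 09:34:43 2018 GMT'. Negative means already expired.
    """
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Fetch the notAfter field of the certificate served by host:port.

    Uses default verification, so an already-expired certificate raises
    an SSL error during the handshake instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A cron job or check script could call `days_until_cert_expiry(fetch_not_after(host))` and alert when the result drops below a chosen number of days.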
A public status page (lambstatus?) and offsite alerting service (cabot?) should be deployed.
The "help wanted" aspect of this issue is advice around postgres best practices/monitoring, Kafka monitoring (e.g., how to integrate Kafka topic lag metrics in statsd/grafana), simple log management (e.g., retain error lines longer than regular logs), and basic alerting on systemd daemon status (e.g., if a python worker daemon crashes, send email with last few log lines).
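For the Kafka lag question, one possible shape is sketched below: compute per-partition lag from offsets and send it as statsd gauges over UDP, which grafana can then graph. This is only a sketch under assumptions: the offset dictionaries are presumed to come from whatever Kafka client is in use, and the metric prefix, function names, and host/port defaults are hypothetical.

```python
import socket
from typing import Dict, List


def consumer_lag(end_offsets: Dict[int, int],
                 committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag: log end offset minus committed offset, floored at 0.

    Assumption: a partition with no committed offset is treated as offset 0.
    """
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def statsd_gauge_lines(topic: str, group: str, lag_by_partition: Dict[int, int],
                       prefix: str = "kafka.lag") -> List[str]:
    """Format lag as statsd gauge lines ('name:value|g'), plus a total."""
    base = f"{prefix}.{group}.{topic}"
    lines = [f"{base}.{partition}:{lag}|g"
             for partition, lag in sorted(lag_by_partition.items())]
    lines.append(f"{base}.total:{sum(lag_by_partition.values())}|g")
    return lines


def send_statsd(lines: List[str], host: str = "localhost", port: int = 8125) -> None:
    """Send newline-separated metrics to statsd in a single UDP datagram."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto("\n".join(lines).encode("ascii"), (host, port))
```

Topic or group names containing dots would add extra levels to the metric hierarchy, so they may need escaping before being used in metric names.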