internetarchive / fatcat

Perpetual Access To The Scholarly Record
https://guide.fatcat.wiki

High-availability Production Configuration #16

Open bnewbold opened 5 years ago

bnewbold commented 5 years ago

The PostgreSQL database setup should be configured for warm failover to a secondary machine in a different data center. This will mostly be a matter of testing and documenting operational procedures; we already have the hardware resources allocated.

HAProxy (or equivalent) should be deployed in front of the API, web interface, and search API, with appropriate rate limits and monitoring. Health-checking should enable relatively seamless failover between servers.
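One way the backends could expose something for the proxy to health-check is a small HTTP endpoint that returns 200 when all dependencies respond and 503 otherwise. This is only a sketch under assumptions: the `/health` path, the check names, and the port are hypothetical and not taken from the fatcat codebase, and HAProxy would need to be told to probe it (for example with `option httpchk GET /health`, since the default probe is `OPTIONS /`).

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_response(checks):
    """Run each check and return (http_status, json_body).

    `checks` maps a name to a zero-argument callable that returns True when
    the dependency is healthy. A check that raises counts as unhealthy.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, json.dumps(results, sort_keys=True).encode("utf-8")


def make_handler(checks, path="/health"):
    """Build a request handler class serving the health endpoint at `path`."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != path:
                self.send_error(404)
                return
            status, body = health_response(checks)
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep frequent health probes out of the request log

    return HealthHandler


def serve(checks, host="127.0.0.1", port=8080):
    """Serve the health endpoint until interrupted (hypothetical host/port)."""
    HTTPServer((host, port), make_handler(checks)).serve_forever()
```

With an endpoint like this, the proxy marks a server down when the probe returns 503 (HAProxy treats 2xx and 3xx responses as healthy) and routes traffic to the remaining servers.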

Nagios and other alerts should be in place to notify about disk space, SSL certificate expiration, and other basic system monitoring.
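As a rough illustration of the disk space and certificate checks, the sketch below follows the usual Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), and 2 (CRITICAL). The thresholds and function names are assumptions chosen for the example, not values from the actual deployment.

```python
import shutil
import socket
import ssl
import time

# Nagios plugin exit codes
OK, WARNING, CRITICAL = 0, 1, 2


def classify(value, warn, crit):
    """Map a 'lower is worse' measurement to a Nagios status code."""
    if value <= crit:
        return CRITICAL
    if value <= warn:
        return WARNING
    return OK


def disk_free_percent(path="/"):
    """Percentage of free space on the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def cert_days_remaining(host, port=443, timeout=10.0):
    """Days until the TLS certificate presented by host:port expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires - time.time()) / 86400.0


def check_disk(path="/", warn=20.0, crit=10.0):
    """Return (status, message) for free disk space (hypothetical thresholds)."""
    free = disk_free_percent(path)
    return classify(free, warn, crit), f"{free:.1f}% free on {path}"


def check_cert(host, warn=21.0, crit=7.0):
    """Return (status, message) for certificate expiry (hypothetical thresholds)."""
    days = cert_days_remaining(host)
    return classify(days, warn, crit), f"{host} certificate expires in {days:.0f} days"
```

A wrapper script would print the message and call `sys.exit(status)` so that Nagios (or any compatible scheduler) can pick up the result.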

A public status page (lambstatus?) and offsite alerting service (cabot?) should be deployed.

The "help wanted" aspect of this issue is advice around postgres best practices/monitoring, Kafka monitoring (eg, how to integrate kafka topic lag metrics in statsd/grafana), simple log management (eg, retain error lines longer than regular logs), and basic alerting on systemd daemon status (eg, if a python worker daemon crashes, send email with last few log lines).
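For the Kafka lag question, one possible shape is a small periodic job that computes per-partition consumer lag and emits it as statsd gauges, which Grafana can then graph. The sketch below assumes the end offsets and committed offsets have already been fetched with whatever Kafka client is in use; the metric naming scheme, topic, and group names are hypothetical.

```python
import socket


def partition_lag(end_offsets, committed_offsets):
    """Lag per partition: latest log offset minus the group's committed offset.

    Both arguments map partition number -> offset. As a simplification, a
    partition with no committed offset is reported with lag equal to its end
    offset, which overstates lag if old messages have already been expired.
    """
    lag = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition)
        lag[partition] = end if committed is None else max(end - committed, 0)
    return lag


def statsd_gauges(prefix, topic, group, lag):
    """Format lag values as statsd gauge lines (`name:value|g`)."""
    base = f"{prefix}.{topic}.{group}"
    lines = [f"{base}.partition_{p}.lag:{v}|g" for p, v in sorted(lag.items())]
    lines.append(f"{base}.total_lag:{sum(lag.values())}|g")
    return lines


def send_statsd(lines, host="127.0.0.1", port=8125):
    """Send each gauge as one UDP datagram to a statsd daemon."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            sock.sendto(line.encode("ascii"), (host, port))
```

Topic or group names that contain dots would add extra levels to the metric hierarchy, so they may need escaping before being used in metric names.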

bnewbold commented 3 years ago

Partial progress has been made:

Still to be done:

bnewbold commented 3 years ago

Status update:

bnewbold commented 2 years ago

More updates:

The last remaining step is adding hot-standby (but non-synchronous) binary-mode (WAL) postgresql replication, with the replica being read-only by default (manual failover).
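Once a hot standby exists, a monitoring check along the following lines could confirm that the replica is still in recovery (that is, read-only and replaying WAL) and not too far behind. This is a sketch rather than the project's actual tooling: it assumes a DB-API cursor connected to the replica, and the 300-second threshold is an arbitrary placeholder. `pg_is_in_recovery()` and `pg_last_xact_replay_timestamp()` are standard PostgreSQL functions.

```python
LAG_QUERY = """
SELECT pg_is_in_recovery(),
       EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))
"""


def replica_status(cursor, max_lag_seconds=300.0):
    """Return a one-line status string for a streaming replica.

    `cursor` is any DB-API cursor connected to the replica. The lag value is
    the age of the last replayed transaction, so it also grows while the
    primary is idle; treat it as an upper bound on real replication delay.
    """
    cursor.execute(LAG_QUERY)
    in_recovery, lag = cursor.fetchone()
    if not in_recovery:
        return "CRITICAL: server is not in recovery (promoted, or not a replica)"
    if lag is None:
        return "UNKNOWN: no transactions replayed yet"
    lag = float(lag)
    if lag > max_lag_seconds:
        return f"WARNING: replay lag {lag:.0f}s exceeds {max_lag_seconds:.0f}s"
    return f"OK: replay lag {lag:.0f}s"
```

Because failover is manual, a "not in recovery" result on the replica host is reported as critical: it means either a promotion has happened or the machine is not configured as a standby.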