High-availability Production Configuration

bnewbold commented 5 years ago

The PostgreSQL database setup should be configured for warm failover to a secondary machine in a different data center. This will mostly be a matter of testing and documenting operational procedures; we already have the hardware resources allocated.

HAProxy (or equivalent) should be deployed in front of the API, webface, and search API, with appropriate rate-limits and monitoring. Health-checking should enable relatively seamless failover between servers.
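
As a rough illustration of how health-check-driven failover behaves, here is a minimal Python sketch of the "rise"/"fall" idea used by HAProxy-style health checks: a server is marked down only after several consecutive failed checks, and marked up again only after several consecutive successes. The class and function names (`BackendHealth`, `pick_backend`) are hypothetical, and this models the logic only; it is not HAProxy configuration.

```python
class BackendHealth:
    """Tracks one backend's up/down state from a stream of check results."""

    def __init__(self, rise=2, fall=3):
        self.rise = rise    # consecutive successes needed to mark a down server up
        self.fall = fall    # consecutive failures needed to mark an up server down
        self.up = True
        self.streak = 0     # consecutive results that disagree with the current state

    def record(self, check_passed):
        """Feed in one health-check result; returns the (possibly updated) state."""
        if check_passed == self.up:
            # Result agrees with current state, so any pending transition resets.
            self.streak = 0
        else:
            self.streak += 1
            threshold = self.fall if self.up else self.rise
            if self.streak >= threshold:
                self.up = not self.up
                self.streak = 0
        return self.up


def pick_backend(backends):
    """Return the name of the first healthy backend in priority order, or None."""
    for name, health in backends:
        if health.up:
            return name
    return None
```

Requiring several consecutive results before flipping state avoids flapping between servers on a single dropped check, which is what makes the failover "relatively seamless" rather than jittery.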

Nagios and other alerts should be in place to notify about disk space, SSL certificate expiration, and other basic system monitoring.
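
For the disk space and certificate expiration checks, a minimal sketch of what a Nagios-style check could compute, using only the Python standard library. The thresholds and function names are assumptions for illustration; an actual deployment would more likely use existing Nagios plugins.

```python
import shutil
import socket
import ssl
import time

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(value, warn, crit):
    """Map a measurement to a status, where higher values are worse."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def disk_used_percent(path="/"):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def cert_days_remaining(host, port=443, now=None):
    """Days until the TLS certificate served at host:port expires."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400.0


def cert_status(days_left, warn_days=14, crit_days=3):
    """Fewer days left is worse, so negate before classifying."""
    return classify(-days_left, -warn_days, -crit_days)
```

A wrapper script would call, for example, `classify(disk_used_percent("/"), 80, 90)` and exit with the resulting code so that Nagios picks up the status.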

A public status page (lambstatus?) and offsite alerting service (cabot?) should be deployed.

The "help wanted" aspect of this issue is advice around postgres best practices/monitoring, Kafka monitoring (eg, how to integrate kafka topic lag metrics in statsd/grafana), simple log management (eg, retain error lines longer than regular logs), and basic alerting on systemd daemon status (eg, if a python worker daemon crashes, send email with last few log lines).
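
On the Kafka lag question, one possible approach is sketched below: compute per-partition lag as the latest log offset minus the consumer group's committed offset, then emit the values as statsd gauges (`name:value|g`) over UDP so they can be graphed in grafana. How the offsets are fetched depends on the Kafka client library and is not shown; the metric naming scheme, function names, and the default statsd address are assumptions.

```python
import socket


def partition_lag(end_offsets, committed_offsets):
    """Lag per partition: latest log offset minus the group's committed offset.

    Both arguments are dicts mapping partition number to offset.
    """
    lag = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition)
        if committed is None:
            # Nothing committed yet: count the whole partition as unread
            # (assumes the log starts at offset 0).
            lag[partition] = end
        else:
            lag[partition] = max(end - committed, 0)
    return lag


def statsd_gauges(prefix, topic, group, lag):
    """Format lag values as statsd gauge lines."""
    # Dots separate path components in statsd/graphite, so flatten the topic name.
    base = f"{prefix}.{topic.replace('.', '_')}.{group}"
    lines = [f"{base}.partition_{p}.lag:{v}|g" for p, v in sorted(lag.items())]
    lines.append(f"{base}.total_lag:{sum(lag.values())}|g")
    return lines


def send_statsd(lines, host="localhost", port=8125):
    """Send each metric line as its own UDP datagram (fire and forget)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for line in lines:
            sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()
```

Run from a cron job or a small daemon, this would give one gauge per partition plus a total per consumer group, which is usually enough to alert on a stalled worker.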

bnewbold commented 3 years ago

Partial progress has been made:

haproxy configured in a single-node configuration
using uptimerobot for status page (https://status.fatcat.wiki) and it is working better than previous attempts (lambstatus, cabot, etc)

Still to be done:

keepalived for multi-node haproxy
more monitoring and alerting
move fatcat ES usage to multi-node cluster (shared with scholar index)
postgres replication (and update to postgres 13)

bnewbold commented 3 years ago

Status update:

search is replicated, and upgraded to elasticsearch 7.10
virtual IP managed by keepalived is configured with a second haproxy node, but DNS has not been switched over to the virtual IP yet
have not updated to postgresql 13 or configured postgresql replication yet, but hardware is ready for it
need to implement "read-only mode" in web interface and API (or just figure out the correct configuration for this behavior)
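
One way a "read-only mode" could be enforced at the web layer is sketched below as a small WSGI middleware that rejects mutating requests while a flag is set. This is only an assumption about how it might be done for a Python web application; the class name and the `is_read_only` callable are hypothetical, and the database-level configuration is a separate matter.

```python
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


class ReadOnlyMiddleware:
    """Wraps a WSGI app and refuses non-safe HTTP methods in read-only mode."""

    def __init__(self, app, is_read_only):
        self.app = app
        # Callable returning True while the service should refuse writes,
        # for example by checking a config value or the presence of a flag file.
        self.is_read_only = is_read_only

    def __call__(self, environ, start_response):
        method = environ.get("REQUEST_METHOD", "GET")
        if self.is_read_only() and method not in SAFE_METHODS:
            body = b"Service is temporarily in read-only mode.\n"
            start_response(
                "503 Service Unavailable",
                [
                    ("Content-Type", "text/plain; charset=utf-8"),
                    ("Content-Length", str(len(body))),
                    ("Retry-After", "300"),
                ],
            )
            return [body]
        # Reads, and all requests outside read-only mode, pass through unchanged.
        return self.app(environ, start_response)
```

Reads keep working during a failover while writes fail fast with a clear message, instead of surfacing as database errors from a read-only replica.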

bnewbold commented 2 years ago

More updates:

upgraded to postgresql 13
upgraded to ubuntu focal
"read-only mode" banner and database configuration documented

The last remaining step is adding hot-standby (but non-synchronous) binary-mode (WAL) postgresql replication, with the replica being read-only by default (manual failover).
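
Once a hot-standby replica exists, a monitoring check along these lines could confirm that it is still in recovery (read-only) and is not falling far behind. This is a sketch only: it takes any DB-API cursor connected to the replica, uses the PostgreSQL functions `pg_is_in_recovery()` and `pg_last_xact_replay_timestamp()`, and the function name and lag threshold are assumptions.

```python
def replica_status(cursor, max_lag_seconds=300.0):
    """Report whether a PostgreSQL server looks like a healthy hot standby.

    `cursor` is any DB-API cursor connected to the server being checked.
    """
    cursor.execute("SELECT pg_is_in_recovery()")
    in_recovery = bool(cursor.fetchone()[0])

    lag = None
    if in_recovery:
        # Seconds since the last replayed transaction was committed on the primary.
        # NULL if no transaction has been replayed yet.
        cursor.execute(
            "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"
        )
        value = cursor.fetchone()[0]
        lag = None if value is None else float(value)

    healthy = in_recovery and lag is not None and lag <= max_lag_seconds
    return {"in_recovery": in_recovery, "lag_seconds": lag, "healthy": healthy}
```

One caveat: this lag estimate also grows when the primary is simply idle, so on a low-write database the threshold would need to be generous, or the check combined with a comparison of WAL positions on both servers.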

internetarchive / fatcat

High-availability Production Configuration #16