internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Create replicas of key systems to enable automatic failover during downtime #8043

Open cclauss opened 1 year ago

cclauss commented 1 year ago

Epic / Tracking Issue for a significant work effort.

I am always frustrated when our service is down for our users.

Describe the problem that you'd like solved

Our platform is increasingly mission-critical for users around the globe, so let's leverage our Docker-based architecture to implement automatic failover of key services.

The following list is in the recommended order of implementation:

Services currently running on multiple servers

It will be important to distinguish services that will operate in primary/backup mode (like the database) from those that will operate in load-sharing / parallel mode (like Memcache). We will need to document and test the failover conditions and constraints for each service. For example, failure of the primary database server might put the site in read-only mode on the backup server.
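To make the distinction above concrete, here is a minimal sketch of how a failover decision could differ between the two modes. The `Service` type, the `plan_failover` function, the host names, and the "read-only on backup" policy are all illustrative assumptions, not existing Open Library code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Mode(Enum):
    PRIMARY_BACKUP = "primary/backup"  # e.g. the database
    LOAD_SHARING = "load-sharing"      # e.g. Memcache


@dataclass
class Service:
    name: str
    mode: Mode
    healthy_hosts: List[str]       # hosts currently passing health checks
    primary: Optional[str] = None  # only meaningful for PRIMARY_BACKUP


def plan_failover(service: Service) -> dict:
    """Decide which hosts serve traffic and whether writes are allowed."""
    if not service.healthy_hosts:
        return {"hosts": [], "read_only": True, "status": "down"}
    if service.mode is Mode.LOAD_SHARING:
        # Any healthy host can serve; losing one only reduces capacity.
        return {"hosts": list(service.healthy_hosts), "read_only": False, "status": "ok"}
    if service.primary in service.healthy_hosts:
        return {"hosts": [service.primary], "read_only": False, "status": "ok"}
    # Primary is unavailable: serve from a backup in read-only mode
    # (assumed policy, taken from the example in the paragraph above).
    return {"hosts": [service.healthy_hosts[0]], "read_only": True, "status": "degraded"}


# Hypothetical host names, for illustration only.
db = Service("database", Mode.PRIMARY_BACKUP, ["db-backup"], primary="db-primary")
cache = Service("memcache", Mode.LOAD_SHARING, ["mc-1"])
print(plan_failover(db))
print(plan_failover(cache))
```

Writing the policy down as a table or function like this, per service, would also give the chaos-style tests something concrete to assert against.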

Proposal & Constraints

Document and implement a failover approach for each of the services listed above, and then use Chaos Monkey-style testing to ensure service resilience in the face of unplanned software, operating system, and hardware failures.
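As a rough illustration of what one round of such testing could look like, the sketch below stops one randomly chosen host, checks whether the site still responds, and then restores the host. Everything here is an assumption for illustration: `chaos_round` is a hypothetical helper, the `stop`, `start`, and `site_is_up` callables would need to be wired to real tooling, and the in-memory `running` dict only stands in for a real deployment.

```python
import random
from typing import Callable, Dict, List


def chaos_round(
    hosts: List[str],
    stop: Callable[[str], None],
    start: Callable[[str], None],
    site_is_up: Callable[[], bool],
    rng: random.Random,
) -> Dict[str, object]:
    """Stop one random host, record whether the site survived, then restore it."""
    victim = rng.choice(hosts)
    stop(victim)
    try:
        survived = site_is_up()
    finally:
        start(victim)  # always restore the host, even if the check raises
    return {"victim": victim, "survived": survived}


# In-memory stand-in for a real deployment (hypothetical host names).
running = {"web1": True, "web2": True}


def stop(host: str) -> None:
    running[host] = False


def start(host: str) -> None:
    running[host] = True


def site_is_up() -> bool:
    # Assumed rule: the site is up while at least one host is running.
    return any(running.values())


rng = random.Random(0)  # seeded so runs are repeatable
results = [chaos_round(list(running), stop, start, site_is_up, rng) for _ in range(5)]
print(results)
```

Passing the stop/start/health-check actions in as callables keeps the test loop independent of whether hosts are containers, VMs, or physical machines.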

The hosts in a failover pair must be placed on different virtual machines to ensure resilience to hardware failures. This should also simplify the process of planned downtime and hardware migration while also distributing workloads among virtual machines.
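The placement constraint above could be checked automatically. The following is a minimal sketch under assumed inputs: `pairs` maps each service to its two failover hosts and `host_to_vm` maps each host to the virtual machine it runs on. Both mappings, the function name, and all host and VM names are hypothetical.

```python
from typing import Dict, List, Tuple


def colocated_pairs(
    pairs: Dict[str, Tuple[str, str]],
    host_to_vm: Dict[str, str],
) -> List[str]:
    """Return the services whose two failover hosts sit on the same VM."""
    return sorted(
        service
        for service, (first, second) in pairs.items()
        if host_to_vm[first] == host_to_vm[second]
    )


# Hypothetical inventory, for illustration only.
pairs = {
    "database": ("db-a", "db-b"),
    "memcache": ("mc-a", "mc-b"),
}
host_to_vm = {"db-a": "vm1", "db-b": "vm2", "mc-a": "vm3", "mc-b": "vm3"}

violations = colocated_pairs(pairs, host_to_vm)
print(violations)  # ['memcache']
```

A check like this could run in CI against the deployment configuration so that a placement violation is caught before a hardware failure exposes it.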

Many of these services might require a two-step migration to failover. The minimum-viable-failover phase will prove basic service failover while documenting but not solving all corner cases. The full failover phase will improve automation and solve all documented corner cases.

Tracking issue

database failover:

database upgrade:

Stakeholders

@abezella @mekarpeles @cdrini @scottbarnes

tfmorris commented 1 year ago

It is important that these multiple servers/VMs run in different availability zones without shared location, power, upstream infrastructure, etc. Currently one of the failure modes is "Hey, we're going to shut off the power in our (only) data center for hours." With the ubiquitous availability of cloud computing services, it's cheap to build local/cloud hybrid solutions without having to invest in geographically dispersed data centers for disaster recovery.

github-actions[bot] commented 9 months ago

Assignees removed automatically after 14 days.