Open cclauss opened 1 year ago
It is important that these multiple servers/VMs run in different availability zones without shared location, power, upstream infrastructure, etc. Currently one of the failure modes is "Hey, we're going to shut off the power in our (only) data center for hours." With the ubiquitous availability of cloud computing services, it's cheap to build local/cloud hybrid solutions without having to invest in geographically dispersed data centers for disaster recovery.
I am always frustrated when our service is down for our users.
Describe the problem that you'd like solved
Our platform is increasingly mission-critical for users around the globe so let's leverage our Docker-based architecture to implement automatic failover of key services.
The following list is in the recommended order of implementation:
Services currently running on multiple servers
- ol-mem0, ol-mem1, ol-mem2 running memcached on bare metal
- ol-db1 (primary) and ol-db2 (backup) running Postgres on bare metal
- ol-web1 and ol-web2 running Docker container openlibrary-web-1
- ol-solr0 and ol-solr1 running Docker container openlibrary_solr_1
Services currently running on a single server
- openlibrary-covers-1 and openlibrary-covers-2
- ol-home0 running seven different Docker containers
- ol-www0 running haproxy and nginx Docker containers
It will be important to distinguish services that will operate in primary/backup mode (like database) from those which will operate in load-sharing / parallel mode (like Memcache). We will need to document and test the failover conditions and constraints. For example, failure of the primary database server might put the site on read-only mode on the backup server.
Proposal & Constraints
Document and implement a failover approach for each of the services listed above and then use chaos monkey-like testing to ensure service resilience in the face of unplanned software, operating system, and hardware failure.
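As a rough illustration of what a chaos monkey-like drill could look like, the sketch below takes one randomly chosen host out of a group, probes the service, and always restores the host afterwards. The `run_drill` helper and its `stop`, `start`, and `probe` callables are hypothetical names; in a real drill they would wrap whatever actually stops a container or host and checks service health. Here they only manipulate an in-memory set.

```python
import random


def run_drill(hosts, stop, start, probe, rng=random):
    """Take one randomly chosen host down, probe the service, then restore it.

    stop/start/probe are injected callables (hypothetical): in a real drill
    they would stop a host or container and run a health check.
    """
    victim = rng.choice(hosts)
    stop(victim)
    try:
        healthy = probe()
    finally:
        start(victim)  # always restore the host, even if the probe raises
    return victim, healthy


if __name__ == "__main__":
    # Simulated memcached group: the service counts as healthy
    # while at least one node is still up.
    up = {"ol-mem0", "ol-mem1", "ol-mem2"}
    victim, healthy = run_drill(sorted(up), up.discard, up.add, lambda: len(up) >= 1)
    print(f"took down {victim}; service healthy during drill: {healthy}")
```

Injecting the callables keeps the drill logic testable without touching real servers.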
The hosts in a failover pair must be placed on different virtual machines to ensure resilience to hardware failures. This should also simplify the process of planned downtime and hardware migration while also distributing workloads among virtual machines.
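The placement constraint above could be checked automatically. The following is a minimal sketch, assuming a host-to-machine mapping is available somewhere; the `PLACEMENT` values (`vm-a`, `vm-b`) are invented for illustration and do not describe the real infrastructure. Host names come from the service list in this issue.

```python
from itertools import combinations

# Hypothetical host -> machine placement; the real mapping is not in this issue.
PLACEMENT = {
    "ol-db1": "vm-a",
    "ol-db2": "vm-b",
    "ol-web1": "vm-a",
    "ol-web2": "vm-a",  # deliberately colocated to show a violation
}

# Failover groups taken from the service list in this issue.
FAILOVER_GROUPS = {
    "postgres": ["ol-db1", "ol-db2"],
    "web": ["ol-web1", "ol-web2"],
}


def colocated_hosts(groups, placement):
    """Return (service, host_a, host_b) for every pair that shares a machine."""
    violations = []
    for service, hosts in groups.items():
        for a, b in combinations(hosts, 2):
            if placement[a] == placement[b]:
                violations.append((service, a, b))
    return violations


if __name__ == "__main__":
    for service, a, b in colocated_hosts(FAILOVER_GROUPS, PLACEMENT):
        print(f"{service}: {a} and {b} share {PLACEMENT[a]}")
```

A check like this could run whenever hosts are migrated, so a pair does not silently end up on the same hardware.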
Many of these services might require a two-step migration to failover. The minimum-viable-failover phase will prove basic service failover while documenting but not solving all corner cases. The full failover phase will improve automation and solve all documented corner cases.
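One way to start documenting the failover conditions is a small table of services, their hosts, and their mode, with the expected outcome of losing a host. The sketch below covers only the two modes named in this issue (database as primary/backup, Memcache as load-sharing) plus one single-server service. The `Service` class, the `Mode` enum, and the outcome strings are illustrative assumptions; the read-only outcome mirrors the example given earlier in this issue and is not a settled decision.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    PRIMARY_BACKUP = "primary/backup"
    LOAD_SHARING = "load-sharing"
    SINGLE = "single server"


@dataclass(frozen=True)
class Service:
    name: str
    hosts: tuple  # for PRIMARY_BACKUP, the first host is the primary
    mode: Mode


SERVICES = [
    Service("postgres", ("ol-db1", "ol-db2"), Mode.PRIMARY_BACKUP),
    Service("memcached", ("ol-mem0", "ol-mem1", "ol-mem2"), Mode.LOAD_SHARING),
    Service("haproxy/nginx", ("ol-www0",), Mode.SINGLE),
]


def expected_outcome(service, failed_host):
    """Describe what should happen to a service when one host fails."""
    if failed_host not in service.hosts:
        return "unaffected"
    survivors = [h for h in service.hosts if h != failed_host]
    if not survivors:
        return "outage"
    if service.mode is Mode.PRIMARY_BACKUP:
        if failed_host == service.hosts[0]:
            return f"read-only on {survivors[0]}"
        return "primary still serving; backup lost"
    return "load shared by " + ", ".join(survivors)


if __name__ == "__main__":
    for svc in SERVICES:
        print(f"{svc.name}: {expected_outcome(svc, svc.hosts[0])}")
```

The same table could later drive the minimum-viable-failover tests, with each documented corner case becoming another expected outcome.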
Tracking issue
database failover:
- ol-db1 goes down (i.e. auto switch to ol-db2 in read-only mode)
- ol-db1 & ol-db2 networking so that if/when ol-db1 goes down, Open Library is able to gracefully switch to ol-db2 in read-only mode
- openlibrary.yml config specifies that ol-db1 is our database so what change would enable failover?
- ol-db2 to take over the IP or hostname of ol-db1 in the event of an outage?
database upgrade:
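Returning to the database failover items above, one possible answer to the openlibrary.yml question is for the application to know both hosts and pick one at connection time. The sketch below is only an assumption about how that could work, not how Open Library currently connects. `choose_database` and `is_reachable` are hypothetical helpers, and a TCP probe on the default Postgres port (5432) only shows that the port accepts connections, not that the database is healthy.

```python
import socket


def is_reachable(host, port=5432, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def choose_database(primary, backup, reachable=is_reachable):
    """Return (host, read_only): the primary if it is up, else the backup in read-only mode."""
    if reachable(primary):
        return primary, False
    if reachable(backup):
        return backup, True
    raise ConnectionError(f"neither {primary} nor {backup} is reachable")


if __name__ == "__main__":
    # Simulate an ol-db1 outage with a fake reachability check.
    host, read_only = choose_database("ol-db1", "ol-db2", reachable=lambda h: h == "ol-db2")
    print(host, "read-only" if read_only else "read-write")
```

The alternative raised in the list, having ol-db2 take over the IP or hostname of ol-db1, would move this decision out of the application and into the network layer.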
Stakeholders
@abezella @mekarpeles @cdrini @scottbarnes