EPIC: Remove single-point-of-failure db server

nickstenning commented 7 years ago

Context

At the moment we run a server called (for historical reasons) db, which runs the following services:

MySQL (for WordPress)
Redis (for WordPress)
~~RabbitMQ (for Hypothesis)~~ (We've moved this to CloudAMQP.)
~~Squid proxy (for Via)~~ (We've moved this into the Docker container.)

Currently, a failure of this server will:

render the blog unavailable
~~render Via unavailable~~
~~cause major data consistency problems for Hypothesis (new annotations will not be added to the search index)~~
~~disable "real-time" annotation updates~~
~~disable outbound email from Hypothesis~~

In addition, the fact that this server is a single point of failure makes it inconvenient to take it down for security updates, and as a result it is the only server in our infrastructure which does not regularly get kernel upgrades.

Discussion

Ideally, we would remove all single points of failure from the infrastructure we use for "business-critical systems" (i.e. for annotation services).

This goal will be aided by moving WordPress to Pantheon, which will eliminate MySQL and Redis from the list of services that need to be replaced before db can be shut down.

~~We can likely include Squid in the Via Docker container so that each instance talks to its own local proxy.~~ In light of the need to accelerate this work due to an AWS maintenance window, we've done this.

~~The last remaining service is RabbitMQ. We can either look into maintaining our own clustered RabbitMQ, or possibly moving to a managed service such as CloudAMQP.~~ In light of the need to accelerate this work due to an AWS maintenance window, we've migrated to CloudAMQP.

chdorner commented 7 years ago

Of the four services, we've migrated two away (RabbitMQ moved to a hosted provider, Squid moved into the Docker containers).

The other two services are used only by WordPress. @nickstenning is moving them to a new server today, because we might not migrate WordPress to Pantheon before the maintenance happens on the db instance. Once the new WordPress website is up and running, we can terminate the server and the work is done.

nickstenning commented 7 years ago

This work was accelerated due to an AWS maintenance window.

MySQL and Redis are now running on a (hopefully temporary) new wpdb instance.

hypothesis / product-backlog

EPIC: Remove single-point-of-failure db server #256
