rethinkdb / docs

RethinkDB documentation
http://rethinkdb.com/docs

What to do when a server goes down #876

Open AtnNn opened 9 years ago

AtnNn commented 9 years ago

The failover documentation explains what RethinkDB will do to try to maintain availability when a server goes down.

It might be useful to mention that although the tables will become available again after automatic failover kicks in, some issues might pop up, such as "Table test.foo is available for all operations, but some replicas are not ready." Users may need to connect the server back to the cluster to regain full health, or adjust the replication settings.
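For reference, a minimal sketch of how that state could be inspected from the Python driver; the `test.foo` table name, the connection parameters, and the `conn` handle are placeholders for illustration, not part of the original report:

```python
import rethinkdb as r  # classic driver import; newer releases use `from rethinkdb import RethinkDB`

# Placeholder connection details.
conn = r.connect(host='localhost', port=28015)

# Per-table health summary. After automatic failover this typically shows
# ready_for_writes == True while all_replicas_ready == False.
status = r.db('test').table('foo').status().run(conn)
print(status['status'])

# The same problem also shows up as an entry in the system issues table.
for issue in r.db('rethinkdb').table('current_issues').run(conn):
    print(issue['type'], '-', issue['description'])
```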

chipotle commented 9 years ago

When you say "connect the server back to the cluster to regain full health," what does this mean in terms of user action? Restarting the server with the --join option? And which server -- the one(s) with the replicas that aren't ready? What replication settings would need to be adjusted, the list of replicas in the shard key for a given table in table_config?

Sorry to ask questions about virtually every noun. :)

chipotle commented 9 years ago

Pinging @danielmewes for any comments on this when you get a chance. It's worth circling back around to.

danielmewes commented 9 years ago

When you say "connect the server back to the cluster to regain full health," what does this mean in terms of user action? Restarting the server with the --join option? And which server -- the one(s) with the replicas that aren't ready?

Yes. In the common case, some of the servers will have failed because of maintenance, or a hardware, power, or network failure. So the solution is to start the server back up (in the way you mention), or to resolve the issue by replacing the hardware, reconnecting the network, etc. Once the server comes back and joins the cluster again, the issue will go away. Note that this assumes that the server's data remains intact. A server whose hard drive broke, for example, cannot be brought back up in this sense. If that's the case, users should proceed to the second way of resolving the issue, described below.
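For that first path (the failed server comes back), a hedged sketch of how one might confirm from the Python driver that the table has returned to full health once the server has rejoined; the table name and connection details are again placeholders:

```python
import rethinkdb as r

conn = r.connect(host='localhost', port=28015)

# After the missing server has been restarted (e.g. with --join) and has
# reconnected, block until every replica has caught back up.
result = r.db('test').table('foo').wait(
    wait_for='all_replicas_ready', timeout=300).run(conn)
print(result)  # {'ready': 1} once the table is fully healthy again
```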

What replication settings would need to be adjusted, the list of replicas in the shard key for a given table in table_config?

The missing servers that are not expected to come back must be removed from the replicas list in table_config. If the user wants to maintain the same replication factor they had before the failure, they might want to replace the missing replica in the configuration with a different server instead of just removing it. The easiest way to perform these steps is to run reconfigure, or to use the web UI to reconfigure the number of replicas. reconfigure (as well as the web UI, which is built on top of it) will only pick servers out of the set of currently connected servers for the new replica list, so the missing servers will automatically be removed from the configuration in the process.
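A hedged sketch of that reconfigure path in the Python driver; the shard and replica counts are placeholders for whatever configuration the table should end up with:

```python
import rethinkdb as r

conn = r.connect(host='localhost', port=28015)

# Preview the new configuration first; only currently connected servers
# are picked, so permanently missing servers drop out of the replica list.
preview = r.db('test').table('foo').reconfigure(
    shards=2, replicas=3, dry_run=True).run(conn)
print(preview['config_changes'])

# Apply it for real (dry_run defaults to False).
r.db('test').table('foo').reconfigure(shards=2, replicas=3).run(conn)
```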

If a majority of a table's replicas no longer exist, it might be necessary to perform a two-step recovery (sketched below):

  1. Run reconfigure with emergency_repair="unsafe_rollback" on the table to make it configurable by normal means again.
  2. Reconfigure the table either through (a) a second call to reconfigure (without emergency_repair), (b) the web UI, or (c) removing the missing servers manually from the replicas and nonvoting_replicas lists in table_config.
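
A hedged sketch of that two-step recovery with the Python driver; the table name and replica counts are placeholders, and note that an unsafe_rollback repair can discard writes that only reached the lost replicas:

```python
import rethinkdb as r

conn = r.connect(host='localhost', port=28015)

# Step 1: emergency repair rolls the table back to the replicas that are
# still available, which makes it configurable by normal means again.
r.db('test').table('foo').reconfigure(
    emergency_repair='unsafe_rollback').run(conn)

# Step 2: an ordinary reconfigure then builds a fresh replica list out of
# the currently connected servers.
r.db('test').table('foo').reconfigure(shards=2, replicas=3).run(conn)
```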