hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Add snapshot/restore to outage recovery guide #2583

Closed slackpad closed 3 years ago

slackpad commented 7 years ago

We definitely need to mention snapshot/restore on here. Things to cover:

  1. Mention Consul Enterprise and the snapshot agent.
  2. Show an example disaster recovery restore and mention how it works when restoring into a fresh cluster.
  3. Mention how ?stale can be used to snapshot even if there's no leader, and how consul snapshot inspect can help you figure out which snapshot is better.
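
As a rough sketch of item 3, the shell helpers below save a snapshot from one specific server with `-stale` and then compare the Raft index that `consul snapshot inspect` reports for each file. The function names, addresses, and file names are hypothetical. The parsing assumes the inspect output contains a line beginning with `Index`, and treating the highest index as the better snapshot is a simplification that ignores the Raft term.

```shell
#!/bin/sh
# Sketch only: addresses and file paths passed to these helpers are hypothetical.

# Save a snapshot from one specific server, allowing a non-leader to answer.
# $1 = HTTP address of the server, $2 = output file
save_stale_snapshot() {
    consul snapshot save -stale -http-addr="$1" "$2"
}

# Print the Raft index recorded in a snapshot file.
# Assumes `consul snapshot inspect` prints a line of the form "Index <number>".
snapshot_index() {
    consul snapshot inspect "$1" | awk '$1 == "Index" { print $2; exit }'
}

# Print the path of the snapshot with the highest Raft index among the arguments.
newest_snapshot() {
    best=""
    best_index=-1
    for f in "$@"; do
        idx=$(snapshot_index "$f")
        [ -n "$idx" ] || continue
        if [ "$idx" -gt "$best_index" ]; then
            best=$f
            best_index=$idx
        fi
    done
    [ -n "$best" ] && echo "$best"
}
```

With no leader, one could call `save_stale_snapshot` once per surviving server and pass the resulting files to `newest_snapshot` to decide which one to restore.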

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/consul-tool/_TitQGHdRSA/18mZiFnJCQAJ

rfay commented 7 years ago

It would be good to discuss how to use the automatically created snapshots also (they seem to be created regularly every few hours).

cd /var/lib/consul/raft/snapshots/2-146226-1481465408058
[in snapshot directory]
# sha256sum * >SHA256SUMS
# tar -czf /tmp/recreated.snap *
# consul snapshot restore -token=... /tmp/recreated.snap
Restored snapshot

mpuncel commented 7 years ago

I'm exploring using the new snapshots feature as a backup mechanism as a mitigation tactic against accidental data loss.

I noticed that in the snapshot docs it says:

Restores involve a potentially dangerous low-level Raft operation that is not designed to handle server failures during a restore. This operation is primarily intended to be used when recovering from a disaster, restoring into a fresh cluster of Consul servers.

Can you add some clarification of what exactly that means? Conventional wisdom about database backups is that you should exercise them regularly. If we were to use the restore operation on our most recent snapshot weekly would we be at risk of data loss?

Probably unrelated to this issue, but is there some feasible mechanism that could be added to restore only certain keys? For example, imagine that 1000 keys were deleted 6 hours ago and many other keys were modified/updated since then. Would there be a way to restore only the 1000 deleted keys? Or do we need to keep our own separate dump of kv pairs and restore them through the normal /v1/kv API?

Edit: looks like once we upgrade we can use kv import and kv export to replace our JSON dumping process, but it uses the same /v1/kv API, so it will perform at the same speed.
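
For reference, a minimal sketch of that export/import round trip using the `consul kv export` and `consul kv import` commands. The helper names, prefix, and file name are hypothetical, and this is only one way to wrap the two commands.

```shell
#!/bin/sh
# Sketch only: prefixes and file names passed to these helpers are hypothetical.

# Dump every key under a prefix to a JSON file.
# $1 = KV prefix, $2 = output file
kv_backup() {
    consul kv export "$1" > "$2"
}

# Write the entries of a previously exported JSON file back through the KV API.
# $1 = JSON file produced by kv_backup
kv_restore() {
    consul kv import "@$1"
}
```

Because the export is a plain JSON file, it could in principle be filtered down to the deleted keys before being passed to `kv_restore`, though that filtering step is not shown here.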

slackpad commented 7 years ago

Can you add some clarification of what exactly that means? Conventional wisdom about database backups is that you should exercise them regularly. If we were to use the restore operation on our most recent snapshot weekly would we be at risk of data loss?

There's a little more detail in the comment here. The restore is implemented by having the leader take on the state of the snapshot and then bump the Raft index, which creates a "hole" in the Raft log and causes the snapshot to go out to its followers. This means that the server commits the restore before replicating anything to its followers, which is weird from a Raft perspective, and could leave the cluster in an incorrect state if the leader were to die during that restore operation. If that happened you might have to blow away your server state and do the restore into a fresh cluster to recover. This should be a very unusual case to hit in practice (and the restore API returns success only once the followers have replicated the snapshot itself), but we wanted to fully disclose this possibility.
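
Since the restore goes through the leader, a cautious wrapper might check for a leader before starting. The sketch below is one possible approach, not the documented procedure: the helper names and snapshot path are hypothetical, and the check assumes `consul operator raft list-peers` prints a header row followed by one row per server whose State column contains the word `leader`. It does nothing to protect against the leader dying mid-restore.

```shell
#!/bin/sh
# Sketch only: the snapshot path passed to restore_snapshot is hypothetical.

# Succeed only if `consul operator raft list-peers` reports a leader.
# Assumes a header row, then one row per server with "leader" in a later column.
cluster_has_leader() {
    consul operator raft list-peers |
        awk 'NR > 1 { for (i = 2; i <= NF; i++) if ($i == "leader") found = 1 }
             END { exit !found }'
}

# Restore a snapshot, refusing to start when no leader is present.
# $1 = snapshot file
restore_snapshot() {
    if ! cluster_has_leader; then
        echo "no leader elected; not attempting restore" >&2
        return 1
    fi
    consul snapshot restore "$1"
}
```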

richard-mauri commented 7 years ago

Where is this outage recovery guide? The link mentioned in the Google Groups thread returns a 404.

slackpad commented 7 years ago

@richard-mauri https://www.consul.io/docs/guides/outage.html

rfay commented 7 years ago

The thing that's still missing from https://www.consul.io/docs/guides/outage.html is the simple restore of a snapshot. On our cluster we take and save regular snapshots; of course Consul takes them as well. Snapshots can easily be restored to build a cluster even from scratch.
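
A minimal sketch of what taking regular snapshots could look like, for example when run from cron. The helper names, the target directory, and the file naming scheme are hypothetical, and this is not the process linked below.

```shell
#!/bin/sh
# Sketch only: the target directory and file naming scheme are hypothetical.

# Build a timestamped file name such as consul-20170526T091700Z.snap.
snapshot_name() {
    echo "consul-$(date -u +%Y%m%dT%H%M%SZ).snap"
}

# Save a snapshot into a directory and print the path that was written.
# $1 = target directory
take_snapshot() {
    out="$1/$(snapshot_name)"
    consul snapshot save "$out" >/dev/null && echo "$out"
}
```

A saved file can later be passed to `consul snapshot restore` when rebuilding a cluster.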

Our disaster restore process is at https://github.com/drud/vault-consul-on-kube/blob/master/troubleshooting.md#complete-loss-and-rebuild-with-recovery-using-a-consul-snapshot

-Randy

ChipV223 commented 3 years ago

Hi all!

I see it's been a while since the last comment, but wanted to report that documentation on how to use the consul snapshot commands to recover from an outage can be found in the following locations:

https://learn.hashicorp.com/tutorials/consul/backup-and-restore

https://learn.hashicorp.com/tutorials/consul/recovery-outage?in=consul/datacenter-operations

I'll go ahead and close this, but do reach out if you have any other questions or comments!