Closed slackpad closed 3 years ago
It would also be good to discuss how to use the automatically created snapshots (they seem to be created regularly, every few hours).
cd /var/lib/consul/raft/snapshots/2-146226-1481465408058
[in snapshot directory]
# sha256sum * >SHA256SUMS
# tar -czf /tmp/recreated.snap *
# consul snapshot restore -token=... /tmp/recreated.snap
Restored snapshot
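The directory name in the listing above looks like it follows a term-index-timestamp pattern (here term 2, Raft index 146226, and a millisecond Unix timestamp). That reading is an assumption drawn from the name itself; nothing in this thread confirms it. Under that assumption, a small helper (the name `snapshot_fields` is hypothetical) can split the name, which may make it easier to pick out the most recent automatic snapshot:

```shell
# Split a snapshot directory name of the assumed form TERM-INDEX-TIMESTAMP_MS.
# snapshot_fields is a hypothetical helper, not part of Consul.
snapshot_fields() {
    name=$(basename "$1")
    term=${name%%-*}      # text before the first dash
    rest=${name#*-}       # everything after the first dash
    index=${rest%%-*}     # text between the first and second dash
    millis=${rest#*-}     # text after the second dash
    echo "term=$term index=$index timestamp_ms=$millis"
}

snapshot_fields /var/lib/consul/raft/snapshots/2-146226-1481465408058
```

The helper only parses the name; it does not read or validate the snapshot contents.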
I'm exploring the new snapshots feature as a backup mechanism to mitigate accidental data loss.
I noticed that in the snapshot docs it says:
Restores involve a potentially dangerous low-level Raft operation that is not designed to handle server failures during a restore. This operation is primarily intended to be used when recovering from a disaster, restoring into a fresh cluster of Consul servers.
Can you add some clarification of what exactly that means? Conventional wisdom about database backups is that you should exercise them regularly. If we were to use the restore operation on our most recent snapshot weekly would we be at risk of data loss?
Probably unrelated to this issue, but is there some feasible mechanism that could be added to restore only certain keys? For example, imagine that 1000 keys were deleted 6 hours ago and many other keys were modified/updated since then. Would there be a way to restore only the 1000 deleted keys? Or do we need to keep our own separate dump of kv pairs and restore them through the normal /v1/kv API?
Edit: looks like once we upgrade we can use kv import and kv export to replace our JSON dumping process, but it uses the same /v1/kv API, so it will perform at the same speed.
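For the partial-restore case, the export and import commands can be scoped to a key prefix. The following is a minimal sketch, assuming `consul kv export` takes an optional prefix and `consul kv import` reads a file given as `@path`; the helper names, prefix, and file path are hypothetical placeholders. As far as I know, import writes every key in the file, so keys that still exist would be overwritten with the exported values.

```shell
# Hypothetical helpers; the prefix and file names in the example are placeholders.

# Dump every key under a prefix to a JSON file.
export_prefix() {
    prefix=$1
    out=$2
    consul kv export "$prefix" > "$out"
}

# Load a previously exported JSON file back through the KV API.
import_file() {
    consul kv import "@$1"
}

# Example (not run here):
#   export_prefix app/config/ /tmp/app-config.json
#   import_file /tmp/app-config.json
```

To bring back only deleted keys, the exported JSON would need to be filtered before import; that step is left out of this sketch.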
Can you add some clarification of what exactly that means? Conventional wisdom about database backups is that you should exercise them regularly. If we were to use the restore operation on our most recent snapshot weekly would we be at risk of data loss?
There's a little more detail in the comment here. The restore is implemented by having the leader take on the state of the snapshot and then bump the raft index which creates a "hole" in the Raft log, which causes the snapshot to go out to its followers. This means that the server commits the restore before replicating anything to its followers, which is weird from a Raft perspective, and could leave the cluster in an incorrect state if the leader were to die during that restore operation. If that happened you might have to blow away your server state and do the restore into a fresh cluster to recover. This should be a very unusual case to hit in practice (and the restore API returns success only once the followers have replicated the snapshot itself), but we wanted to fully disclose this possibility.
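Given that risk, one way to exercise backups regularly is to restore into a separate throwaway cluster, never the production one. This is a sketch of that idea, not a documented procedure: `restore_drill` is a hypothetical helper, and the address passed to it is assumed to be the HTTP address of a scratch Consul server.

```shell
# restore_drill is a hypothetical helper. The second argument must point at a
# throwaway cluster, never at production.
restore_drill() {
    snap=$1
    scratch_addr=$2
    if [ -z "$snap" ] || [ -z "$scratch_addr" ]; then
        echo "usage: restore_drill SNAPSHOT_FILE SCRATCH_HTTP_ADDR" >&2
        return 2
    fi
    # Restore into the scratch cluster, then list keys as a basic sanity check.
    consul snapshot restore -http-addr="$scratch_addr" "$snap" &&
        consul kv get -http-addr="$scratch_addr" -recurse
}

# Example (not run here):
#   restore_drill /tmp/backup.snap http://127.0.0.1:8500
```

If ACLs are enabled, a token would also be needed, as in the `-token=...` example earlier in the thread.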
Where is this outage recovery guide? The link mentioned in the groups returned a 404.
@richard-mauri https://www.consul.io/docs/guides/outage.html
The thing that's still missing from https://www.consul.io/docs/guides/outage.html is the simple restore of a snapshot. On our cluster we take and save regular snapshots; of course Consul takes them as well. Snapshots can easily be restored to build a cluster even from scratch.
Our disaster restore process is at https://github.com/drud/vault-consul-on-kube/blob/master/troubleshooting.md#complete-loss-and-rebuild-with-recovery-using-a-consul-snapshot
-Randy
Hi all!
I see it's been a while since the last comment, but I wanted to report that documentation on how to use the consul snapshot commands to recover from an outage can be found in the following locations:
https://learn.hashicorp.com/tutorials/consul/backup-and-restore
https://learn.hashicorp.com/tutorials/consul/recovery-outage?in=consul/datacenter-operations
I'll go ahead and close this, but do reach out if you have any other questions or comments!
We definitely need to mention snapshot/restore on here. Things to cover:
consul snapshot inspect can help you figure out which snapshot is better. https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/consul-tool/_TitQGHdRSA/18mZiFnJCQAJ
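As a rough illustration of comparing snapshots, the sketch below picks the file with the higher Raft index. It assumes `consul snapshot inspect` prints a line of the form `Index        12426`; the exact output may differ between Consul versions, and both helper names are hypothetical. "Better" is taken here to mean "more recent", which may not be the only criterion that matters.

```shell
# snapshot_index and newer_snapshot are hypothetical helpers. They assume
# `consul snapshot inspect` prints a line such as "Index        12426".
snapshot_index() {
    consul snapshot inspect "$1" | awk '$1 == "Index" { print $2; exit }'
}

# Print whichever of two snapshot files reports the higher Raft index.
newer_snapshot() {
    a=$(snapshot_index "$1")
    b=$(snapshot_index "$2")
    if [ "$a" -ge "$b" ]; then echo "$1"; else echo "$2"; fi
}

# Example (not run here):
#   newer_snapshot /backups/monday.snap /backups/tuesday.snap
```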