hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

documentation for backup and restore of Vault #5683

Open weakcamel opened 5 years ago

weakcamel commented 5 years ago

Is your feature request related to a problem? Please describe.

A very common task for any sysadmin is to automatically back up the data of all applications. The same obviously applies to Vault (and since it's a secret management application, it's one of the critical assets). Unfortunately, the only documentation for Vault's maintenance I was able to find was https://www.vaultproject.io/docs/install/index.html - the installation guide.

Backup and restore docs are IMO an essential part of documentation.

Describe the solution you'd like

Ideally, I'd like to see an Administration (or Maintenance) section on https://www.vaultproject.io/docs/install/index.html which would include a manual on how to (a) install (b) back up (c) restore data from backup. It should also mention which files/directories and other data should be preserved to be able to successfully re-install Vault while preserving the data.

For example of such documentation, see https://docs.gitlab.com/omnibus/README.html or https://www.jfrog.com/confluence/display/RTF/Managing+Backups

Describe alternatives you've considered

I've read through the docs, searched and decided to use mailing list: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/vault-tool/GDhj-KVqtHk/87iY0QwbDAAJ

It did the trick - I received very helpful answers, which I believe belong in the actual product documentation.

Explain any additional use-cases

I hope this issue is self-explanatory. Feel free to tell me to clarify if it's not.

Additional context

n/a

evilhamsterman commented 3 years ago

@Bessonov that is consistent with what I said. It is possible to back up the database with filesystem snapshots, but it is not recommended without quiescing. Recovery can be very long on a large DB if you have an improper shutdown, which is essentially what happens in a snapshot, like it says. The CHECKPOINT on Postgres basically flushes the logs; it is not a full quiesce, but it is close enough that it reduces recovery time. I certainly have backed up small DBs by just taking snapshots and eating the recovery time, and it is better than nothing, but it is not great.

@zerkms as Bessonov said, that just guarantees transactional consistency within the database; that plays into storage state but is NOT the same. If your DB goes down unexpectedly you will still have to replay your logs, and as said above that can be very time consuming and may fail.

pcolmer commented 3 years ago

When I set up our Vault 3-node installation, I opted to use Consul as the backend store and wrote a script that would, on every hour, take a Consul snapshot. Please note that I am using the open source versions of the products.
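A minimal sketch of such an hourly snapshot job, assuming the `consul` CLI is on the PATH and can reach the cluster. The function name, backup directory and file naming scheme are hypothetical and are not taken from the script described above.

```shell
#!/bin/sh
# Hypothetical helper: save a timestamped Consul snapshot into a backup directory.
# The default directory and the file name pattern are assumptions.
consul_snapshot() {
    backup_dir="${1:-/var/backups/consul}"
    mkdir -p "$backup_dir" || return 1
    snapshot_file="$backup_dir/consul-$(date -u +%Y%m%dT%H%M%SZ).snap"
    # `consul snapshot save` writes a point-in-time snapshot of the Consul state.
    # Its own messages go to stderr so that stdout carries only the file path.
    consul snapshot save "$snapshot_file" >&2 || return 1
    echo "$snapshot_file"
}

# Example crontab entry (assumed script path) to run at the top of every hour:
# 0 * * * * /usr/local/bin/consul-snapshot.sh
```

Calling `consul_snapshot /srv/backups` prints the path of the new snapshot file. If ACLs are enabled, the CLI also needs a token, for example via the `CONSUL_HTTP_TOKEN` environment variable.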

We've been having problems with Consul snapshots recently, so I want to migrate to Vault integrated storage. This would allow me to remove Consul and therefore simplify the architecture.

However, I couldn't find any documentation on how to backup/restore with integrated storage and came across this issue as a result.

I have found https://github.com/Lucretius/vault_raft_snapshot_agent, which looks useful but I have no idea how to restore from a snapshot that it generates.

I mention all of this because integrated storage is a backend that Hashicorp provides but I cannot find any documentation about how to backup & restore it. I haven't seen integrated storage mentioned in this issue so far.

Edited to add that the web UI offers the ability to download and restore a snapshot for the raft storage. Downloading seems to only work when you connect to the raft leader. I'd need to set up a test system to see if restoring works ...

candlerb commented 3 years ago

I have been trying to do the same thing, except restore to a different (and isolated) node. The problem is that the snapshot includes the raft peer IP addresses, so the restored node tries to rejoin the original cluster.

Details in forum: https://discuss.hashicorp.com/t/raft-how-to-restore-from-3-node-ha-cluster-to-1-node-dr-instance/20572

pcolmer commented 3 years ago

@candlerb oh dear, that is not good. Thanks for sharing that.

If the snapshot has to include the IP addresses then the restoration should optionally ignore them.

Preferably, though, it is looking increasingly like Vault needs a backend-agnostic method of exporting the Vault configuration and data in a way that can then be imported into a clean Vault setup ... particularly if it is a different node setup!

candlerb commented 3 years ago

That would certainly be nice. It could also supersede the vault operator migrate command.

Thinks: maybe migrate to filesystem and back again would do it. But this is something I would like to be able to run periodically from a script, and migrate requires shutting down the server :-(

candlerb commented 3 years ago

Somewhat amazingly, it works.

Firstly, you make a raft snapshot on the source (which doesn't require a shutdown), and then you restore the snapshot on the replica (which leaves raft in a broken state, as per forum post).
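The snapshot and restore step can be sketched roughly as follows. This assumes the `vault` CLI is installed and `VAULT_TOKEN` holds a token valid for the cluster being addressed. The function names are hypothetical, and the use of `-force` on the replica is an assumption (the comment above does not say which flags were used).

```shell
#!/bin/sh
# Hypothetical wrappers around `vault operator raft snapshot`.

# Take a Raft snapshot on the source cluster (no shutdown required).
# $1: address of the source cluster, $2: output file
raft_snapshot_save() {
    vault operator raft snapshot save -address="$1" "$2"
}

# Restore the snapshot on the replica.
# -force skips the check that the snapshot belongs to the target cluster
# (assumed to be needed here, since the replica is a separate cluster).
# $1: address of the replica, $2: snapshot file
raft_snapshot_restore() {
    vault operator raft snapshot restore -address="$1" -force "$2"
}
```

For example, `raft_snapshot_save https://vault-src.example:8200 /tmp/vault.snap` followed by `raft_snapshot_restore https://192.0.2.51:18200 /tmp/vault.snap`; the source hostname is a placeholder. After a forced restore, the replica has to be unsealed with the keys of the source cluster.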

Then shut down the replica, and migrate to filesystem:

# vault operator migrate -config out.hcl
storage_source "raft" {
  path = "/opt/vault-dev/data"
  node_id = "vault-dev1"
}

storage_destination "file" {
  path = "/opt/vault-dev/export"
}

Then delete and recreate the raft data directory, and migrate back from filesystem to raft:

# vault operator migrate -config in.hcl
storage_destination "raft" {
  path = "/opt/vault-dev/data"
  node_id = "vault-dev1"
}

storage_source "file" {
  path = "/opt/vault-dev/export"
}

cluster_addr = "https://192.0.2.51:18201"
api_addr = "https://192.0.2.51:18200"

chown -R the data directory, restart vault, and voilà! This is pretty horrible though... I would really hope there's something better than this.

candlerb commented 3 years ago

Another related issue is #10361 (Active Node Address stuck)

candlerb commented 3 years ago

I worked out how to reinitialize raft directly, by using recovery mode - notes here.

As far as I can see, just starting in recovery mode by itself is not enough. You have to generate a recovery token (which requires providing the unseal shares via the API), at which point raft sorts itself out. You don't actually need to use the recovery token itself.

yermulnik commented 3 years ago

Apologies for a possibly not directly related question, but I was not able to find an answer either in the Vault docs or in a Google search. What are the steps to restore from a snapshot on another cluster which uses its own AWS KMS key to unseal? The issue is that when restoring from a snapshot, Vault attempts to unseal using the AWS KMS key of the original cluster (the one that the snapshot was created from) and obviously fails, since this AWS KMS key is not accessible. Thanks

candlerb commented 3 years ago

I believe that's impossible, by design. The contents of the vault are encrypted with the unseal key. If you don't have the same unseal key available that it was originally encrypted with, then the data is unusable.

ArieLevs commented 3 years ago

To anyone using Raft Integrated Storage: backup and restore worked perfectly fine using snapshot restore.

I completely deleted the Vault cluster, with all PVCs and volumes, and the restore worked on a clean Vault cluster (although there were a few errors regarding some indexes not being restored; still investigating).

oliverjanik commented 3 years ago

Can we please have vault dump even if imperfect?

devops-42 commented 2 years ago

to anyone using raft Integrated Storage, backup and restore worked perfectly fine using snapshot restore.

I completely deleted the Vault cluster, with all PVCs and volumes, and the restore worked on a clean Vault cluster (although there were a few errors regarding some indexes not being restored; still investigating).

@ArieLevs Got no success with that command. I set up a fresh vault cluster and issued:

vault operator raft snapshot restore --force /path/to/snapshot.snap

After about 30 secs I got:

Error installing the snapshot: Error making API request.

URL: POST http://127.0.0.1:8200/v1/sys/storage/raft/snapshot-force
Code: 500. Errors:

* 1 error occurred:
        * failed to read snapshot file: failed to read or write snapshot data: read tcp 127.0.0.1:8200->127.0.0.1:58466: i/o timeout

The snapshot file size is over 1GB; maybe this could be a problem? I already tried to raise the timeout values by setting environment variables like VAULT_CLIENT_TIMEOUT, but Vault did not respect them.

Does anyone have an idea what this message means?

devops-42 commented 2 years ago

For those who are interested: I found out that the timeout comes from a default value, which can be changed in the listener section of the Vault configuration file by setting http_read_timeout to a sufficiently high value.
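As a rough illustration, the helper below prints a listener stanza with a raised `http_read_timeout`. The address is the one from the error output above, `tls_disable` is assumed because that URL uses plain http, and the `10m` default is an arbitrary example value rather than a recommendation. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print a tcp listener stanza with a longer read timeout.
# $1: value for http_read_timeout (a duration string); defaults to "10m" here.
print_listener_stanza() {
    read_timeout="${1:-10m}"
    cat <<EOF
listener "tcp" {
  address           = "127.0.0.1:8200"
  tls_disable       = true
  http_read_timeout = "${read_timeout}"
}
EOF
}
```

The printed stanza would replace the existing listener block in the server configuration file, and Vault has to be restarted for the change to take effect. Vault's documented default for this setting is 30 seconds, which matches the delay before the error above.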

heatherezell commented 2 years ago

Hi folks! There is a document on learn about backup and restore for Vault: https://learn.hashicorp.com/tutorials/vault/sop-backup?in=vault/standard-procedures

Is the ask here that there be a link to the tutorial on the docs site? Note that we are currently investigating ways of making the documentation easier to find and more streamlined. If there's something else that is needed, please feel free to let me know!

devops-42 commented 2 years ago

Hi @hsimon-hashicorp

the setting I mentioned above is due to the large file size of our vault.db file. I already commented it in your discussion board: https://discuss.hashicorp.com/t/file-size-of-vault-db/14494. Maybe you could look at it?

Thanks for your help!

aphorise commented 2 years ago

There are learn articles already outlining how back & restores can be achieved using snapshots (the same as most apps) - including what was mentioned earlier:

Simply put - if you have the correct (HCL) file with matching seal & snapshot - then it's a simple vault operator raft snapshot restore

In the case of other storage backends (Consul, etc) a similar restore would be applicable with the need for matching Vault config.

There's also Automated Integrated Storage Snapshots and other features in the Enterprise versions (DR / PR) which can achieve greater HA via a promotional model and not needing to resort to snapshot / restorations.

In any case I believe this request should be closed. @weakcamel since you were the original requestor - can you kindly confirm if available options now address what you were originally after?

PS - you also have standard orchestration / VM level options for backup / restorations too - which can also be a last resort - the same as other service deployments where a point-in-time restoration of a Vault cluster can very well boot up fine - assuming the state when it was captured was in good shape. The same thing is also applicable to snapshots too - you don't know if it's valid or usable backup unless you confirm it (SRE 101).

weakcamel commented 2 years ago

@aphorise Thank you for checking!

There are learn articles already outlining how back & restores can be achieved using snapshots (the same as most apps) - including what was mentioned earlier:

The procedures point to the same link for me, I'm guessing the first one was meant to be https://learn.hashicorp.com/tutorials/vault/sop-backup?in=vault/standard-procedures instead? Just double checking.


can you kindly confirm if available options now address what you were originally after?

Not completely, no. Two open issues and one suggestion:

(1) No general SOP

The procedures above are describing backups only in case of Consul or Raft backend. I appreciate those are the recommended ones, however "recommended" isn't the same as "the only supported".

At the moment, 23 storage backends are listed as available in the Vault documentation, so having a detailed SOP for just 2 of them but no suggestion at all about the remaining ones is IMO insufficient. After checking several of them, I didn't find any note stating that Vault doesn't support backups or restore on any of them.

While the exact details of how to operate each backend may be out of scope for a Vault Backup/Restore SOP, it would be very useful to have general steps outlined in one way or another.

For example:

General backup procedure

For a single instance:

  1. Stop Vault instance
  2. Take a copy of the data from your backend, e.g. copy of directory X in case of Filesystem, a DB backup in case of Postgresql etc
  3. Start Vault again
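For the Filesystem backend, the three steps above could look roughly like the sketch below. It assumes Vault runs as a systemd unit (named `vault` by default here) and that the first argument is the directory configured as the storage path. The function name and all paths are hypothetical.

```shell
#!/bin/sh
# Hypothetical cold backup for a single Vault instance on the Filesystem backend.
# $1: Vault data directory, $2: archive to create, $3: systemd unit (default: vault)
cold_backup() {
    data_dir="$1"
    archive="$2"
    service="${3:-vault}"

    # 1. Stop the Vault instance so that nothing writes during the copy.
    systemctl stop "$service" || return 1

    # 2. Copy the backend data into a compressed archive.
    tar -czf "$archive" -C "$data_dir" .
    status=$?

    # 3. Start Vault again, even if the copy failed.
    systemctl start "$service" || return 1
    return "$status"
}
```

For example: `cold_backup /opt/vault/data /var/backups/vault-data.tar.gz`. After the restart Vault comes up sealed unless auto-unseal is configured, so an unseal step has to follow.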

In case of Vault cluster, you need to ..

Which leads to the second issue below...

(2) Hot vs cold backups

This discussion on Vault mailing list created a big controversy which so far hasn't been addressed yet.

Does Vault instance or cluster indeed need to be stopped to achieve a reliable backup or is that not necessary? If for any of the backends it does need to be shut down, a general SOP (1) would definitely need to mention that. Actually, if it is - sounds like even the 2 suggested backends (Raft/Consul) should mention that.

(3) Cross-references - being able to find the Backup/Restore docs

The tutorials for backup/restore linked above aren't easy to find. If you go to https://hashicorp.com, navigate to Vault and select Documentation - you won't be able to reach them. Search box on this page doesn't return any hits.

Even if you choose Tutorials (which isn't an obvious step) and go through the procedures available in the left side menu, there's nothing there pointing to Backup & Restore SOPs, which IMO are essential for any admin of any type of application. The only way I was able to find them is to put "backup" in the search box, and the results are somewhere further down the list. The first hit is the Backup Consul Data and State doc, which isn't quite the same?

It would be great if those documentation sources - especially https://www.vaultproject.io/docs - cross-referenced the backup and restore documentation so that they're easy to navigate to.

aphorise commented 2 years ago

Some really great points @weakcamel.

23 storage backends....

I believe the bigger issue is coming up with a proper one-size-fits-all in terms of SOPs; there will likely be more SOPs on that Learn guide.

But I think only Raft or Consul are officially supported, as they deliver higher HA unlike others; the rest are all community driven. I believe, if presented, there would be reception to any additional material (docs or possibly even Learn guides) that could be listed for all others, if there are any suggestions.

This discussion on Vault mailing list created a big controversy which so far hasn't been addressed yet.

Does Vault instance or cluster indeed need to be stopped to achieve a reliable backup or is that not necessary? If for any of the backends it does need to be shut down, a general SOP (1) would definitely need to mention that. Actually, if it is - sounds like even the 2 suggested backends (Raft/Consul) should mention that.

In my opinion - no. Application-level snapshots (cold) are often taken at run time, and generally, as in the case of Consul, working with the lower-level file system would require locks to be released before anything could be done with the files. If I'm not mistaken, the snapshot approach in both Raft & Consul performs some sort of sanitised bundling that includes a sha256sum as well. To confirm reliability, a snapshot could be tested for proper boot / startup - similar to the testing common to most critical saves (DBs, etc).
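A hedged sketch of such a checksum check follows. It assumes the snapshot file is a gzip-compressed tar archive carrying a `SHA256SUMS` manifest in `sha256sum` format; that layout is an assumption to verify against the Vault or Consul version in use. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical check: unpack a snapshot and verify its SHA256SUMS manifest.
# Assumes a gzip-compressed tar archive that contains a SHA256SUMS file.
# $1: snapshot file
verify_snapshot_checksums() {
    snapshot="$1"
    workdir=$(mktemp -d) || return 1
    tar -xzf "$snapshot" -C "$workdir" || { rm -rf "$workdir"; return 1; }
    # sha256sum -c re-hashes each listed file and compares it with the manifest.
    ( cd "$workdir" && sha256sum -c SHA256SUMS >/dev/null )
    status=$?
    rm -rf "$workdir"
    return "$status"
}
```

A passing check only shows that the archive is internally consistent. It does not replace a test restore into a scratch cluster.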

BTW - in the case of using the file system - that's probably just for testing / demo purposes only (not production) - those backups would have to stop the Vault service before the copy.

(3) Cross-references - being able to find the Backup/Restore docs

The tutorials for backup/restore linked above aren't easy to find.

... It would be great if those documentation sources - especially https://www.vaultproject.io/docs - cross-referenced the backup and restore documentation so that they're easy to navigate to.

Totally agree 😃 - care to draft a mock or screen capture (dropped in here) with some arrows or a depiction of where you're looking? - maybe we can make the case to get it included, and I'm always for better clarity and more reinforced linking.

weakcamel commented 1 year ago

23 storage backends....

I believe the bigger issue is coming up with a proper one-size-fits-all in terms of SOPs; there will likely be more SOPs on that Learn guide.

IMO lack of crucial documentation is always worse than an oversized manual.

But I think only Raft or Consul are officially supported, as they deliver higher HA unlike others; the rest are all community driven.

The issue is, documentation doesn't phrase it like that at all. If you go to https://www.vaultproject.io/docs/configuration/storage, it reads:

For example, some backends support high availability while others provide a more robust backup and restoration process. For information about a specific backend, choose one from the navigation on the left.

In some backends you'll see the note about support for HA, in some you won't. And Consul + Raft aren't the only ones: Postgres and S3 also support HA. None of them however says anything more about backup & restore.

It's simply a bit vague. If something's not supported (e.g. because it wasn't written or adopted by Hashicorp) that's a shame but fair enough - as users, we just need a clear statement. "Only these 2/3 backends have official support for backup and restore" is very different from "Some backends are more robust than others".


In my opinion - No. Application level snapshot (cold) are often taken on run-time and generally like in the case of Consul lower level file system would require released locks to be able to do anything with them. If I'm not mistaken the snapshot approach in both Raft & Consul perform some sort of sanitised bundling that includes sha256sum as well. To confirm reliability snapshot could be tested for proper boot and / startup - similar to testing common to most critical saves (DBs, etc).

That's a very useful piece of information, and IMO it would be great to have it in the official docs.


BTW - in the case of using file system - that's probably just for testing / demo purposes only (not production) then those backups would have to stopped Vault service before copy.

That's definitely the case for in-memory backend, however filesystem has the following description:

The Filesystem storage backend stores Vault's data on the filesystem using a standard directory structure. It can be used for durable single server situations, or to develop locally where durability is not critical.

  • No High Availability – the Filesystem backend does not support high availability.
  • HashiCorp Supported – the Filesystem backend is officially supported by HashiCorp.

As an officially supported backend - if only for the simplest cases - it deserves clear backup & restore documentation.


Care to draft a mock or screen capture (dropped in here) with some arrows or depiction of where you're looking?

Ideally an "Admin guide", "Operator procedures" or similar could be added to include both backup, configuration, upgrade guides (maybe even installation?)

Screenshot 2022-09-09 at 09 37 26

A separate top-level "Backup and restore" page would be fine too.

weakcamel commented 1 year ago

P.S. And to address the sentiment of others who have commented on this issue, see just the top results from a quick search:

The number of third-party methods to back up Vault is a clear indication that there's demand for a clearer, portable and more open data export/import method.

aphorise commented 1 year ago

@weakcamel - let's say it were to go into its own new section as you're proposing - the content should really be a rehash of the SOPs, excluding any of these 3rd party tools. Don't get me wrong: these tools may look great, and you can always build your own - and will always need to in the case of all other storage types (excluding Consul or Raft). Maybe I'm understating it, but in essence I see this as a typical save / restore operation + make sure you've got the right config + unseal mechanism and / or recovery keys (typical for Vault).

If you have something a bit more refined or specific in mind and feel you can already draft a PR, then please do so.

weakcamel commented 1 year ago

@aphorise Sorry if it wasn't clear - the last post (open source backup/restore methods) was only a side note. Having a clear documented backup&restore is a separate issue from a particular mechanism.

That said, from a user or admin perspective a simple export/import would have been ideal. If Vault provided that out of the box, we wouldn't have to worry about specific instructions for specific storage backend and all the corner cases.

aphorise commented 1 year ago

@weakcamel - While I can appreciate the convenience or ease that you are after, I believe there are intentional design decisions and security factors behind why these things are separate that you may be overlooking.

Simply put - the whole design is to ensure that if storage is compromised (i.e. stolen or copied) then it will be of little use without access to the seal - access to the seal without the configuration is the same - access to the seal with the configuration but without recovery keys is the same, etc.

What's more, storage types are not likely to be interchangeable any time soon - and it's likely that may never be available, for the same reasons that you cannot convert PNG to JPG without some potential for loss or other conversion issues. However, there is already the migration approach in place where you can technically go from any back-end to another.

The request on this issue from the onset was for backup recovery literature which is now in place by way of the aforementioned learn guide / SOPs which was not previously there:

In the interest of closing this request before its 4th anniversary - my understanding at this stage is that what's pending is an explicit section for backup / restore elaboration within the general Documentation area (aka CLI docs).

Technically what was requested from the onset is already delivered - however just not in the format and expected documentation areas. If that's correct then I can try to do a draft in the coming days / weeks or as I mentioned earlier if you have impressions or opinions on how it could be then please do share a textual template (by way of a PR). The Vault teams are typically very receptive to substantiated contributions / bits and I'm sure if it's submitted well with clear benefits for all then it will be accepted.

weakcamel commented 1 year ago

In the interest of closing this request before its 4th anniversary - my understanding at this stage is that what's pending is an explicit section for backup / restore elaboration within the general Documentation area (aka CLI docs).

Agreed 100%. The import/export is a whole different debate and I appreciate that something that sounds simple may in reality be far from it.

Technically what was requested from the onset is already delivered - however just not in the format and expected documentation areas.

Mostly yes, except for backup & restore for the filesystem backend. Since it's also a backend officially supported by Hashicorp (even if for simplest deployments), it IMO also should have a backup/restore SOP - or a clear statement saying that backup/restore is not available on this backend.

For the existing SOPs, cross-referencing them in https://www.vaultproject.io/docs and adding to a doc index on https://learn.hashicorp.com/vault would do the trick.

aphorise commented 1 year ago

Well technically snapshot save / restore is only for Integrated Storage / Raft right? - with the only other supported store by them being Consul.

So actually even though they may provide input on the use of Filesystem - it's certainly not supported by them - especially as it's not intended for use in any production setting or environment and typically for nothing more than demonstration or other such exceptions (development). If I open an enterprise support ticket with them they will say the same.

If you also refer to the very same Filesystem that you mentioned - it's not a HA backend - vs others that are for example and already you can envisage two major sets both with further sub-category and considerations and so a complete guide would really be impossible especially for some of those community stores that the Vault engineers have very little opinion on. Rather if the community had an opinion for example on how Filesystem should be backed up / restored then we may be able to get it included but Hashi is not likely to do that as it falls outside their development or support focus.

Anyway thanks for confirming:

For the existing SOPs, cross-referencing them in https://www.vaultproject.io/docs and adding to a doc index on https://learn.hashicorp.com/vault would do the trick.

I hope to share something in the near future.

PS - I forgot to mention there's also a preference from Hashi to opt for Integrated Storage - so for example if you look at the Consul reference architecture, that also mentions using Raft / Integrated Storage if you don't have any strong reasons to use Consul 😄

weakcamel commented 1 year ago

So actually even though they may provide input on the use of Filesystem - it's certainly not supported by them - especially as it's not intended for use in any production setting or environment and typically for nothing more than demonstration or other such exceptions (development).

I'm just going to leave this link here and let the official documentation speak for itself:

Screenshot 2022-09-10 at 12 20 53

aphorise commented 1 year ago

Well durable is a bit of a subjective term there right?... if it's durable for you then why not but are file based web service durable like that in a single instance? 😃 (how many do you know?)

Also support is an even more loose term there... yes, code changes or improvements to filesystem they support! - but not going to production or expecting their support with millions of identities / secrets in any High Availability sense for any large enterprise setting. For some exceptional / single use, sure, why not use Filesystem, but just because it says it's supported there does not make it co-equal to all the others they support too.

BTW the same definitions apply to the In-Memory Storage Backend, which they support too, but does that mean they'd support you to go to production with it? - maybe - but again I'd bet it would be more of an exception.

What's more you'll only find two reference architectures; which is not possible with the other storage that are missing features / guarantees as I mentioned earlier.

weakcamel commented 1 year ago

@aphorise Both "durable" (able to perform for a long time without loss of data / quality) and "supported" (something the producer considers valid and well within customer rights to use) have a pretty clear meaning to me. If you find them unclear, I guess you could ask the authors of the documentation?


BTW the same definitions apply to In-Memory Storage Backend which they supported too but does that mean they'd support you to go to production with it?

There's another sentence in the source you're pointing to which clearly states:

Not Production Recommended – the In-Memory backend is not recommended for production installations as data does not persist beyond restarts.

Did you read the source you're referring to?

I'm afraid I don't follow anymore, this is turning into a debate over semantics. Any chance you could explain your role in this project and whether you're a Hashicorp employee or not? You keep stating facts as if you were, but at the same time keep referring to Hashicorp as "them".

aphorise commented 1 year ago

Semantics is what we are after here? 😃 or am I misunderstanding? - so it's better to be clear & I'm trying to get how this request will come to a close. I'm a consumer of Vault the same as yourself I guess?

I believe what you're asking for is a recovery / restoration reference to how backup & restores ought to be done right? - and I'm trying to help if you see it fit.

The statement It can be used... is true, but equally any other storage that does not give HA can be used. All I'm trying to share for anyone else reading is that Consul & Integrated Storage, which are HA, seem to be the way to go, where you can have more than 1 node perform snapshot save / restore as the SOPs are showing.

weakcamel commented 1 year ago

Semantics is what we are after here? 😃 or am I misunderstanding?

No, I'm after a missing piece of documentation for a supported backend. That's what I asked for here 4 years ago, and I still am.

SayakMukhopadhyay commented 1 year ago

@aphorise I have been following this issue and the endless back and forth for 4 years too. Can you please stop trying to derail the main ask of this issue and responding condescendingly?

Any chance you could explain your role in this project and whether you're a Hashicorp employee or not? You keep stating facts as if you were, but at the same time keep referring to Hashicorp as "them".

At this point I too am curious about your role in Hashicorp. We users need to know if this response is coming from within the company or from an external contributor, because that would help us understand the role Hashicorp is playing in deliberately not allowing users to use a supported but 3rd party storage provider in production.

aphorise commented 1 year ago

Asking for clarity and offering help is not derailing - neither is it condescending to share opinions as I have been doing so far.

The original ask of this issue was rather generic and simply states:

... It should also mention which files/directories and other data should be preserved to be able to successfully re-install Vault while preserving the data.

This would obviously depend on used store type which are numerous (20+?).

My participation here, or lack thereof, won't materialize what's being sought after if we don't discuss some of the items that have been done so far with the screenshots, which are a step closer, right?

@weakcamel if you are saying:

No, I'm after a missing piece of documentation for a supported backend. That's what I've been asking here 4 years ago and still am.

If this is with reference to Filesystem - then I personally am not sure what that guide would say? - stop Vault - copy files - then start Vault again? - other stores are similar - where you copy the contents of your stores /path and the config will help perform a low-level backup / restore.

I'm not contesting that a guide should be there but it would help to draft something around what's expected especially if there are so many interests and people in the community commonly using Filesystem. Maybe the title of issue could also be adjusted to read the same ... filesystem...?

heatherezell commented 1 year ago

Hi, folks. I'm going to lock commenting down for now while we in the Community team look into this. Thanks for your patience.