hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

documentation for backup and restore of Vault #5683

Open weakcamel opened 5 years ago

weakcamel commented 5 years ago

Is your feature request related to a problem? Please describe.

A very common task for any sysadmin is to automatically back up the data of all applications. The same thing obviously applies to Vault (and since it's a secret management application, it's one of the critical assets). Unfortunately, the only documentation for Vault's maintenance I was able to find was https://www.vaultproject.io/docs/install/index.html - the installation guide.

Backup and restore docs are IMO an essential part of documentation.

Describe the solution you'd like

Ideally, I'd like to see an Administration (or Maintenance) section on https://www.vaultproject.io/docs/install/index.html which would include a manual on how to (a) install, (b) back up, and (c) restore data from backup. It should also mention which files/directories and other data should be preserved to be able to successfully re-install Vault while preserving the data.

For examples of such documentation, see https://docs.gitlab.com/omnibus/README.html or https://www.jfrog.com/confluence/display/RTF/Managing+Backups

Describe alternatives you've considered

I've read through the docs, searched, and decided to use the mailing list: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/vault-tool/GDhj-KVqtHk/87iY0QwbDAAJ

It did the trick: I received very helpful answers, which I believe belong in the actual product documentation.

Explain any additional use-cases

I hope this issue is self-explanatory. Feel free to ask me to clarify if it's not.

Additional context

n/a

zeagord commented 5 years ago

+1 from me. It would be helpful if there were some recommendations, success stories, etc. around this.

antcs commented 5 years ago

CoreOS has a short doc on backing up Vault to an AWS S3 bucket: https://coreos.com/tectonic/docs/latest/vault-operator/user/recovery.html

siepkes commented 4 years ago

@antcs I think even the CoreOS authors may have gotten it wrong.

As stated in #7191, even if you can make an atomic snapshot of the backend, Vault itself doesn't make its changes in an atomic way in its backend. This means there is no way you can guarantee your backup is in a consistent (and therefore usable) state if Vault is running. The only way you can currently get a consistent snapshot of Vault's data is to stop Vault, back up the backend, and start Vault again.

thiloplanz commented 4 years ago

Vault itself doesn't make its changes in an atomic way in its backend. This means there is no way you can guarantee your backup is in a consistent (and therefore usable) state if Vault is running. The only way you can currently get a consistent snapshot of Vault's data is to stop Vault

Backups aside, if Vault does not make transactional writes with any backend, and also does not know how to recover from an atomic point-in-time storage-level snapshot of these potentially logically incomplete writes (by applying redo/undo logs or such from the storage), does this not also mean that Vault cannot reliably recover from an abrupt instance failure in between two writes?

Please tell me that is not the case ... @siepkes

weakcamel commented 4 years ago

@thiloplanz I'm quite sure that's the case (in the worst case). That's also why I freaked out reading the original response on the Vault mailing list.

siepkes commented 4 years ago

@thiloplanz Yeah, that thought occurred to me too. I'm no expert on Vault's low-level storage, so what follows is mostly my deduction and assumptions; I could be wrong.

On the mailing list, Chris Hoffman (HashiCorp employee and Vault committer) stated:

Since our storage layer is generic, we do not have a way to perform atomic transactions for multiple writes required for some operations. You could end up corrupting your data but it really just ends up that the behavior is undefined and there isn’t any guarantee here.

A quick glance at, for example, the PostgreSQL storage implementation shows that it exposes a fairly low-level, generic interface to the rest of the application. The rest of the application uses this interface to (sometimes) perform compound actions, for example calling the update function twice to carry out what is functionally a single operation. This is in contrast to a storage API which would expose high-level operations and wrap the two updates in a single transaction, or which would expose a transaction API in the storage abstraction itself so the caller can indicate what constitutes a compound operation.

So backend data can get corrupted during an abrupt failure like an application panic. The only thing that could then save you from a really bad day is if Vault is smart enough to recover (i.e. start normally with minimal data loss) from an inconsistent (i.e. corrupt) data backend. I can't really find anything that would point to such capabilities in the source (again, I could be wrong). If this were the case, it wouldn't be a problem to end up with an inconsistent backup, since Vault would still be able to recover from it and the backup advice would simply be: "back up the backend with the tools provided by the backend". But that's not the case.

thiloplanz commented 4 years ago

So backend data can get corrupted during an abrupt failure like an application panic. The only thing that could then save you [...] is if Vault is smart enough to recover (i.e. start normally with minimal data loss) [..]. I can't really find anything that would point to such capabilities in the source.

Meaning that regardless of choice of storage backend, a sudden power outage at an unfortunate point in time can leave Vault in an undefined state.

@chrishoffman Is this assessment correct?

siepkes commented 4 years ago

@chrishoffman I don't want to be pushy or sound alarmist (I realize you don't owe me anything), but I'm somewhat unsettled by the fact that I currently don't see how one can make a proper backup of Vault (i.e. a consistent dump while Vault is running). Automated shutdown and start of Vault seems like a rather risky operation to perform daily for backups. Could you give some feedback on this? I would love to hear it if I'm talking nonsense ;-).

pznamensky commented 4 years ago

Did anyone find a working solution for creating backups? I feel very uncomfortable without backups on production :slightly_smiling_face:

jefferai commented 4 years ago

@pznamensky Sure -- take atomic snapshots at the storage level.

Vault doesn't write everything transactionally because we can't rely on having that capability in storage, but instead we write the code such that a failure in the middle of a request can be tolerated. We do this in various ways, via how we order writes, using WALs, etc. We can always improve this, but the idea that Vault will be in some unworking undefined state if improperly shut down isn't the case, and thus atomic storage snapshots are also fine.
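For illustration, a minimal sketch of what "an atomic snapshot at the storage level" can look like, assuming (hypothetically) that the backend's data lives on a ZFS dataset named tank/vault:

# Create an atomic, point-in-time snapshot of the dataset:
zfs snapshot tank/vault@backup-2020-04-01
# Stream it off-host for safekeeping, then drop the local snapshot:
zfs send tank/vault@backup-2020-04-01 | gzip > /backups/vault-2020-04-01.zfs.gz
zfs destroy tank/vault@backup-2020-04-01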

siepkes commented 4 years ago

@jefferai Thanks for your answer!

So the definitive answer is that making an atomic snapshot of the backend is enough and Vault will work with that?

I'm double checking because what your HashiCorp co-conspirator :wink: @chrishoffman says on the mailing list seems to contradict what you're saying (emphasis mine):

Since our storage layer is generic, we do not have a way to perform atomic transactions for multiple writes required for some operations. You could end up corrupting your data but it really just ends up that the behavior is undefined and there isn’t any guarantee here.

vladimir-avinkin commented 4 years ago

I'm going to preface this post by saying that a) Vault is a very nice piece of software which solves a very hard problem, b) I do not want to sound entitled to a solution, just to bring attention to this important issue (as I understand it, at least), and c) I'm very thankful for the work provided on this project.

But to me the current backup situation seems extremely worrying, to the point that I'm afraid to run Vault in production environments.

Replies above state that Vault's behavior when restoring from a hard crash (kill -9 / power issues) is undefined even if the storage backend can provide consistency guarantees (such as Postgres or another DBMS). That would not be the end of the world if Vault could be consistently backed up, but again, replies above imply that backing up the storage backend cannot guarantee a valid Vault state, even if the backup is made atomically.

weakcamel commented 4 years ago

I absolutely second everything @mouzfun said, including the preface.

I'm thinking that, given the discrepancy between the comment above by @jefferai and the replies on the Google Groups thread, it would be best if this were simply and clearly documented in the official docs, to have the definitive answer... nudge, nudge, pretty please HashiCorp? :-)

Aldekein commented 4 years ago

Please add a backup/restore guide. I got here after I searched the documentation and didn't find a way to make a backup. It would be great if such a procedure were documented and battle-tested. Thank you!

tmolnar0831 commented 4 years ago

Is it possible to either get a statement from HashiCorp that the Open Source version of HashiCorp Vault cannot be backed up, or get official documentation on backing up its data in a safe way? I think this is a show-stopper issue for a lot of individuals and companies. Thank you in advance for your kind help!

ANPdjesrani commented 4 years ago

Even an authenticated "vault secret kv dump" and "restore" would help immensely, like we can do with Consul.
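Until something like that exists, the closest approximation seems to be scripting it ourselves. A rough, non-recursive sketch (the "secret" mount and the output directory are assumptions, and note the dump is plaintext, so handle it accordingly):

mkdir -p kv-backup
vault kv list -format=json secret/ | jq -r '.[]' | while read -r key; do
  case "$key" in */) continue ;; esac    # nested folders are skipped in this sketch
  vault kv get -format=json "secret/$key" > "kv-backup/$key.json"
done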

dr4Ke commented 4 years ago

For the moment, we're doing Vault backups by "migrating" data from Consul to a filesystem storage backend. "Vault's data is encrypted at rest", so we just make sure access to this backup is restricted. The data can then be restored and brought back up using the key fragments.

vault operator migrate -config vault-migrate-backup.hcl

with vault-migrate-backup.hcl containing:

storage_source "consul" {
  address = "127.0.0.1:8500"
  path    = "vault"
}

storage_destination "file" {
  path = "/tmp/vault-backup"
}

Hope this is not a bad idea.

weakcamel commented 4 years ago

Using vault operator migrate sounds very elegant.

It does sound like it may not be completely safe though (unless you shut Vault down while doing that):

https://www.vaultproject.io/docs/commands/operator/migrate

This is intended to be an offline operation to ensure data consistency, and Vault will not allow starting the server if a migration is in progress. ... Vault will need to be offline during the migration process. First, stop Vault. Then, run the migration on the server you wish to become the new Vault node.

tmolnar0831 commented 4 years ago

In the case of the Raft storage, the snapshot seems to be a reliable solution. Right?
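If I read the docs correctly, the workflow would be roughly the following (paths are just examples):

# Taken online, against the cluster leader, with a valid token:
vault operator raft snapshot save /backups/vault-raft.snap

# Restore later into a running cluster:
vault operator raft snapshot restore /backups/vault-raft.snap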

weakcamel commented 4 years ago

@michelvocks With all my appreciation for the great application that Vault is and for the work behind it, I believe this is not appropriately tagged as docs and enhancement.

Backup/restore is an essential feature of every product, and the lack of a clear way to achieve it is in my opinion a high-priority bug.

klemens-u commented 4 years ago

Could we get an official statement on this situation please? Thanks a lot. Vault is great in every other regard.

cliedelt commented 4 years ago

Yeah. We are planning to use Vault in production... TBH this is a complete show stopper...

lborupj commented 4 years ago

Wow - I was quite unpleasantly surprised by this. One question though: if I were to write a backup script that stops Vault, tars the data, and starts Vault again (and unseals it) - is it guaranteed that stopping Vault in a normal way (I'm running it as a Docker image) will do so in a secure way, meaning all writes are done before the process exits?
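Concretely, I'm thinking of something like this (container name and paths are assumptions on my side):

# docker stop sends SIGTERM and waits (10s by default) before SIGKILL,
# which should let Vault shut down cleanly; the file backend's data
# directory is assumed to be bind-mounted at /opt/vault/data on the host:
docker stop vault
tar -czf "/backups/vault-$(date +%F).tar.gz" -C /opt/vault data
docker start vault
# Vault comes back sealed; unseal with the required number of key shares:
vault operator unseal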

edoardo-c commented 4 years ago

For what it's worth, if you are using Consul as storage, you seem to be able to do a proper backup. I am new to this and have limited experience, but I was able to trash my namespace holding Consul and Vault, then restore Vault from backup.

in consul-master-0

# consul snapshot save vault-dev.snap
# consul snapshot inspect vault-dev.snap

Copy the snapshot for safekeeping. After trashing the environment and rebuilding it,

copy the snapshot back to the new consul-master-0,

log into consul-master-0

# consul snapshot inspect vault-dev.snap
# consul snapshot restore vault-dev.snap

Your Vault will be sealed. Unseal with the old unseal keys and voilà!

Hope this helps.

vladimir-avinkin commented 4 years ago

Making atomic storage backend backups is your best bet so far, yes. But the problem is that apparently (according to the Google Groups post mentioned above) Vault does not write its state atomically, even if the storage backend itself supports it.

duckie commented 4 years ago

@pznamensky Sure -- take atomic snapshots at the storage level.

Vault doesn't write everything transactionally because we can't rely on having that capability in storage, but instead we write the code such that a failure in the middle of a request can be tolerated. We do this in various ways, via how we order writes, using WALs, etc. We can always improve this, but the idea that Vault will be in some unworking undefined state if improperly shut down isn't the case, and thus atomic storage snapshots are also fine.

Like @mouzfun, I am grateful for the free software offered to the community by HashiCorp.

However

For running in production, one needs to know how the guarantees are implemented. "Various ways, using WALs (several?), etc." is too vague.

I also do not quite understand "we can't rely on having that capability in storage". I don't see it as a valid reason for not doing it when the backend does support it.

Without proper guarantees, this issue is a deal breaker for production use.

mazzy89 commented 4 years ago

The way we are tackling this, running Vault with Raft as the storage backend, is to run a service on each node which backs up the data snapshot. The service I wrote is inspired by the etcd-manager built within the scope of the kops project to back up etcd.
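Stripped of the operator machinery, it boils down to something like this on each node (schedule, paths, and auth are illustrative):

#!/bin/sh
# Hourly via cron: 0 * * * * /usr/local/bin/vault-snapshot.sh
set -eu
export VAULT_ADDR="https://127.0.0.1:8200"
# Assumes a token with permission to take snapshots is present in VAULT_TOKEN.
vault operator raft snapshot save "/var/backups/vault-$(date +%Y%m%d%H).snap"
# Keep a week of snapshots:
find /var/backups -name 'vault-*.snap' -mtime +7 -delete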

bbros-dev commented 4 years ago

Not sure why HashiCorp hasn't pointed to this...

The following HashiCorp Support article details Migration of Vault Data Stored in Consul.

This article provides some detail and starting points related to migration of Vault data stored in a Consul cluster for the purposes of informing your own Vault backup/restore and data migration strategies when using Consul as your Vault storage backend.

@weakcamel can this issue be closed?

darkpixel commented 4 years ago

@bbros-dev not everyone uses a consul cluster.

bbros-dev commented 4 years ago

@darkpixel, true. There are, at this point in time, 22 backends.

Is the expectation this issue should be considered addressed when all 22 backends have such documentation?

For my 2c: the Consul backend is a core OSS component which HashiCorp gave us, so it is great that this backend's use case has been documented - giving the 'starting points' for other backend users to consider. The filesystem backend use case is trivial, since it is a single-server (dev/play) scenario. Or, as @jefferai suggested:

... take atomic snapshots at the storage level.

The CoreOS document illustrates @jefferai's subtle point that just what is required to back up and restore data really depends on the backend you use.
Again my 2c: I agree; the state of our application/secrets at any point in time is outside of Vault's view and is our responsibility. Likewise, the state of the storage backend is also beyond Vault's knowledge/control (e.g. does your HDD cache writes, etc.). IF Vault restricted your backend choice, AND was closed source, I could understand some of the objections. Your backend selection process will have (?) addressed what the backend's backup/restore processes are. For example, @mazzy89 rolled his own.

P.S

I don't understand @duckie's objection and all the upvotes:

... For running in production, one needs to know how the guarantees are implemented. "Various ways, using WALs (several?), etc." is too vague.

Which is why you chose an open source component - you know 'exactly' how everything on the Vault side is implemented: You have the source code.

No?

siepkes commented 4 years ago

@bbros-dev I think the overall feeling is that making a consistent dump should be possible on the application level and not be dependent on the storage backend.

bbros-dev commented 4 years ago

@siepkes, understood.
What is a 'consistent dump' depends on how you have written your application, how you have configured Vault, and what backend you selected. Note the premise in 'consistent dump' - your application has a consistent state - many apps don't implement that constraint. If Vault provides a 'consistent' state it has by definition forced a constraint on apps using Vault. Rather, consider that either Vault is working, or not, for a particular operation.

I'm not saying there was no need for this issue.

I also am not saying that for the backend someone selects and the way they wrote their application (or most likely inherited a legacy application) that a consistent dump is not sensible.

However, getting a "working (at some point in time, etc. etc.) copy" of Vault data is possible (not ideal, I acknowledge that too) for any backend: shut down Vault, check it didn't exit with an error, record the state of your backend, restart Vault, check it didn't start with an error. @jefferai has already provided the answer for filesystem backends - which, from memory, the documentation indicates are only appropriate for single-server use cases.
It does sound like people might be expecting Vault to implement the functionality in, for example, btrfs, ZFS, etc.

I am suggesting the documentation that has been provided could be considered to close this issue - it points out the traps-for-young-players (e.g. the dynamic secrets, user state, HA etc etc.).

I did ask if the expectation is that all backends be documented - that was a genuine question.

bbros-dev commented 4 years ago

@weakcamel I agree your question to the mailing list was answered sufficiently, and the answer needs to be added to the documentation. Specifically, Chris Hoffman's response is worth repeating, and it does not warrant any of the "Vault is not ready for production" alarmism above:

Backups are always based on the storage backend you are using and do require Vault to be offline to ensure consistency.

However, the documentation I linked to addresses the Consul backend, which is the recommended HashiCorp Vault deployment practice. So perhaps a more natural home for the linked support article content might be as an expanded Corruption or Sabotage Disaster Recovery section in the Reference Architecture?

I didn't want to appear to suggest the document I linked to should be in place of your suggested location - which I think is the natural place for Chris Hoffman's statement and could link to the Consul specifics as the recommended practice. Possibly with the CoreOS/etcd page cited above linked to as another example?

lborupj commented 4 years ago

How can a consistent dump be application specific? If what you say is true, then Vault can never know, regardless of the chosen backend, when it is in a consistent state - which IMHO is not true. Vault operations should be atomic in the sense that 'write single k/v' must be... and it is this that we 'alarmists' :-) are questioning: whether Vault supports this or not, so we can gracefully stop Vault, take a backup, and restart (or take a snapshot).

I agree that, if your application needs to put 2 or more values to Vault, there is no way to do this transactionally (to my knowledge) and this is something you need to handle yourself... which is why I would argue Vault is... a vault... and not some database-like product where you do a lot of writes.

Coming back to why the missing snapshot feature would have been so nice: to simplify deployment models, we could have a master/slave setup with one 'write' master and 'reader' slaves if we were just able to snapshot the file storage - maybe even a PV on Kubernetes, if Vault could be started in read-only mode.

Well, if the glove doesn't fit ... We are still using Vault but looking for alternatives

bbros-dev commented 4 years ago

How can a consistent dump be application specific?

Your application may not require that two clients issuing a read request must get the same data - so your backup requirement is only that the data be whole as of some point in time. What qualifies as 'whole' is specific to your application; Vault can never know that detail. Likewise, Vault can and should never know whether your HDD flushed to disk when your system crashed as your backup finished.

if Vault supports this, or not .. so we can gracefully stop Vault, take a backup, and restart

Sorry, I'm struggling to understand what is it in Chris Hoffman's statement that leaves this an open question. Sometimes I find confusion can be cleared if you provide a statement that would answer the question/issue to your satisfaction - then maybe we can see where any misunderstanding arises.

Coming back to why the missing snapshot feature

There is no missing snapshot feature in Vault. If you want snapshots, Vault allows you to choose a backend that provides snapshots - see (1) the Reference Architecture document, (2) the article I linked to, which provides you with snapshots (again, remember that Chris Hoffman's general guidance applies: shut down Vault before making any data backups), and (3) @jefferai's comment.

Snapshots can be a feature, or not, of the backend you chose to connect Vault to - or you can implement the 'snapshot' of the backend data yourself (see @mazzy89's comment). I agree the backup section of the Reference Architecture document was terse, and the support article I posted a link to provides added detail.

More generally the HashiCorp guys have been very clear about what you need to do to get a backup of Vault data that is in a consistent state:

Backups are always based on the storage backend you are using and do require Vault to be offline to ensure consistency.

Yes, snapshots are very useful - which is likely one reason why Consul is the recommended backend in the Reference Architecture. All the FUD about Vault not being production ready is disappointing.

We are still using Vault but looking for alternatives

No argument here. There are better tools for some use cases. However, IMO, the lack of production ready data backups would not be a reason to choose another tool - contrary to the FUD peddled above.

vladimir-avinkin commented 4 years ago

@bbros-dev I'm sorry, I have completely missed all of the points you tried to present in the messages above. You linked an article which suggests taking an atomic snapshot of the Consul cluster to back up your Vault instance. At the very least, this article does not make clear that those backups are actually not consistent, since as we know you can't guarantee a consistent state without stopping Vault first (or maybe we don't know that, since all we have is a message on an email thread - which is the point of this issue in the first place).

I would bet that that article tricked a lot of people into thinking that taking a consistent snapshot of their storage would guarantee a consistent Vault state, when it actually would not.

I'll try to sum up the whole issue:

Currently, there is no known generic and concise documentation describing backup and restore procedures. Moreover, Vault apparently uses storage backends which can guarantee data consistency in such a way that it still can't guarantee a consistent Vault state with each write. Without explicitly stating that last point in the documentation, a lot of people will be tricked into believing that taking atomic storage-level backups guarantees a valid Vault state when restoring, which is not the case.

weakcamel commented 4 years ago

To re-iterate what I had in mind raising this ticket: IMO Vault - as great a product as it is now - is currently missing clear and definitive information on how to take a backup of its state and restore it in a "guaranteed" way.

Obviously a "hot" and backend-independent backup would be ideal and preferred; an implementation of #7191 would be more than welcome as well. That said, those are just items on my own personal wishlist and are out of scope for this issue.

What I would hope to see would be a "backup and restore manual" reading something like this hypothetical:

To take a backup of Vault data, you need to:

  • shut down the application (all of the nodes if running as a cluster) - if that's what it really takes
  • take a copy of the backend data per their own backup manuals (e.g. copy the S3 bucket, take a database backup, ...)
  • in case of filesystem backend (as that's Hashicorp's own implementation): these are the directories you need to preserve and restore
  • start Vault again

To restore Vault's data to a previous point in time, you need to: ...

Re @bbros-dev

@weakcamel can this issue be closed?

I don't believe it should be closed. It's IMO as valid as on the day I raised it :-) What we as Vault users need/would like to see is clear, official documentation by HashiCorp that covers this essential operational procedure. I double checked right now and unless I missed it, I didn't find a relevant section here: https://www.vaultproject.io/docs.

A discussion thread on a GitHub issue between users - no matter how interesting and stimulating - is not binding in any way and doesn't count as a documentation source.

A document describing migration of Consul data is also not sufficient to cover Vault backups, especially as it doesn't address the statement from this comment above - it actually suggests that Vault does not need to be shut down to take a consistent backup of its data (so it contradicts @jefferai's advice).

bbros-dev commented 4 years ago

Thanks for responding - this sheds light on what should be added to the documentation. I'm considering a documentation PR, but I'm struggling to identify much of substance (apart from Chris Hoffman's succinct statement) that is missing - right now it looks like an enhancement of where information is located, and maybe improved wording and emphasis through repetition.

To take a backup of Vault data, you need to: <snip>

That is useful, thanks for sharing it - maybe add it as an update to the OP, stating what you expected.

Obviously a "hot" and backend-independent backup would be ideal and preferred;

No, that is not obvious. Sorry, but that would be you imposing your preferences for architecting an application onto others. As you say it is out of scope of this issue, but please mention this issue if you open an issue requesting that feature. Having said that... HashiCorp do seem to allow you to get 'live' snapshots [working, but (obviously) possibly inconsistent between Vault instances] via the Consul backend. This has been documented in the Vault Reference Architecture and Consul documentation.

an implementation of #7191 would be more than welcome as well

Can you elaborate on why? The filesystem backend is only for single-instance use cases. So, specifically, what functionality do you envision that Vault would add that ZFS, BTRFS, and LVM don't already provide?

Vault - as great product as it is now - is currently missing a clear and definitive information on how to take a backup of its state and restore it in a "guaranteed" way.

This has me puzzled.
Vault can never offer that 'guarantee' feature so long as it has the feature of allowing you to choose the backend (currently 22) you store your data on.

"is currently missing a clear and definitive information": Missing or poorly located? (again except Chris Hoffman's statement).

I double checked right now and unless I missed it, I didn't find a relevant section here: https://www.vaultproject.io/docs.

Yes, but did you read the support article I posted and the Reference Architecture links? If those documents are missing something (from your point of view), could you enumerate it?

I don't believe it should be closed.

No worries.

It's IMO as valid as at the day I raised it :-)

That's partially informative.
More helpful would be to itemize what is missing in the official HashiCorp Support document detailing Migration of Vault Data Stored in Consul, as well as in some of the other existing documentation (see below).

what we as Vault users need/would like to see is a clear, official documentation by Hashicorp that covers this essential operation procedure.

Yes, I understand that is very important to many users - which is why I'm staying engaged in this thread and have not dismissed it as trolling or responded with RTM.

Do you agree that Chris Hoffman's advice is clear enough, i.e. it would be suitable to add to the documentation?

All the evidence I see suggests HashiCorp understands the importance of such 'official' guidance: there has been substantial effort put into creating the 'official' Reference Architecture documentation.

This did, and still does, address this topic - albeit very tersely - in the section Corruption or Sabotage Disaster Recovery.

In their defense, that section does link to the Consul snapshots documentation, which is more detailed. This is reasonable: Consul is the backend that HashiCorp have some influence over, and it is the backend in the Reference Architecture.

I hope you can see HashiCorp's problem here. Before reading on, please review the CoreOS document that describes backing up and restoring using the etcd backend.

You should now appreciate that any backup/restore documentation is very specific to the backend you use. It does not naturally 'fit' in the general documentation but is specific to the backend you chose and how you have configured that backend. By moving the backup/restoration section out of the Reference Architecture and into the "General" documentation, they could be accused of unfairly promoting the HashiCorp product suite, and would appear to be promoting one set of architectural choices (albeit the Reference Architecture, but that isn't the only one that is sensible).

I double checked right now and unless I missed it, I didn't find a relevant section here: https://www.vaultproject.io/docs.

Yes, but did you review the other documents I provided?
I also get that you and many others expected the backup/restore instructions to be in the "General" documentation section - however, it does not naturally fit there, since it is specific to the backend and how you have configured the backend.

Sooooo, I'm now wondering if we are not just talking mostly about where information is located? It would help to clarify, from your POV, just where and how the documents I've linked to are deficient in terms of specifics.

A document describing migration of Consul data is also not sufficient to cover Vault backups,

Hang on, by the nature of the Vault architecture any discussion of Vault data backup/restore will have to be in the context of some backend and how that Vault+backend have been configured.

If Consul is 'not sufficient' please say what backend would be?

Or, back to my earlier question, is the view that all 22 backend procedures need to be documented by the community?

Have you read the CoreOS document?
I ask because clearly these are not trivial exercises for many backends (except in the Reference Architecture, where it is trivial). I also ask because it will make very clear that any discussion of such procedures is heavily intertwined with the storage backend and how you have it configured.

especially as it doesn't address the statement from this comment above - it actually suggests that Vault does not need to be shut down to take a consistent backup of its data (so it contradicts @jefferai's advice).

Sorry that is plain false and borders on casting an aspersion on the developers.
I don't know @jefferai from a bar of soap, but in his defense: nowhere does he say anything like what you claim:

@pznamensky Sure -- take atomic snapshots at the storage level. Vault doesn't write everything transactionally because we can't rely on having that capability in storage, but instead we write the code such that a failure in the middle of a request can be tolerated. We do this in various ways, via how we order writes, using WALs, etc. We can always improve this, but the idea that Vault will be in some unworking undefined state if improperly shut down isn't the case, and thus atomic storage snapshots are also fine.

There is no mention of consistency in that statement. The only mention of shutdown is in the context of providing an assurance that everything is done (no doubt PRs welcome) to prevent Vault being in an undefined state when improperly shut down.

Nowhere does he "actually suggest" you don't have to shut down to get consistency (the same data across all Vault instances). Both authors clearly state Vault does not write transactions. So in fact the opposite of what you claim is closer to the truth.

Specifically, (emphasis added):

We can always improve this, but the idea that Vault will be in some unworking undefined state if improperly shut down isn't the case, and thus atomic storage snapshots are also fine.

Hopefully you appreciate the logical connection of the "and thus": a working state does not have the same meaning as consistency of data between Vault instances. Consequently, by design, backend snapshots are fine and will give you Vault data from which to restore a working instance of Vault. If you want more than that, Chris told you what to do - and that advice does need to be in the docs.

When you have a reproducible example contradicting his statement, make a PR or at least open an issue giving the details; but until then, please stop spreading the falsehood that they have contradicted each other.

They certainly have confused you and some others. That is unfortunate. But I see no evidence they did so deliberately.
On the contrary, they have tolerated a lot of falsehoods and aspersions being cast on their work in the course of this issue. Many projects wouldn't have elevated this issue to an enhancement request, but would have closed it pointing to the existing Reference Architecture documentation as being complete - which it is, if imperfect.

weakcamel commented 4 years ago

@bbros-dev

That's partially informative. More helpful would be to itemize what is missing in the official HashiCorp Support document detailing Migration of Vault Data Stored in Consul, as well as in some of the other existing documentation (see below).

As others already mentioned - what's missing is information on how to perform a Vault data backup using any of the other 21 supported backends. This document describes one very specific scenario.

And I'm not saying it should list all the exact steps one by one - it should, however, explicitly state e.g. whether the application (Vault) should or should not be running.

Have you read the CoreOS document?

Yes. However, it's not hosted within the HashiCorp domain, hence it's not relevant to this ticket.

Sorry that is plain false and borders on casting an aspersion on the developers.

You misunderstood me.

The contradiction is in the following facts:

How can this not be a contradiction?

All I ask is that those two different sources of information become one - ideally in the form of one single, easily accessible document.


Side question - why are you being so defensive on behalf of HashiCorp? Nobody's questioning their abilities or intentions. This is a request for improving technical documentation, and IMO all such overinterpretations are on you.

weakcamel commented 4 years ago

Obviously a "hot" and backend-independent backup would be ideal and preferred;

No, that is not obvious. Sorry, but that would be you imposing your preferences for architecting an application onto others. As you say it is out of scope of this issue, but please mention this issue if you open an issue requesting that feature. Having said that... HashiCorp do seem to allow you to get 'live' snapshots [working, but (obviously) possibly inconsistent between Vault instances] via the Consul backend.

From a user / sysadmin perspective - yes, it's obviously preferred. I didn't say it's mandatory, but saying that there's no obvious benefit in being able to take backups of an application without having to stop it is just plain wrong.


how to take a backup of its state and restore it in a "guaranteed" way.

This has me puzzled. Vault can never offer that 'guarantee' feature so long as it has the feature of allowing you to choose the backend (currently 22) you store your data on.

So you're saying it's impossible to offer advice on consistent backup-restore while having multiple storage backends? I disagree. Others did it (JFrog Artifactory, for example; GitLab). And I'm sure HashiCorp can too, it just hasn't happened yet - hence this request. I don't understand the reasons for trying to ridicule the idea of data consistency; Vault is just an application, and like every stateful application, it's only reasonable to be able to consistently preserve its data in some way.

rfc1036 commented 4 years ago

Mentioning "ZFS, BTRFS and LVM" misses the point. File system snapshots, at file or block level, are only meaningful if what Vault has written to disk at that point is consistent. This is why it was suggested in #7191 that it would be useful to have an API to make Vault fully commit its current state to disk and pause further writes. This is why when people use LVM to do a mysql snapshot they have to flush and lock the database tables before actually creating the volume snapshot.

Also, arguing that this is useless because applications may write to Vault in a way which is not consistent is not very useful because other applications may operate in a consistent way (and I strongly suspect that most of them do), hence making a snapshot of the internal state of Vault a useful backup for many use cases.

Let's assume that applications write to Vault in a consistent way or that writes are paused by an application-specific mechanism: if at that point it is still not known if what Vault has written to the disk is self-consistent or not then it follows that a Vault instance using the file system backend cannot be backed up reliably. I will let the readers decide if an instance which cannot be backed up can be useful in a production environment or if it should be only used for testing.

I do not believe that it is reasonable to argue that the proper way to do backups is stop/start/unseal because in many scenarios unsealing Vault is an exceptional event: if I had to keep the unseal key online then I could as well use any generic database.

bbros-dev commented 3 years ago

@mouzfun:

At the very least, this article does not make clear that those backups are actually not consistent

Correct, it does not. Nor should it make that claim - because whether the Vault data is consistent or not between Vault instances when writing to a Consul backend will depend on how you have Consul configured. I agree they could have stated that the default Consul configuration is near strong consistency, but at some point people have to read documentation - the Reference Architecture is a good place to start. Obviously, anyone implementing Vault with the Consul backend (or any other) in production would know, and would have extensively tested, exactly how they have it configured.

Remember, Chris Hoffman's advice is an approach that applies to ALL backends - and as I have repeatedly acknowledged, this does need to be added to the general section of the docs (i.e. it is not backend specific). The Consul snapshots are data snapshots that are in a working state from Vault's point of view - per @jefferai's comment, and in the absence of any issue showing/claiming the opposite (as @jefferai also acknowledges, there is always room for improvement). Whether the data is 'complete' or in a working state from your application's point of view, Vault can never know.

since as we know you can't guarantee a consistent state without stopping Vault first

No.
That is not what Chris Hoffman or any other Vault developer (that I am aware of) has suggested.

Chris did provide one simple, general way to ensure consistent data from any Vault configuration (of which there are many) with any backend storage (of which there are many), whatever the backend storage configuration - of which there are many.

He did not state, suggest or hint that it is the only way.

Obviously, for some backends there may be a configuration that does provide a snapshot of data stored by Vault that is consistent across Vault instances without shutting down Vault. (Consul is one such backend that has this option - hence it is in the Reference Architecture. And while Consul's default configuration is nearly strongly consistent, that is not enough reason to claim the data will be consistent - so the article does not make that claim; likewise, it does not claim the data will be inconsistent with what is on another Vault instance.) Hopefully that is clear? Please say if not.

all we have is a message on an email thread

In fairness, you also have all the documentation for the backend you are using and how that backend can be configured. You have also had the Reference Architecture, so you have quite a lot - all of which takes some time to study and fully understand. Again, I agree this issue is genuine, but I think it is correctly tagged as an enhancement, and the placement of that information can be enhanced. I do not agree with @weakcamel's claim that this (mis?)placement makes it a critical, high-priority defect.

IMO, no one should take Vault into production after just skimming some documentation - it is novel software that you should have at least implemented and understood as set out in the Reference Architecture. Or, if you're in a rush and need Vault in production while you are at the crawl phase of your Vault journey, take up HashiCorp's Enterprise option. Or go through the process of learning to walk, jump, jog, and run with Vault.
All the information is there, but it takes time to work through. It is not as if they have left anyone hanging without a solution.

The Vault docs also repeatedly acknowledge that Vault adoption is a crawl, walk, run process. That does not mean Vault is not immediately ready for production. It does mean a team/organization is likely not immediately ready. That said, there are applications architected similarly to Vault, and people with that experience are likely to quickly implement Vault in production - once you've worked with these applications, this backup 'issue' is a non-issue and immediately obvious from the high-level (general) documentation. Without that experience it is far from obvious. Hence why I've persisted with this issue.

It does mean that no one without prior Vault experience (or experience with a similarly architected application) will immediately be able to run with Vault in production.

Having said that, Chris Hoffman's advice does suggest someone might be willing to take Vault into production when they are at the crawl stage, but that would depend on their circumstances and their risk appetite.

I would bet that that article tricked

I think that is an uncharitable characterization - I accept you may not have understood the context for not making the claim you wanted them to make. But for a short article (which is actually already quite long), it has to draw the line somewhere.

Bear in mind that Chris Hoffman's advice is much clearer and generally applicable, and does get you Vault data that is consistent across Vault instances running on any backend.

Currently, there is no known generic and concise documentation describing backup and restore procedures

Yes, in a narrow sense that is true - more broadly, it is only true if you ignore the Reference Architecture. And who would be crazy enough to take Vault into production without at least having implemented and studied that before taking their own implementation into production?

Please bear in mind that, IMO, no one authoring what you are asking for would be comfortable with it - you are asking for very simplistic guidance and assurances for a complex piece of critical infrastructure. Also bear in mind that there has always (as far as this issue is concerned - it predates it by at least 6 months) been the Reference Architecture's relevant section (which I have repeatedly linked to). That is complete, it correctly emphasizes the DR/HA configurations that should be used, and it refers you directly to the relevant Consul documentation in its discussion of backing up data.

HashiCorp are likely reasonably happy that, after implementing, configuring, and testing that reference architecture, someone would be reasonably confident taking their own Vault setup into production. Also, HashiCorp have always offered Enterprise support which, IMO, is what you need if you want to run Vault in production while you are at the crawl phase of your Vault journey.

apparently, Vault uses storage backends which can guarantee data consistency in such a way that it can't actually guarantee a consistent Vault state with each write

No, that is not true. @jefferai states the opposite. But time constraints mean I'll have to give up at this point.


@weakcamel:

How can this not be a contradiction?

If you review your two points, you'll see that to sustain your argument about Chris Hoffman and @jefferai contradicting each other, you cite something neither of them said (which also happens to be an article I provided to support the view that there was no contradiction in what those two authors were saying). Further, your point concedes that "advice by @jefferai does suggest this..." (shutdown to ensure consistency) "... and it's probably a sound advice".
There is no contradiction. Again, remember both acknowledge that Vault cannot write data transactionally because not all backends support that.

Also, please review my response to @mouzfun, since it addresses some misunderstanding about the article you refer to and could be relevant.

saying that there's no obvious benefit in being able to take backups of an application without having to stop it is just plain wrong

Please link to where I have said that, and I will retract/strikeout the claim. What I have said is:

I also am not saying that for the backend someone selects and the way they wrote their application (or most likely inherited a legacy application) that a consistent dump is not sensible.

I have also been at pains to emphasize that while Chris's advice is general, you can get consistent data backups from other backends without halting/sealing Vault - Consul is such a backend.

Furthermore, you state:

So you're saying it's impossible to offer advice on consistent backup-restore while having multiple storage backends?

I believe that is a mischaracterization of what I have said.
I have repeatedly acknowledged that Chris Hoffman's advice needs to go in the general section of the documentation. I also suggested that you add your expected text to the OP of this issue.

Others did it (JFrog Artifactory, for example; GitLab)

Yes, that is a common misconception, one I'm sure I've held at some point: if A is a server, then you must be able to make it behave like B, because B is also a server.

To decide if what you ask for is possible, consider if the corollary is true:

When you are implementing those configurations of GitLab and Artifactory, you will likely encounter a point where their devs respond "we can't do that and ensure consistency". Note that, in good faith, I am assuming your claims about the consistency of Artifactory and GitLab are correct. Also note that Vault does offer the consistency you seek, just that the details tend to be backend specific - that is, to repeat, with the exception of Chris Hoffman's advice.

You repeat this claim (my mistake, this was a composition error on my part - apologies): "So you're saying it's impossible to offer advice on consistent backup-restore while having multiple storage backends?"

I can only refer to what I have said, which is the opposite of what you claim. I have repeatedly said that Chris Hoffman's advice does need to be added to the documentation.


@rfc1036:

Mentioning "ZFS, BTRFS and LVM" misses the point.

Apologies if it did, but I believe it focused attention and elucidated the point at issue, which is what it was intended to do...

File system snapshots, at file or block level, are only meaningful if what Vault has written to disk at that point is consistent.

If I understand correctly, the filesystem backend is only to be used in single-Vault-server setups. So exactly what can the data be inconsistent with? There is only one instance of Vault running.

This is why it was suggested in #7191 that it would be useful to have an API to make Vault fully commit its current state to disk and pause further writes.

Vault does already have that API: it is the seal operation. I accept it sounds like many don't like its semantics, and may even seek to have them changed. (That does not mean Vault is not production ready.) And if you need to do this often, and if your unseal process is considered painful by the people involved (it often is, but not always; in some orgs something like it has always been part of the daily routine), Vault offers you the auto-unseal functionality to allow these pauses to happen frequently and not be so disruptive.

It could be renamed pause_all_writes, but really? Are you taking Vault into production without understanding that a sealed Vault cannot be written to? Are you taking Vault into production without having read the documentation (at least once), or at least having investigated your pain points? Googling "auto unseal HashiCorp Vault" should return something. No?

Remember, none of this Vault functionality addresses the critical issue - you have to make your app respect Vault's 'pauses', regardless of how, and for how long, you make it pause.
Vault is a critical piece of security infrastructure - it can't just arbitrarily buffer things for you across all instances while writing to some backend that has been configured such that it responds with glacial time frames. Imagine the outcry when your app's secrets are lost because some DoS meant a Vault buffer overflowed while you paused it to run backups. Again, Vault has pause/flush-and-sync-all-writes (from its POV) functionality: it is the seal action. The added benefit of this current API is that it forces responsibility for buffering secrets where it belongs - in the hands of the developer who is managing/writing the application.

This is why, when people use LVM to snapshot a MySQL database, they have to flush and lock the database tables before actually creating the volume snapshot

Understood. Vault allows you to do that too, if you need, with the database backends. Again, the Reference Architecture uses Consul, which can also be configured to provide consistent data across multiple Vault instances - that too, like the DB backends and all others, is not without trade-offs and caveats.

if at that point it is still not known if what Vault has written to the disk is self-consistent or not then it follows that a Vault instance using the file system backend cannot be backed up reliably

Again, @jefferai has explicitly refuted this allegation of Vault being broken - in the absence of evidence to the contrary. Again, everyone has been clear that Vault does not write transactionally. Obviously you will have seen all the documentation around the DR/HA configurations, which are, IMO, well documented and provide the production-level assurances needed.

Perhaps you can set out the specific scenario you have in mind?

I understand you likely want @jefferai's claim - that Vault isn't broken w.r.t. writing data that cannot be decrypted - also put in 'writing', but both Chris Hoffman and @jefferai have stated Vault cannot write transactions because not all backends support transactions. Obviously, if you want such assurances, then choose a backend with that feature and/or configure a DR/HA setup to do what you want - those are also documented.

I do not believe that it is reasonable to argue that the proper way to do backups is stop/start/unseal because in many scenarios unsealing Vault is an exceptional event: if I had to keep the unseal key online then I could as well use any generic database.

Well, it likely is more reasonable if you bear in mind a few things: 1) Vault's role is not as a database; rather, it is a critical piece of security infrastructure. Yes, both store data, but Vault is a piece of security infrastructure, not data infrastructure, and (perhaps unfortunately to some minds) that necessitates different workflows. 2) Vault does allow you to minimize the manual unseal inconvenience with the auto-unseal feature, and it provides several avenues for implementing that feature - if that is the trade-off you are willing to make. It does so, if I recall correctly, for the very reason you state. 3) Vault does allow you to choose a backend that supports different levels of data consistency - giving Vault the flexibility to fit into most app architectures.

SayakMukhopadhyay commented 3 years ago

For our case, we are just starting out, and looking at this discussion I think we will go ahead with Consul as the backend. In our case, we are planning to deploy the stack on Kubernetes. Does anyone have any experience backing up Vault+Consul on k8s? I am thinking of creating a CronJob that runs maybe hourly to take snapshots of Consul, along the lines of the sketch below.

Essentially, I want to have a DR plan with the OSS version, and the only way I think that can be set up easily is to back up and recover properly.
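The script inside the CronJob container would be something like this (the service name and mount path are guesses on my part):

#!/bin/sh
set -eu
# Assumes a Consul server reachable via the consul-server service and a
# persistent volume mounted at /backup:
export CONSUL_HTTP_ADDR="http://consul-server:8500"
SNAP="/backup/vault-$(date +%Y%m%d%H%M).snap"
consul snapshot save "$SNAP"
consul snapshot inspect "$SNAP"    # sanity-check the snapshot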

AlexOQ commented 3 years ago

@bbros-dev I believe you're fending off legitimate requests for clarity regarding Vault's functionality/intended design with ambiguous, unhelpful rhetoric, à la "What else do you want from me?" Let me give you an example.

At the very least, this article does not make clear that those backup are actually not consistent

Correct, it does not. Nor should it make that claim

Note the difference between the words in bold: "does not" vs. "should not".

Either "HashiCorp Vault assures that <insert_backup_strategy_here> retains data consistency", or "HashiCorp Vault cannot assure that any backup strategy will retain data consistency". Both of these are clear. Maybe incorrect, but clear. You can figure out the 'correct' part once you've mastered the 'clear'.

Why don't you consider adding the helpful and clear answers by @chrishoffman on the mailing list (from the OP) to the documentation? You seem to regard him as having some authority on Vault's functionality.

He did not state, suggest or hint that it is the only way.

How about making it clear in the Reference Architecture? Again, just write something along the lines of "This is not the only way to milk a cat", just so people using Vault will know they have a shot at trying different approaches.

Also, this comment is as useless as an underwater umbrella: "When mommy and daddy love each other very much, a new life is conceived." Please note that I've not stated other, mostly discordant, approaches to conception. Maybe hinted.

Seriously, people here really have an issue with the lack of clear documentation, preferably in one place instead of scattered among mailing lists/GitHub issues/docs. We're not even looking for a production ready/unfit statement. We're big boys; we can each decide if anything is fit for our own unique production system. Just an "it works like that, but we can not assure ...".

EDIT: Don't bother trying to extract single words from context, except 'clear'. You write a lot, but sadly in an unhelpful and defensive manner.

bbros-dev commented 3 years ago

@SayakMukhopadhyay you might also want to look at the internal ('Raft') storage as an option. In some sense it is simpler, as it does not involve Consul.

Bessonov commented 3 years ago

Disclaimer: I didn't read all the comments; it's just too much.

First of all, I'm very surprised that there is no solution (I mean a solution, not a workaround like down/backup/up) for making a consistent backup of what is probably the most critical infrastructure part besides the databases. Furthermore, I don't get why it should be a storage problem; from my point of view it doesn't make any sense. Is there any database which allows creating a consistent backup from the underlying storage while operating? Databases do it at the application level, e.g. pushing binary/incremental files or dumps (some blocking, some with time travel), because only the database can make assumptions about consistency. After all, Vault is able to work with the operational state - why not use that for backup, instead of the storage state?

Secondly, I don't get why it must be the same storage backend, which is what leads to the consistency problem after all. I like the idea from @dr4Ke, but it seems like it just copies data from one place to another. Why not make it possible to configure a (write-only) storage backend which is used only for backing up the operational state?

zerkms commented 3 years ago

Sorry for the off-topic, but:

Is there any database, which allows to create a consistent backup from underlying storage while operating?

Almost every mainstream RDBMS would do. It's often done using filesystem snapshots and works just fine.

evilhamsterman commented 3 years ago

@zerkms That is incorrect; most every mainstream RDBMS recommends that you quiesce the database somehow before you take a file system snapshot. Typically it is something like: quiesce the DB, snapshot the filesystem, unquiesce the DB, back up from the snapshot. Failing to do so may work most of the time but is not guaranteed.

MSSQL uses VSS for this; MySQL can flush and lock the DB for filesystem snapshots. From the MySQL docs: https://dev.mysql.com/doc/refman/8.0/en/backup-methods.html

Making Backups Using a File System Snapshot

If you are using a Veritas file system, you can make a backup like this:

  1. From a client program, execute FLUSH TABLES WITH READ LOCK.
  2. From another shell, execute mount vxfs snapshot.
  3. From the first client, execute UNLOCK TABLES.
  4. Copy files from the snapshot.
  5. Unmount the snapshot.

Similar snapshot capabilities may be available in other file systems, such as LVM or ZFS.

Many will also have other backup options, like mysqldump or pg_dump, but those do not involve filesystem snapshots.

Bessonov commented 3 years ago

@zerkms Oh, damn, you're right about snapshots. But to me it seems like that's because RDBMS devs put a lot of effort into the architecture of inconsistency resolution/replay of actions.

@evilhamsterman I've checked it for postgres:

An alternative file-system backup approach is to make a “consistent snapshot” of the data directory, if the file system supports that functionality (and you are willing to trust that it is implemented correctly). The typical procedure is to make a “frozen snapshot” of the volume containing the database, then copy the whole data directory (not just parts, see above) from the snapshot to a backup device, then release the frozen snapshot. This will work even while the database server is running. However, a backup created in this way saves the database files in a state as if the database server was not properly shut down; therefore, when you start the database server on the backed-up data, it will think the previous server instance crashed and will replay the WAL log. This is not a problem; just be aware of it (and be sure to include the WAL files in your backup). You can perform a CHECKPOINT before taking the snapshot to reduce recovery time.

EDIT: But this raises the question: how does Vault handle crashes if there's no consistent state in storage? Does it just break unrecoverably? :eyes:

EDIT: @zerkms ACID is a property of the operational state, not of the storage state.

zerkms commented 3 years ago

@evilhamsterman Given MySQL is ACID, just taking an atomic snapshot (without any other ceremony) guarantees the state to be consistent. It may not be the latest with millisecond accuracy, but it will be consistent. If you can think of a scenario where that's not the case, it's a bug and must be reported.