We had another report of this, but in that instance the person just tried it again and it worked fine, which doesn't really help us debug it. Two of us on the Vault team tried to reproduce this and couldn't. I know it sounds stupid, but can you try it again and see if it works? Otherwise I can try putting together a build for you that spits out a bunch of specific debugging.
Tried again multiple times:
A debugging build to help you find the reason sounds great! :)
@Luzifer Are you comfortable building on your own, if I put some code into a custom branch?
Yep, that would work for me.
Hi @Luzifer
I pushed up a branch: https://github.com/hashicorp/vault/tree/cubbyhole-upgrade-debug
Can you run that with log level trace and provide the full log as output? (If you don't want to post it publicly for some reason, email it to vault at <my company's domain>). Thanks!
@jefferai sorry for the delay but I've got a paste for you:
http://pastebin.com/raw/WWDmYRyC (expires in 1 month)
The cubbyhole-upgrade-debug branch (v0.7.0-1-g639004f) was built with make bin, and the resulting pkg/linux_amd64.zip was run with -log-level=trace. Otherwise the same config as for 0.6.5 was used.
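For anyone else trying to reproduce this, here is a minimal sketch of the build-and-run steps described above, assuming a working Go toolchain; the config path is a placeholder:

git clone https://github.com/hashicorp/vault.git
cd vault
git checkout cubbyhole-upgrade-debug
make bin                                   # produces zips under ./pkg, including pkg/linux_amd64.zip
unzip -o pkg/linux_amd64.zip -d /tmp/vault-debug
/tmp/vault-debug/vault server -config=/etc/vault/config.hcl -log-level=trace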
@Luzifer Just to check -- this is data from 0.6.5 (e.g. that you restored from a backup)? So, you took data that was working fine on 0.6.5, then performed this upgrade once?
Yep, that's from our staging system which was cleanly created on 0.6.5, then shut down, had a DynamoDB dump taken and was restored from that dump… So in the end it's an empty DynamoDB table with vault init applied and some data added using automated vault write to /secret and /aws… The /cubbyhole path was not touched by us…
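For context, a rough sketch of how such a dataset could be seeded; the concrete paths, keys, and values below are hypothetical (not the reporter's actual ones) and assume the vault was unsealed and authenticated after init:

vault init
vault write secret/staging/example value=demo
vault mount aws                            # hypothetical: enables the aws secret backend at /aws
vault write aws/config/root access_key=EXAMPLEKEY secret_key=EXAMPLESECRET region=eu-west-1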
Yeah, something is super duper weird. I'm looking through code in both 0.6.5 and 0.7.0 to see if I can figure out what's going on here, but so far no dice.
If I made the same changes to a 0.6.5 branch would you mind getting me logging from that?
We can do that if it helps! :)
It would for sure. The new one is called cubbyhole-upgrade-debug-0.6.5 -- very imaginative I know!
http://pastebin.com/raw/Adayj60u < aaand the paste is finally there…
OK, thanks -- that's actually very helpful.
@Luzifer As an update, so far I've not figured out the cause here. It's strange in that it goes from 1 entry in 0.6.5, then, on what you said was the first run of 0.7 from the fresh 0.6.5 dataset, up to 38 entries for Cubbyhole. But, in your 0.7 log that you sent, it doesn't then increase from the first time that it tries/fails to set up to the next time -- it's 38 both times. If it was 39 the second time, it would point to something that is causing an append of a new cubbyhole mount on each setup, but the log doesn't show that. So at the moment I'm at a bit of a loss as all I know is that 38 cubbyhole entries were supposedly found from the value persisted to disk.
I'm going to keep thinking on this. Is there any chance, if I can't figure anything out, that we could possibly do a screen share or some kind of real time debugging next week?
We can try that… Just two things to mention: next week I'll be in UTC+2, and I'm not a native English speaker… ;)
I'm hitting the same issue with Consul as the HA backend, though I'm just using 2 Vault-nodes for now. My setup is loosely based on this configuration of running HA Vault with Consul in Kubernetes.
I'm happy to assist in debugging, though I have not used Vault in production so far; this happened in a lab environment.
For now, I'll wait to see whether your screen sharing session yields any results.
Unfortunately it's unlikely screen sharing will happen this week, as I'm onboarding two new people and am on the west coast, which means there's currently a ten hour time difference between me and @Luzifer.
@dictvm @Luzifer I added some more debugging to both branches. It would definitely help to get both from both of you. Please ensure that you get the info from the first run of 0.7 against your 0.6.5 datasets.
I had the same issue upgrading from 0.6.5 -> 0.7.0 using the DynamoDB HA backend. I can't do much debugging at the moment because I just spun up a new Vault cluster instead :(.
Here's the trace output from the latest branch versions: https://pastebin.com/raw/2K9Lpmzd (expires in 1 month)
Testing code used:
- cubbyhole-upgrade-debug-0.6.5 (v0.6.5-2-g58a51d3)
- cubbyhole-upgrade-debug (v0.7.0-2-g0d832f4)
- make bin and the ./pkg/linux_amd64.zip artifact

Notes:
- Same dataset as before (0.6.5 vault init, DynamoDB copy)
- 0.6.5 release instance was in charge since then
- 0.6.5 release instance was replaced by ip-10-8-4-46 running 0.6.5 testing code
- ip-10-8-4-46 was replaced by ip-10-8-3-131 running 0.7.0 testing code

Hi all,
We think we know what's going on, why the first set of logs didn't show additional cubbyholes being added, and why in one reported case the problem fixed itself.
What's happening is that you are upgrading a single instance in an HA cluster to 0.7, and it's then losing leadership and causing a 0.6.5 instance to take over. You can see this in the log, directly after the first 0.7 startup:
Mar 29 06:55:52 ip-10-8-3-131 vault: 2017/03/29 06:55:51.672603 [WARN ] core: leadership lost, stopping active operation
(A separate question is why it's losing leadership so quickly.) At this point, leadership flips to a 0.6.5 instance; the 0.6.5 instance sees that no cubbyhole backend exists (because it doesn't know how to read the new location for the local mount table introduced in 0.7) and adds a cubbyhole backend. Then leadership flips back to the 0.7 instance for some reason, and the 0.7 instance reads two cubbyhole backends from persistent storage; when it tries to mount both it fails, causing it to flip back to 0.6.5, and this cycle simply repeats endlessly. If in the interim leadership doesn't flip back to a 0.6.5 instance but instead goes back to the 0.7 instance, you see the same number of cubbyhole mounts in the table, which is what we saw in your first log where the number didn't increase.
If you look at our documented upgrade procedure for HA installations (https://www.vaultproject.io/docs/guides/upgrading/index.html) you'll see that we suggest stopping standby instances, updating them to the latest version, then sealing/stopping the active instance so that one of the upgraded standbys takes over and the remaining older version does not gain leadership. This is specifically because we do not support downgrades, only upgrades.
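Paraphrased as commands, under the assumption of a systemd-managed cluster (unit names, binary paths, and the number of unseal key shares are placeholders), that documented order looks roughly like this:

# On each standby node, one at a time:
systemctl stop vault
install -m 0755 /tmp/vault-0.7.0/vault /usr/local/bin/vault
systemctl start vault
vault unseal                               # repeat until the unseal threshold is reached

# Only once the upgraded standbys are up and unsealed, on the old active node:
vault seal                                 # an upgraded standby takes over; do not let an old version regain leadership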
The good news is that you (@Luzifer) backed up before upgrading, so you should be fine if you upgrade according to the documented procedure. For anyone that chimed in that doesn't have a backup, there isn't much in the way of "things we can do to fix it" since from a Vault perspective it doesn't know which one is the "real" cubbyhole backend that may currently have data in it. It may suffice, if you've upgraded to 0.7 already but are having this problem, for you to delete the key from your storage backend that corresponds to the storage path core/local-mounts. You'd lose any other local mounts as well, but unless you're a VE customer using replication you probably don't have any of those. The data currently in cubbyhole will become inaccessible, but I think it'll "fix" the problem.
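As an illustration only, on a DynamoDB backend that removal might look like the following; the table name is a placeholder, and the assumption that the entry is stored with a Path attribute of "core" and a Key attribute of "local-mounts" should be verified (and the item backed up) before deleting anything:

aws dynamodb get-item --table-name vault-data --key '{"Path": {"S": "core"}, "Key": {"S": "local-mounts"}}'
aws dynamodb delete-item --table-name vault-data --key '{"Path": {"S": "core"}, "Key": {"S": "local-mounts"}}'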
Hmm, that's a really interesting point, which means we need to find another method of rotating our cluster. Currently we're relying on AWS autoscaling rolling updates, which can replace any machine. So it can first replace two standby instances and afterwards the master, which would lead to a successful migration, or it can mess up everything in case there is still an old standby and the master loses leadership…
It's definitely an important question why the 0.7 instance loses its leadership and regains it immediately (while somehow the data is messed up in the meantime by an old instance)…
Another question is whether the underlying data model will become more robust to this kind of change in the future. For example, if the dataset carried a version, an older Vault server could refuse to touch it when that version is too advanced for it…
Until this is handled automatically, we either need to apply manual updates to the cluster, which isn't our preferred way to work (we tend to cut out every manual interaction on the cluster itself), or we need to implement our own update mechanism alongside the AWS ASG rolling update, one that is aware of the state of the cluster nodes and applies the suggested policy of first stopping the standby instances and afterwards the master.
It would be great to get your input on the suggestion regarding versioning the dataset / data model before we invest time into building a cluster manager that works around the AWS ASG behaviour…
@jefferai @Luzifer has a point. I just ran into a similar issue, but on a dev setup, so I haven't worried much about it and simply rolled out new instances with the newer version. If it were a prod setup, though, this would be a major hurdle for us since we use ASG rolling updates as well. Any pointers on how to avoid such scenarios would be great.
To be perfectly honest, whenever anyone has asked how to run Vault on ASG, our answer has always been "don't". It requires some decent amount of know-how/alerting to ensure that you don't end up with multiple sealed Vaults (e.g. if Vault goes down, so ASG starts up a new one, but you don't realize that and it stays sealed, and this happens three times).
I don't know how ASG rolling updates can be controlled; internally at HC we simply use terraform applies to green/blue it -- scale down the standby nodes, scale up the new nodes, perform the switchover via seal, remove the final old node.
Any explanation of why this is occurring on a non-HA cluster?
Hi @broamski
Not off hand. How is your Vault configured, and what steps are you taking to upgrade? Would you be willing to share your operating logs, potentially using the same debugging branches @jefferai mentioned above?
@Luzifer I think the main problem at the moment is knowing when the data model version should be bumped, because everything is schema-less in the sense that each backend has its own data model and Vault's core and components have their own data models. Also, once you rely on this version being up-to-date, forgetting to bump it becomes a real potential problem. So up to now we've stayed away from it.
Just to close out this issue, we followed up with @broamski offline and his situation is resolved. No code changes necessary. We do plan on putting some versioning information into the data store.
Hi, I have the same issue. My configuration (pseudo HA):
1) two instances, both unsealed, transit mode
2) one remote MySQL backend
3) traffic goes to the first Vault instance (VIP by keepalived)
I upgraded standby instance - all was ok. But when I upgraded primary instance I got such error:
2017/04/26 08:05:23.607726 [INFO ] core: vault is unsealed
2017/04/26 08:05:23.608452 [INFO ] core: post-unseal setup starting
2017/04/26 08:05:23.609326 [INFO ] core: loaded wrapping token key
2017/04/26 08:05:23.611100 [INFO ] core: successfully mounted backend: type=generic path=secret/
2017/04/26 08:05:23.611346 [INFO ] core: successfully mounted backend: type=system path=sys/
2017/04/26 08:05:23.611494 [INFO ] core: successfully mounted backend: type=transit path=transit/
2017/04/26 08:05:23.611566 [INFO ] core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2017/04/26 08:05:23.611620 [ERROR] core: failed to mount entry: path=cubbyhole/ error=cannot mount under existing mount 'cubbyhole/'
2017/04/26 08:05:23.611672 [INFO ] core: pre-seal teardown starting
2017/04/26 08:05:23.611760 [INFO ] core: pre-seal teardown complete
2017/04/26 08:05:23.611796 [ERROR] core: post-unseal setup failed: error=failed to setup mount table
Hi @jefferai ,
We are getting the same issue. We have a single Vault instance with a Consul backend and are attempting an upgrade from 0.6.4 to 0.7.2.
We have attempted the upgrade multiple times and are stuck with the same issue.
2017/05/16 12:09:03.064874 [INFO ] core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2017/05/16 12:09:03.064888 [ERROR] core: failed to mount entry: path=cubbyhole/ error=cannot mount under existing mount 'cubbyhole/'
2017/05/16 12:09:03.064900 [INFO ] core: pre-seal teardown starting
2017/05/16 12:09:03.065014 [INFO ] core: pre-seal teardown complete
2017/05/16 12:09:03.065024 [ERROR] core: post-unseal setup failed: error=failed to setup mount table
@gobins What is your exact upgrade procedure?
I attempted an upgrade from 0.6.2 to 0.7.2 last night and ran into this issue with cubbyhole. We were only able to use one node against our 3-node Consul cluster prior to this because we were running into issues with Vault losing leadership about every 90 seconds, so there were no standby nodes. No matter which node I tried upgrading, I couldn't get 0.7.2 to start because it wouldn't unseal due to not being able to mount cubbyhole, with the exact error above.
@jefferai your response here cleared up our confusion: https://groups.google.com/d/msg/vault-tool/rOmQ5_PXj4Y/5iPZHRdlBQAJ
The following is the sequence of events that led to our issue:
I have the same problem. @gobins, how do I delete core/local-mounts?
@avinchakov we use consul as our storage backend. If you are using consul, you should be able to do "consul kv delete vault/core/local-mounts".
No, I use MySQL. I tried requests to sys/raw/core/local-mounts but it didn't work.
@avinchakov you will have to log on to the mysql db and remove the records manually.
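A hedged sketch of what that manual removal could look like, assuming the MySQL backend's default database and table name of vault and its vault_key column; verify your actual schema and take a dump first:

mysqldump vault vault > vault-backup.sql
mysql vault -e "SELECT vault_key FROM vault WHERE vault_key = 'core/local-mounts'"
mysql vault -e "DELETE FROM vault WHERE vault_key = 'core/local-mounts'"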
I did it yesterday. Worked well, thank you
How we got there: unmounting /cubbyhole prior to the upgrade did not work (HTTP 400: * cannot unmount 'cubbyhole/').
(As a side note: I'm using another Vault setup myself with an S3 backend; the same upgrade strategy did not cause any trouble…)