hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

Vault goes down in production so frequently #1585

Closed ruchi15 closed 8 years ago

ruchi15 commented 8 years ago

Hi, we are using vault in production with consul as an HA-enabled backend. We see that vault goes down very frequently. We have already enabled monitoring of the vault and consul servers, so whenever they go down we capture that. But the question is why it goes down so frequently when nothing shows up in the audit logs or in the server log that vault appends to when we start it.

Please comment or tell me the likely reason, as this is a very critical system for us. Any comments would be appreciated.
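For context, a minimal sketch of how audit and server logging can be captured so there is something to inspect after an outage. This assumes the 0.6-era CLI syntax and placeholder paths; newer Vault versions use `vault audit enable` instead:

  # Enable a file audit backend so every request/response is recorded
  # (0.6-era syntax; path is a placeholder).
  vault audit-enable file file_path=/var/log/vault_audit.log

  # Keep the server's own output as well -- seal events, storage errors and
  # crashes show up here, not in the audit log (config path is a placeholder).
  vault server -config=/etc/vault/vault.hcl 2>&1 | tee -a /var/log/vault_server.log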

jefferai commented 8 years ago

@ruchi15 Without logs from Consul and Vault there is absolutely nothing we can tell you, as it does not go down regularly for us or most other users.
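As a rough sketch of the kind of log collection that makes a report like this actionable (the systemd unit names are assumptions; `consul monitor` is part of the Consul CLI):

  # Pull vault and consul server logs around the time of an outage
  # (assumes both run under systemd with these unit names).
  journalctl -u vault -u consul --since "1 hour ago"

  # Stream debug-level logs from a running consul agent while reproducing
  consul monitor -log-level=debug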

skippy commented 8 years ago

Hi @ruchi15,

as jeff mentioned, it is a guessing game without logs. I'm also not sure what you mean by 'down'; the one issue we constantly ran into is that if consul is undergoing lots of leader elections, vault becomes unstable. We didn't have to unseal any of our 3 vault instances, but anything connecting to vault would be rejected until consul stabilized. And there is (for our infrastructure) a 1-3 second delay between a new consul leader being accepted and vault starting to accept connections again.

All this makes perfect sense. For us, we just had to make sure consul was stable to have a stable vault experience.

Once consul was stabilized (larger ec2 instance, isolated instances so something else wouldn't peg cpu, disk, or network IO) and we upgraded to Vault 0.6 to take advantage of the active.vault.* DNS for our routing, things have been wonderfully stable. @ruchi15 you may want to check out the forum: https://groups.google.com/forum/#!forum/vault-tool as this has come up a bit and folks have posted some useful insights.
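A quick way to see which node consul and vault currently consider active, useful when correlating those 1-3 second gaps (assumes a local consul agent with the HTTP API on 8500 and DNS on the default port 8600):

  # Current consul raft leader, as seen by the local agent
  curl -s http://127.0.0.1:8500/v1/status/leader

  # The active vault node, via the tag-based DNS name mentioned above
  dig +short @127.0.0.1 -p 8600 active.vault.service.consul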

jefferai commented 8 years ago

@skippy thanks for providing all that info!

enkaskal commented 8 years ago

I'll mimic @skippy here WRT logs (although 1-3 seconds for a leader election is crazy long in my experience!), but we've been running vault in production backed by an HA consul cluster since vault 0.4.0 and consul 0.5.x, and it has been solid in AWS. The only time we've had an issue is that occasionally (very rarely) we see a vault master/slave switch that our app doesn't catch and retry (open ticket with dev, soon to be fixed :). Other than that it's been working flawlessly; with one exception...
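On the catch-and-retry point, a crude client-side sketch of riding out a brief leader switch (the secret path and retry counts are arbitrary examples; a real client would use its Vault library's retry support):

  # Retry a vault read a few times so a short failover doesn't surface as an error
  # (VAULT_ADDR/VAULT_TOKEN assumed to be set; path is an example).
  for i in 1 2 3 4 5; do
    vault read secret/myapp/config && break
    sleep 2
  done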

We did have an issue, a disaster event actually, when upgrading our production environment from 0.5.x to 0.6.x because we missed the notice at https://www.consul.io/docs/upgrade-specific.html regarding the switch in raft backend storage.

Fortunately, we were able to recover from that event with minimal impact, and while we're not sure why the auto-migration was removed, it was a good lesson learned for us and therefore a win overall (Since then we've instituted a policy to always check the release notes and that page prior to any changes! :)

Moreover, again as @skippy mentioned, you should absolutely make sure your consul cluster is stable. I've seen (many times in our DevTest env) an unstable consul cluster cause severe issues in vault, particularly when performing a rolling upgrade. I highly, highly recommend watching raft peers.json (from all consul server nodes!) any time you roll your cluster to make sure raft is in agreement. If it isn't, you're going to have a hard time, and the only way we've found to recover is to bring up a consul server node at the same IP to satisfy the cluster before trying the remove/leave again.
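A rough way to watch raft membership from every server during a roll (the IPs are placeholders; peers.json lives under Consul's data directory, commonly /var/lib/consul on these versions):

  # Compare the raft peer list as reported by each consul server's HTTP API
  for host in 10.0.0.11 10.0.0.12 10.0.0.13; do
    echo "== $host =="
    curl -s "http://$host:8500/v1/status/peers"
    echo
  done

  # Or inspect the on-disk peers.json on each server node
  cat /var/lib/consul/raft/peers.json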

Finally, if you haven't already gone through the raft tutorial at: http://thesecretlivesofdata.com/raft/ I highly recommend it!

Hope that helps :)

P.S. backups (e.g. via consul-backinator, or whatever) are (of course) best practice; as is testing restores!

my $0.02
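On the backup point, two hedged options: `consul snapshot` requires Consul 0.7.1 or newer, so it postdates the versions discussed in this thread, and the KV dump is a cruder fallback:

  # Consul 0.7.1+: atomic snapshot of the raft state (includes vault's data)
  consul snapshot save vault-backup.snap
  consul snapshot restore vault-backup.snap   # practice this against a test cluster

  # Older consul: a crude KV export (values come back base64-encoded in JSON)
  curl -s 'http://127.0.0.1:8500/v1/kv/?recurse' > kv-dump.json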

skippy commented 8 years ago

@enkaskal I should have mentioned that I only saw those 1-3 second delays in vault leader election when consul was fluttering a lot (caused by high IO on multiple consul leader boxes, in our case). But for the occasional consul leader change (which now happens ~1-3 times per day) we see zero vault transitions. It may have been, in our case, that consul fluttering so heavily caused vault to try new leaders and then take a bit of time to recover. But honestly, I didn't dive into it all that much because the significant consul fluttering (which was our fault!) was clearly the root cause and needed to be addressed asap. It was, and vault has been quite stable ever since.

enkaskal commented 8 years ago

@skippy thanks for the follow-up! I find it interesting that you're still seeing leader elections ~1-3x/day at present. We go days and weeks without any, although we're probably not as scaled. Do you have any insight into why they're so frequent? Would you be willing to share how many consul servers and clients you're running? Also, are you using some sort of auto-scaling for them?

Thanks again, and any insight is much appreciated :)

skippy commented 8 years ago

hey @enkaskal, definitely feel free to ping me directly if you want to chat. This response will be a bit long and rambling.

tl;dr: when we used m4.large with just consul/vault, we rarely saw fluttering at all; but we are running m3.medium as it is cheaper and we can live with consul fluttering every now and then. The occasional consul master flutter doesn't seem to affect consul or vault cluster stability at all.

background

We originally had our 3 consul servers running on 3 separate AWS m4.2xlarge instances within docker containers. Other processes were running on these boxes (nothing web-facing, but back-end-facing services like vault, elasticsearch, queue workers, etc). We realized over a few months of running in production that consul and vault were our primary 'points of failure': if anything systematic happened to that cluster of boxes, the whole site would go down. We ran into all sorts of unrelated issues that caused consul to go down: queue workers slamming the box's memory, cpu, and network/disk IO; elasticsearch flooding the logs (which causes issues with SystemD's LoggerD as well as docker, which had (as of docker 1.9) an unbounded logging queue); docker losing UDP connectivity if an instance is restarted; etc. There have been lots of outages for various reasons, and none of them would have fully taken down the site except that they stressed consul, and that caused all sorts of problems.

improvements

all 3 of those have been huge wins for us:

details

We moved from using m4.2xlarge to m3.medium for the consul/vault cluster. We found that m4.large worked really well and was rock-solid, but it was also a lot of machine just sitting there. We are currently using m3.medium, and have found that it works but isn't as solid... its network performance is low enough that the consul master switches 1-3 times a day, and then every month or so we see a significant rise in consul master fluttering (on Monday we saw ~37 master switches over a 4 hr period, something we never saw over 3 months running m4.large; it looked like ping time between instances across AZs jumped a lot and was also very variable). But, for our infrastructure, the occasional flutter is not a problem, so we live with m3.medium for now.
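A trivial sketch of how those leader switches could be counted, independent of any monitoring stack (the polling interval and log path are arbitrary; it only watches the local agent's view of the raft leader):

  # Log a timestamped line every time the consul raft leader changes
  last=""
  while true; do
    cur=$(curl -s http://127.0.0.1:8500/v1/status/leader)
    if [ -n "$last" ] && [ "$cur" != "$last" ]; then
      echo "$(date -u +%FT%TZ) leader changed: $last -> $cur" >> /var/log/consul-leader.log
    fi
    last="$cur"
    sleep 10
  done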

Infrastructure

At a high level:

skippy commented 8 years ago

@enkaskal and I've been watching hashicorp/consul#1212 as I suspect this will improve the occasional fluttering behavior I'm seeing when running consul on m3.medium.

jefferai commented 8 years ago

@skippy Some comments:

Consul 0.5+ has been much more stable for us

Hopefully you're keeping fairly up-to-date as I think handling instability has been a focus of the 0.6 series. In particular:

While I'm a fan of docker, we seem to be very good at finding stability issues with it (UDP packets getting dropped, logger queue being unbounded and causing memory exhaustion, etc). Docker is quickly evolving, but we are using CoreOS which uses older versions of docker, but docker 1.10 seems to be much more stable for us

This is unfortunately quite true -- but recent Consul has a TCP-based "catch-up" mechanism, and I believe the next version coming out has features to prevent a node from being removed on a single bad report without some agreement from other nodes. So that should help, and should also help with the occasional ping blip causing a leader election. I know you said that you are running Consul on the host now, which is recommended, but otherwise, as recommended at https://hub.docker.com/_/consul/, you should bind the network in the Consul container to the host, which should keep Docker's UDP wonkiness from affecting things.
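A hedged example of that host-networking setup, using the official image from the hub.docker.com page linked above (the flags after `agent` are illustrative placeholders, not a complete server configuration):

  # Run the consul agent with the container bound to the host's network stack,
  # so gossip (UDP) traffic doesn't pass through Docker's NAT/proxy layer.
  docker run -d --name=consul --net=host consul agent \
    -server -bootstrap-expect=3 -bind=10.0.0.11 -client=0.0.0.0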

skippy commented 8 years ago

@jefferai as always, thanks for your comments and details!

Consul 0.5+ has been much more stable for us

Hopefully you're keeping fairly up-to-date as I think handling instability has been a focus of the 0.6 series.

oh definitely! 0.6 has been even better, and I'm very much looking forward to the changes coming down the pike that are in consul master. We have run into hashicorp/consul#1212 quite a bit, but we now know the underlying system triggers (outlined below) and avoid them. But the next release should address our last remaining (known) stability issue with consul (and thus vault) in our environment.

It is interesting how unrelated issues can cause consul instability, especially on smaller instances. The chain of events below won't take down a consul cluster on >= m4.large, but when running <= m3.large it will cause consul instability and take the whole cluster down if it happens on one consul server:

  1. lots of short-lived utility docker containers are triggered (think short-lived cron jobs)
  2. over time (3-4 weeks in our case), this causes the /run directory on coreOS, which is backed by tmpfs (i.e. memory) to grow. See https://github.com/coreos/bugs/issues/1424#issuecomment-233062476 and https://github.com/coreos/bugs/issues/1081
  3. once this grows to a large enough % of total memory, it triggers kswapd0 to continuously thrash, per: https://bugzilla.kernel.org/show_bug.cgi?id=65201.
  4. the thrashing is enough to spike CPU and IO and cause consul to flutter, but not enough for consul or AWS health checks to terminate the instance. On smaller instances such as t2, AWS does seem to terminate the instance, but m3 doesn't quite reach that threshold. When this occurs on an m3 instance there is no going back; it eventually makes the whole cluster unstable, we think because of hashicorp/consul#1212. On m4.large or larger, the percentage of memory used by tmpfs is low enough relative to the total box that it doesn't trigger kswapd0 as easily, and even if kswapd0 is triggered, there is enough spare CPU and IO capacity that it doesn't cause consul to flutter and trigger hashicorp/consul#1212. But we run 'on the edge' with m3, so we have alerts in place (sketched below) to avoid kswapd0 being triggered in the first place.
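A minimal sketch of the kind of alert check mentioned in step 4 (the 30% threshold and the use of `wall` as the alert mechanism are placeholders for whatever your monitoring actually does):

  # Warn when the /run tmpfs passes a usage threshold, before kswapd0 starts thrashing
  usage=$(df /run | awk 'NR==2 {gsub("%","",$5); print $5}')
  if [ "$usage" -gt 30 ]; then
    echo "WARNING: /run tmpfs at ${usage}% -- clean up stale container state" | wall
  fi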

I should note that the underlying SystemD issue is fixed in the latest CoreOS stable branch, and since we moved to it a few weeks ago we have had no triggers of kswapd0 and thus no hashicorp/consul#1212.

@ruchi15 I took your ticket and went off on a bit of a tangent about something very specific to our environment; does any of this help, or are you still having issues?

jefferai commented 8 years ago

As this hasn't had any attention from the OP, closing for now.

divvy19 commented 6 years ago

@ruchi15 In your first comment you said you had already enabled monitoring of the vault and consul servers so you capture whenever they go down. I'm curious how you did that -- using some tool, or API calls? It would be great if you could help me out with this. Any other comments on how to set up alerting and monitoring around vault would also be helpful. Thanks in advance.
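One common approach (not necessarily what the OP used) is to poll Vault's sys/health endpoint, which encodes the node state in the HTTP status code, and wire the result into whatever alerting system you already run:

  # 200 = active, 429 = standby, 503 = sealed, 501 = not initialised
  # (VAULT_ADDR assumed to point at the node being checked)
  code=$(curl -s -o /dev/null -w '%{http_code}' "$VAULT_ADDR/v1/sys/health")
  case "$code" in
    200|429) echo "vault ok ($code)" ;;
    *)       echo "vault unhealthy ($code) -- alert" ;;
  esac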