Hi @brockoffdev,
We've seen this in the past when very large numbers of leases all expiring at once tried to open huge numbers of connections to Consul, but we actually implemented a connection pool in the Consul physical backend to prevent exactly that. So offhand I don't have a good idea.
Raising the ulimit for the user running Vault may help, but first it would be nice to understand what the open files are and where they're coming from. Various combinations of lsof flags may help a lot; if your Vault user is vault, try:
lsof -a -i -uvault
lsof -a -U -uvault
lsof -uvault
to see open network connections, open Unix sockets, and all open files by that user, respectively. Hopefully that will find something immediately apparent; even if it's Vault itself, it would be helpful to figure out where Vault is receiving connections from (which aren't being dropped/closed) or making connections to (potentially too many at a time).
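If it does turn out to be mostly TCP connections, something along these lines (off the top of my head, untested against your box) should group the vault process's established connections by remote host so the worst offender stands out:
lsof -a -i TCP -n -P -uvault | grep ESTABLISHED | awk '{split($9, a, "->"); print a[2]}' | cut -d: -f1 | sort | uniq -c | sort -rn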
Let me know what you find out!
Ah, so interesting, it does appear that Vault is establishing a pretty large number of TCP connections:
vault 17229 vault 136u IPv6 2582322 0t0 TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-0-171.ec2.internal:3746 (ESTABLISHED)
vault 17229 vault 137u IPv6 2582324 0t0 TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-0-171.ec2.internal:3747 (ESTABLISHED)
vault 17229 vault 138u IPv6 2582326 0t0 TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-0-171.ec2.internal:3748 (ESTABLISHED)
vault 17229 vault 139u IPv6 2582328 0t0 TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-0-171.ec2.internal:3749 (ESTABLISHED)
vault 17229 vault 140u IPv6 2582330 0t0 TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-2-188.ec2.internal:64015 (ESTABLISHED)
vault 17229 vault 141u IPv6 2582331 0t0 TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-2-188.ec2.internal:64016 (ESTABLISHED)
vault 17229 vault 142u IPv6 2582388 0t0 TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-0-171.ec2.internal:3761 (ESTABLISHED)
vault 17229 vault 143u IPv4 2582419 0t0 TCP ip-10-10-3-86.ec2.internal:50248->205.251.242.222:https (ESTABLISHED)
vault 17229 vault 144u IPv4 2582422 0t0 TCP ip-10-10-3-86.ec2.internal:50249->205.251.242.200:https (ESTABLISHED)
vault 17229 vault 145u IPv4 2582425 0t0 TCP ip-10-10-3-86.ec2.internal:50250->205.251.242.200:https (ESTABLISHED)
vault 17229 vault 146u IPv4 2582428 0t0 TCP ip-10-10-3-86.ec2.internal:50251->205.251.242.200:https (ESTABLISHED)
vault 17229 vault 147u IPv4 2582431 0t0 TCP ip-10-10-3-86.ec2.internal:50252->205.251.242.200:https (ESTABLISHED)
vault 17229 vault 148u IPv4 2582434 0t0 TCP ip-10-10-3-86.ec2.internal:50253->205.251.242.200:https (ESTABLISHED)
vault 17229 vault 149u IPv4 2582437 0t0 TCP ip-10-10-3-86.ec2.internal:50254->205.251.242.200:https (ESTABLISHED)
vault 17229 vault 150u IPv4 2582440 0t0 TCP ip-10-10-3-86.ec2.internal:50255->205.251.242.200:https (ESTABLISHED)
vault 17229 vault 151u IPv6 2582544 0t0 TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-0-171.ec2.internal:3781 (ESTABLISHED)
vault 17229 vault 152u IPv6 2582588 0t0 TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-2-188.ec2.internal:64054 (ESTABLISHED)
vault 17229 vault 153u IPv4 2582633 0t0 TCP ip-10-10-3-86.ec2.internal:41462->72.21.214.99:https (ESTABLISHED)
vault 17229 vault 154u IPv4 2582636 0t0 TCP ip-10-10-3-86.ec2.internal:41463->72.21.214.99:https (ESTABLISHED)
vault 17229 vault 155u IPv4 2582639 0t0 TCP ip-10-10-3-86.ec2.internal:41464->72.21.214.99:https (ESTABLISHED)
vault 17229 vault 156u IPv4 2582642 0t0 TCP ip-10-10-3-86.ec2.internal:41465->72.21.214.99:https (ESTABLISHED)
vault 17229 vault 157u IPv4 2582645 0t0 TCP ip-10-10-3-86.ec2.internal:41466->72.21.214.99:https (ESTABLISHED)
vault 17229 vault 158u IPv4 2582648 0t0 TCP ip-10-10-3-86.ec2.internal:41467->72.21.214.99:https (ESTABLISHED)
vault 17229 vault 159u IPv4 2582651 0t0 TCP ip-10-10-3-86.ec2.internal:41468->72.21.214.99:https (ESTABLISHED)
vault 17229 vault 160u IPv4 2582654 0t0 TCP ip-10-10-3-86.ec2.internal:41469->72.21.214.99:https (ESTABLISHED)
vault 17229 vault 161u IPv4 2582661 0t0 TCP ip-10-10-3-86.ec2.internal:46620->iam.amazonaws.com:https (ESTABLISHED)
vault 17229 vault 162u IPv4 2582664 0t0 TCP ip-10-10-3-86.ec2.internal:46621->iam.amazonaws.com:https (ESTABLISHED)
vault 17229 vault 163u IPv4 2582667 0t0 TCP ip-10-10-3-86.ec2.internal:46622->iam.amazonaws.com:https (ESTABLISHED)
vault 17229 vault 164u IPv4 2582670 0t0 TCP ip-10-10-3-86.ec2.internal:46623->iam.amazonaws.com:https (ESTABLISHED)
vault 17229 vault 165u IPv4 2582673 0t0 TCP ip-10-10-3-86.ec2.internal:46624->iam.amazonaws.com:https (ESTABLISHED)
vault 17229 vault 166u IPv4 2582676 0t0 TCP ip-10-10-3-86.ec2.internal:46625->iam.amazonaws.com:https (ESTABLISHED)
vault 17229 vault 167u IPv4 2582679 0t0 TCP ip-10-10-3-86.ec2.internal:46626->iam.amazonaws.com:https (ESTABLISHED)
vault 17229 vault 168u IPv4 2582682 0t0 TCP ip-10-10-3-86.ec2.internal:46627->iam.amazonaws.com:https (ESTABLISHED)
It should be noted that ip-10-10-3-86.ec2.internal references both the Vault master and the consul-agent running on the Vault box; the vault user has an unlimited ulimit. But 168 open connections...
My main takeaway was "unlimited ulimit and they're still seeing this...?" :-)
168 seems a lot-ish, but ulimit for a normal user is usually (these days) 2048 or 4096...so there's certainly no slam-dunk here.
What about non-Vault processes on the box? If the user has no limit, it's possible that the box itself has run out of connections.
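Something along these lines (rough commands, from memory) should show whether the box as a whole is anywhere near its file-handle limit, and how many descriptors the Vault process itself is holding:
cat /proc/sys/fs/file-nr    # allocated / free / system-wide max
ls /proc/17229/fd | wc -l   # swap in whatever PID your Vault process has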
Thanks @jefferai, appreciate the help, btw!
Nope, the vault user is by far using the most connections. Nothing else is even coming close. I'll keep monitoring for the time being. :)
OK. Try using whatever tools you can think of to find anything holding a lot of open files on the system, whether regular files, Unix sockets, or IP sockets. I'm fine assuming it's Vault until proven otherwise, but I can't think of a change in 0.4 offhand that should cause any such problem; in fact, we explicitly put in changes in 0.4 to prevent this kind of problem (for the Consul and etcd backends). So I'm currently mystified.
Yeah, definitely a curious bug. I wish I could go back and see if 0.3.1 had any of these same issues, but I know this had not come up prior to the 0.4.0 upgrade.
@jefferai so, very curious thing...I noticed that most of the open connections were being maintained by the AWS backend in Vault (see: iam.amazonaws.com, 205.251.242.200). Just as a test (since our aws mount isn't critical), I decided to unmount our aws mount point, restart Vault, and remount the aws mount point.
Sure enough, the issue seems to be rectified! I'm going to keep an eye on the issue moving forward just to be sure, but it would seem that something with the aws backend caused the connection pool to become overloaded. Thoughts?
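For anyone else who hits this, the workaround amounted to roughly the following (our mount is at the default aws/ path; adjust if yours isn't):
vault unmount aws
# restart and unseal Vault, then:
vault mount aws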
There are only two things that changed specific to the AWS backend during the 0.4 release cycle:
I would bet a decent sum that (2) is the cause of your problem. I can tell you that since 0.4 released they've already gone from 1.0.2 to 1.0.5 so they're updating it at a good clip. Unfortunately the release notes only list feature improvements, not bug fixes.
If you have a Go dev environment, I'd be interested in seeing how things behave with 1.0.5 -- I could bump the dep in our tree if you like. Or, I could make you a binary with a newer version of their SDK.
@jefferai thanks much for that. Let's keep that idea on hold for now. I've been doing some testing against the newly created AWS backend and all seems to be fine. The connection pool has reached at most 45 connections, very much down from the 165 I saw earlier...and even 245 at its peak. It seems like the aws-sdk-go client may be a good place to investigate...
OK. I'll keep this issue open; if you think it's resolved feel free to close it (or let me know and I'll close it).
@brockoffdev While it could be a combination of multiple factors, it's actually possible it's related to (1) :-(
See https://github.com/fsouza/go-dockerclient/pull/434 -- it seems that Golang doesn't properly close idle connections when a client goes out of scope/is GC'd. I'm trying to get more info from the reporter.
@brockoffdev If that is the case it would seem pure luck that you didn't hit it with the 0.3.1 build.
I'm getting this issue also.
Hi All-
Strange issue I recently ran into with Vault (v0.4.0) and Consul (v0.5.2). The system had been up and running with no problems for a solid month (save for the downtime to upgrade to 0.4.0), until today. This morning, we noted an inability to reach the Vault servers, and upon checking the logs, we saw a large number of errors of the type "too many open files":
16:48:52 http: Accept error: accept tcp [::]:8200: accept4: too many open files; retrying in 1s
We were able to rectify the issue simply by restarting Vault (which in itself transitioned the master unit to our backup Vault master machine, so that failover worked very well), but we are working to find a solution to this problem. We noted that this appears to be a Consul connection-limit issue (see: https://github.com/hashicorp/consul/issues/688), but were hoping someone on here may have run into this issue. Any thoughts?