hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

"Too many open files" on Vault #867

Closed: brockoffdev closed this issue 8 years ago

brockoffdev commented 8 years ago

Hi All-

Strange issue I recently ran into with Vault (v0.4.0) and Consul (v0.5.2). The system had been up and running for a solid month (save for the downtime to upgrade to 0.4.0) with no problems until today. This morning we noted an inability to reach the Vault servers, and upon checking the logs we saw a large number of errors of the form "too many open files":

16:48:52 http: Accept error: accept tcp [::]:8200: accept4: too many open files; retrying in 1s

We were able to rectify the issue simply by restarting Vault (which in itself transitioned mastership to our backup Vault master machine, so failover worked very well), but we are working to find a solution to this problem. We noted that this appears to be a Consul connection-limit issue (see: https://github.com/hashicorp/consul/issues/688), but were hoping someone here may have run into it. Any thoughts?

jefferai commented 8 years ago

Hi @brockoffdev,

We've seen this in the past with very large numbers of leases all expiring at once and trying to open huge numbers of connections to Consul, but we actually implemented a connection pool in the Consul physical backend to prevent that. So offhand I don't have a good idea.

Raising the ulimit for the user running Vault may help, but first it would be nice to understand what the open files are and where they are being used. Various combinations of lsof flags may help a lot; if your Vault user is vault, try:
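Something along these lines ought to do it (assuming the user is literally named vault; exact flags can vary by platform):

lsof -a -i -u vault     # open network (IP) connections held by the vault user
lsof -a -U -u vault     # open Unix domain sockets held by the vault user
lsof -u vault           # every file the vault user has open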

to see open network connections, open Unix sockets, and all open files by that user, respectively. Hopefully that will find something immediately apparent; even if it's Vault itself, it would be helpful to figure out where Vault is receiving connections from (which aren't being dropped/closed) or making connections to (potentially too many at a time).

Let me know what you find out!

brockoffdev commented 8 years ago

Ah, so interesting, it does appear that Vault is establishing a pretty large number of TCP connections:

vault     17229 vault  136u  IPv6 2582322      0t0  TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-0-171.ec2.internal:3746 (ESTABLISHED)

vault     17229 vault  137u  IPv6 2582324      0t0  TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-0-171.ec2.internal:3747 (ESTABLISHED)

vault     17229 vault  138u  IPv6 2582326      0t0  TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-0-171.ec2.internal:3748 (ESTABLISHED)

vault     17229 vault  139u  IPv6 2582328      0t0  TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-0-171.ec2.internal:3749 (ESTABLISHED)

vault     17229 vault  140u  IPv6 2582330      0t0  TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-2-188.ec2.internal:64015 (ESTABLISHED)

vault     17229 vault  141u  IPv6 2582331      0t0  TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-2-188.ec2.internal:64016 (ESTABLISHED)

vault     17229 vault  142u  IPv6 2582388      0t0  TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-0-171.ec2.internal:3761 (ESTABLISHED)

vault     17229 vault  143u  IPv4 2582419      0t0  TCP ip-10-10-3-86.ec2.internal:50248->205.251.242.222:https (ESTABLISHED)

vault     17229 vault  144u  IPv4 2582422      0t0  TCP ip-10-10-3-86.ec2.internal:50249->205.251.242.200:https (ESTABLISHED)

vault     17229 vault  145u  IPv4 2582425      0t0  TCP ip-10-10-3-86.ec2.internal:50250->205.251.242.200:https (ESTABLISHED)

vault     17229 vault  146u  IPv4 2582428      0t0  TCP ip-10-10-3-86.ec2.internal:50251->205.251.242.200:https (ESTABLISHED)

vault     17229 vault  147u  IPv4 2582431      0t0  TCP ip-10-10-3-86.ec2.internal:50252->205.251.242.200:https (ESTABLISHED)

vault     17229 vault  148u  IPv4 2582434      0t0  TCP ip-10-10-3-86.ec2.internal:50253->205.251.242.200:https (ESTABLISHED)

vault     17229 vault  149u  IPv4 2582437      0t0  TCP ip-10-10-3-86.ec2.internal:50254->205.251.242.200:https (ESTABLISHED)

vault     17229 vault  150u  IPv4 2582440      0t0  TCP ip-10-10-3-86.ec2.internal:50255->205.251.242.200:https (ESTABLISHED)

vault     17229 vault  151u  IPv6 2582544      0t0  TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-0-171.ec2.internal:3781 (ESTABLISHED)

vault     17229 vault  152u  IPv6 2582588      0t0  TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-2-188.ec2.internal:64054 (ESTABLISHED)

vault     17229 vault  153u  IPv4 2582633      0t0  TCP ip-10-10-3-86.ec2.internal:41462->72.21.214.99:https (ESTABLISHED)

vault     17229 vault  154u  IPv4 2582636      0t0  TCP ip-10-10-3-86.ec2.internal:41463->72.21.214.99:https (ESTABLISHED)

vault     17229 vault  155u  IPv4 2582639      0t0  TCP ip-10-10-3-86.ec2.internal:41464->72.21.214.99:https (ESTABLISHED)

vault     17229 vault  156u  IPv4 2582642      0t0  TCP ip-10-10-3-86.ec2.internal:41465->72.21.214.99:https (ESTABLISHED)

vault     17229 vault  157u  IPv4 2582645      0t0  TCP ip-10-10-3-86.ec2.internal:41466->72.21.214.99:https (ESTABLISHED)

vault     17229 vault  158u  IPv4 2582648      0t0  TCP ip-10-10-3-86.ec2.internal:41467->72.21.214.99:https (ESTABLISHED)

vault     17229 vault  159u  IPv4 2582651      0t0  TCP ip-10-10-3-86.ec2.internal:41468->72.21.214.99:https (ESTABLISHED)

vault     17229 vault  160u  IPv4 2582654      0t0  TCP ip-10-10-3-86.ec2.internal:41469->72.21.214.99:https (ESTABLISHED)

vault     17229 vault  161u  IPv4 2582661      0t0  TCP ip-10-10-3-86.ec2.internal:46620->iam.amazonaws.com:https (ESTABLISHED)

vault     17229 vault  162u  IPv4 2582664      0t0  TCP ip-10-10-3-86.ec2.internal:46621->iam.amazonaws.com:https (ESTABLISHED)

vault     17229 vault  163u  IPv4 2582667      0t0  TCP ip-10-10-3-86.ec2.internal:46622->iam.amazonaws.com:https (ESTABLISHED)

vault     17229 vault  164u  IPv4 2582670      0t0  TCP ip-10-10-3-86.ec2.internal:46623->iam.amazonaws.com:https (ESTABLISHED)

vault     17229 vault  165u  IPv4 2582673      0t0  TCP ip-10-10-3-86.ec2.internal:46624->iam.amazonaws.com:https (ESTABLISHED)

vault     17229 vault  166u  IPv4 2582676      0t0  TCP ip-10-10-3-86.ec2.internal:46625->iam.amazonaws.com:https (ESTABLISHED)

vault     17229 vault  167u  IPv4 2582679      0t0  TCP ip-10-10-3-86.ec2.internal:46626->iam.amazonaws.com:https (ESTABLISHED)

vault     17229 vault  168u  IPv4 2582682      0t0  TCP ip-10-10-3-86.ec2.internal:46627->iam.amazonaws.com:https (ESTABLISHED)

It should be noted that ip-10-10-3-86.ec2.internal refers to both the Vault master and the consul agent running on the Vault box; the vault user has an unlimited ulimit. But 168 connections open...
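For completeness, the limit the running process actually inherited can be confirmed directly (ulimits are applied per process at start; 17229 is the Vault PID from the lsof output above):

grep 'Max open files' /proc/17229/limits    # limit in effect for the running process
ls /proc/17229/fd | wc -l                   # descriptors it currently holds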

jefferai commented 8 years ago

My main takeaway was "unlimited ulimit and they're still seeing this...?" :-)

168 seems a lot-ish, but ulimit for a normal user is usually (these days) 2048 or 4096...so there's certainly no slam-dunk here.

What about non-Vault processes on the box? If the user has no limit, it's possible that the box itself has run out of connections.
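A couple of quick system-wide checks along those lines (standard Linux /proc paths):

cat /proc/sys/fs/file-nr     # allocated / free / max file handles, system-wide
cat /proc/sys/fs/file-max    # the system-wide ceiling
ss -s                        # socket usage summary (TCP, UDP, etc.)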

brockoffdev commented 8 years ago

Thanks @jefferai, appreciate the help, btw!

Nope, the vault user is by far using the most connections. Nothing else is even coming close. I'll keep monitoring for the time being. :)

jefferai commented 8 years ago

OK. Try using whatever tools you can think of to find a lot of open files on the system, or in open unix sockets, or IP sockets. I'm fine assuming it's Vault until proven otherwise, but I can't think of a change in 0.4 offhand that should cause any such problem, and we in fact explicitly put in changes in 0.4 to prevent this kind of problem (for consul and etcd backends). So I'm currently mystified.
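For example, a rough way to rank processes by open descriptor count (run as root so every /proc/<pid>/fd is readable):

for d in /proc/[0-9]*; do
  printf '%s %s\n' "$(ls "$d/fd" 2>/dev/null | wc -l)" "$(cat "$d/comm" 2>/dev/null)"
done | sort -rn | head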

brockoffdev commented 8 years ago

Yeah, definitely a curious bug. I wish I could go back and see if 0.3.1 had any of these same issues, but I know this has not come up prior to the 0.4.0 upgrade.

brockoffdev commented 8 years ago

@jefferai so, very curious thing...I noticed that most of the open connections were being maintained by the AWS backend on Vault (see: iam.amazonaws.com, 205.251.242.200). Just as a test (since our aws mount isn't critical), I decided to unmount our aws mount point, restart Vault, and remount the aws mount point.
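In case it's useful to anyone else, that amounted to roughly the following (0.x-era CLI, and assuming the backend sits at the default aws/ mount path):

vault unmount aws    # drops the mount and revokes its outstanding leases
# restart the vault service, then:
vault mount aws      # remount, then re-add the root credentials and roles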

Sure enough, the issue seems to be rectified! I'm going to keep an eye on the issue moving forward just to be sure, but it would seem that something with the aws backend caused the connection pool to become overloaded. Thoughts?

jefferai commented 8 years ago

There are only two things that changed specific to the AWS backend during the 0.4 release cycle:

  1. We changed how the http client was being created to fix #700 -- a problem you're very aware of. I believe you have been using the custom binary I created, which already had these changes in it; if so, that likely eliminates this as the problem.
  2. We updated to Amazon's new official aws-sdk-go, which hit 1.0 and included a great many internal changes.

I would bet a decent sum that (2) is the cause of your problem. I can tell you that since 0.4 was released they've already gone from 1.0.2 to 1.0.5, so they're updating it at a good clip. Unfortunately the release notes only list feature improvements, not bug fixes.

If you have a Go dev environment, I'd be interested in seeing how things behave with 1.0.5 -- I could bump the dep in our tree if you like. Or, I could make you a binary with a newer version of their SDK.
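If you want to try it yourself first, a rough sketch of the local test (assuming a standard GOPATH checkout of Vault; your setup may differ):

cd $GOPATH/src/github.com/hashicorp/vault
go get -u github.com/aws/aws-sdk-go/...    # pull a newer aws-sdk-go into the GOPATH
make dev                                   # rebuild a local vault binary against it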

brockoffdev commented 8 years ago

@jefferai thanks much for that. Let's keep that idea on hold for now. I've been doing some testing against the newly created AWS backend and all seems to be fine. The connection pool has reached at most 45 connections, way down from the 165 I saw earlier (and the 245 at its peak). It seems like the aws-sdk-go client may be a good place to investigate...

jefferai commented 8 years ago

OK. I'll keep this issue open; if you think it's resolved feel free to close it (or let me know and I'll close it).

jefferai commented 8 years ago

@brockoffdev While it could be a combination of multiple factors, it's actually possible it's related to (1) :-(

See https://github.com/fsouza/go-dockerclient/pull/434 -- it seems that Golang doesn't properly close idle connections when a client goes out of scope/is GCd. I'm trying to get more info from the reporter.

jefferai commented 8 years ago

@brockoffdev If that is the case it would seem pure luck that you didn't hit it with the 0.3.1 build.

alexw23 commented 8 years ago

I'm getting this issue also.