taf2 / curb

Ruby bindings for libcurl

CLOSE_WAIT, too many open files #277

Closed heaven closed 5 years ago

heaven commented 8 years ago

This code:

require 'curb'

easy = Curl::Easy.new(site.rss_url)
easy.ssl_verify_peer = false
easy.follow_location = true
easy.max_redirects = 5
easy.connect_timeout = 15
easy.timeout = 20
easy.perform

leaks connections.

After some time I start getting errors from Sidekiq: Error fetching job: Too many open files - getaddrinfo.

$ netstat -an | awk '/tcp/ {print $6}' | sort | uniq -c:

   4024 CLOSE_WAIT
    163 ESTABLISHED
     12 LISTEN
     85 TIME_WAIT

$ lsof -p 6245 | grep CLOSE_WAIT | wc -l

4020

A part of lsof output:

ruby    6245 deploy 4085u  IPv4 445657907         0t0       TCP ...:38806->theautismeducationsite.com:http (CLOSE_WAIT)
ruby    6245 deploy 4086u  IPv4 445657898         0t0       TCP ...:42963->qm-in-f132.1e100.net:http (CLOSE_WAIT)
ruby    6245 deploy 4087u  IPv4 445657912         0t0       TCP ...:45454->qm-in-f118.1e100.net:http (CLOSE_WAIT)
ruby    6245 deploy 4088u  IPv4 445657915         0t0       TCP ...:49490->192.0.78.13:https (CLOSE_WAIT)
ruby    6245 deploy 4089u  IPv4 445661545         0t0       TCP ...:52097->qm-in-f121.1e100.net:http (CLOSE_WAIT)
ruby    6245 deploy 4090u  IPv4 445657929         0t0       TCP ...:42984->qm-in-f132.1e100.net:http (CLOSE_WAIT)
ruby    6245 deploy 4091u  IPv4 445658814         0t0       TCP ...:42973->qm-in-f132.1e100.net:http (CLOSE_WAIT)
ruby    6245 deploy 4092u  IPv4 445648740         0t0       TCP ...:60335->ps478203.dreamhost.com:http (CLOSE_WAIT)
ruby    6245 deploy 4093u  IPv4 445659562         0t0       TCP ...:42985->qm-in-f132.1e100.net:http (CLOSE_WAIT)
ruby    6245 deploy 4094u  IPv4 445659564         0t0       TCP ...:43010->qm-in-f132.1e100.net:http (CLOSE_WAIT)
ruby    6245 deploy 4095u  IPv4 445648749         0t0       TCP ...:43009->qm-in-f132.1e100.net:http (CLOSE_WAIT)
robuye commented 5 years ago

I could only test it on Ubuntu 16.04 on EC2, and I see leaked FDs there. That's on 0.9.7 only, but it's probably irrelevant. Same on the centos-ruby (2.6) Docker image.

I will see if I can debug this bug further over the weekend.

taf2 commented 5 years ago

Yeah, I must not be testing this correctly then... Here's a gist I created to help me run the loop and also monitor netstat... maybe I'm not filtering netstat correctly? Or maybe it's something about how I'm running it...

https://gist.github.com/taf2/683ca2f9cec226de44c8f992b1ca5cc2

taf2 commented 5 years ago

Okay, typical me! I was running 0.9.2 🤯

Anyway, the test script in the gist should be helpful in fixing this then.

robuye commented 5 years ago

Hah, that has happened to me too often. I noticed that calling GC.start closes all the FDs, if that's of any help.

taf2 commented 5 years ago

We can use this to add a regression test... we should also figure out how to get netstat working on BSD systems for macOS users.

taf2 commented 5 years ago

When I modify the test script to call GC.start after m.perform, the connection counts stay in the 2-3 range... so when GC runs, connections are cleaned up. The issue here is that we can either wait and let GC close the connections, or keep the connection around because the end user might want to use it again...

I'm thinking we might need an explicit option for keeping the connection open so it can be re-used, or closing it right away...
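
For reference, a minimal sketch of that workaround in the spirit of the gist's loop (the URL is a placeholder, not the one the gist actually hits): each iteration's multi handle becomes garbage, and the explicit GC.start lets its finalizer run curl_multi_cleanup, which closes the cached connections.

require 'curb'

100.times do
  m = Curl::Multi.new
  m.add(Curl::Easy.new('http://example.com/')) # placeholder URL
  m.perform
  GC.start # without this, CLOSE_WAIT sockets pile up until GC runs on its own
end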

robuye commented 5 years ago

Yeah, I was thinking something along those lines, but I don't understand why this changed as of 0.9.4, and I'm unsure whether the connections are actually reused as they should be :thinking:

robuye commented 5 years ago

Restoring these lines from e3126a fixes the problem:

  curl_multi_cleanup(rbcm->handle); /* closes the multi handle and any connections it has cached */
  rbcm->handle = NULL;

The FDs go down to zero over ~30 seconds. This delay is interesting; it looks like a timeout is kicking in?

robuye commented 5 years ago

Bear in mind it's not a real fix, because if it stops curb from reusing connections, that's a pretty serious regression.

robuye commented 5 years ago

Actually, looking at the description of this commit, it seems to be exactly what you were looking for, @taf2: closing FDs without waiting for GC.

Calling it at the end of perform seems reasonable, because at that point all the transfers have stopped.

The commit you added included more changes, and perhaps those would prevent curb from reusing connections.

The test code we were using creates 100 multi handles, each with 2 easy handles, so it doesn't show whether connections are reused. A high number of FDs is also expected. I will try to come up with another test to see whether we reuse connections or not.

taf2 commented 5 years ago

@robuye yeah, I think the re-use case is when doing something simple like:

res = Curl.get("https://www.google.com/")

...

res = Curl.get("https://www.google.com/") # should re-use the existing connection?

robuye commented 5 years ago

I think this would create separate multis, each with its own pool, and they wouldn't be able to reuse connections. At least that's my understanding of libcurl.

I was thinking we'd need multiple requests in a single multi, all to the same host (ideally HTTPS, so the connection setup overhead is significant). And those requests should be slow to respond, so the pool fills up and reuse begins.
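
Roughly this kind of setup, as a sketch (host, path, and counts are illustrative, not the actual test code):

require 'curb'

HOST = 'https://slow.example.com' # illustrative slow endpoint; every request hits the same host
multi = Curl::Multi.new

5.times do
  easy = Curl::Easy.new("#{HOST}/sleep/1")
  easy.on_complete { |e| puts "completed with status #{e.response_code}" }
  multi.add(easy)
end

multi.perform # transfers run in parallel; the pool fills up, so later transfers can reuse connections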

robuye commented 5 years ago

I think I have figured it out, and I also have some food for thought.

So libcurl will try to hold a connection open for as long as possible, and it will reuse cached connections where possible too. See this thread from the libcurl mailing list; it goes into detail about this behaviour, and Daniel explains it much better than I could.

In Ruby's case we keep the FDs open until the multi handle is GCed. GC calls curl_multi_cleanup(rbcm->handle); and then libcurl closes all the FDs. Curb hooks this into the multi finalizer, which ties the lifetime of the FDs to GC runs.

Note that once curl_multi_cleanup is called, the multi handle cannot be reused. Hooking it into the finalizer saves curb a lot of trouble because it always works regardless of how the handle is used; it just doesn't behave well in long-running, rarely-GCed processes.

The code I used to test this behaviour is significantly longer, so I will just link a gist here. It demonstrates how connections are reused and shows some interesting results, so give it a shot.

I enabled multi.max_connects = 1 there; it causes the multi handle to keep at most 1 connection in its cache, so this alone can mitigate the leaked FDs. It can't be 0, so at least 1 FD per multi will still leak (until GC kicks in). This may or may not work for you @pts-owentran & @morhekil, depending on how you use the multi handle (note that curb always uses a multi handle, even when you use the easy interface, and there is no way around that).
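
As a rough sketch, that mitigation looks like this from the caller's side (the URL is a placeholder; max_connects= is the setter used in the gist):

require 'curb'

multi = Curl::Multi.new
multi.max_connects = 1 # keep at most 1 cached connection in this multi's pool (0 is not allowed)
multi.add(Curl::Easy.new('http://example.com/')) # placeholder URL
multi.perform
# at most one FD stays open here until the multi handle is eventually GCed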

I pointed curb to my local server to minimize any network latency and get better insight into connections. Here's the script output:

System FDs: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
These FDs will be ignored.
Additional FDs open before start (0): []
Starting multi
Transfer 1 completed. Open FDs (2): [12, 13]
FD 12; socket: 192.168.16.2:35860 => 192.168.16.3:80
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80
Transfer 2 completed. Open FDs (2): [12, 13]
FD 12; socket: 192.168.16.2:35860 => 192.168.16.3:80
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80
Transfer 3 completed. Open FDs (2): [12, 13]
FD 12; socket: 192.168.16.2:35860 => 192.168.16.3:80
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80
Transfer 4 completed. Open FDs (1): [13]
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80
Transfer 5 completed. Open FDs (1): [13]
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80
All transfers completed
Open FDs after multi completed (1): [13]
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80
Sleeping 10 seconds...
..........
Open FDs after sleep (1): [13]
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80

Add 5 new transfers

FDs open before start (1): [13]
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80
Starting multi
Transfer 1 completed. Open FDs (1): [12]
FD 12; socket: 192.168.16.2:35870 => 192.168.16.3:80
Transfer 2 completed. Open FDs (1): [12]
FD 12; socket: 192.168.16.2:35870 => 192.168.16.3:80
Transfer 3 completed. Open FDs (1): [12]
FD 12; socket: 192.168.16.2:35870 => 192.168.16.3:80
Transfer 4 completed. Open FDs (1): [12]
FD 12; socket: 192.168.16.2:35870 => 192.168.16.3:80
Transfer 5 completed. Open FDs (1): [12]
FD 12; socket: 192.168.16.2:35870 => 192.168.16.3:80
All transfers completed
Open FDs after multi completed (1): [12]
FD 12; socket: 192.168.16.2:35870 => 192.168.16.3:80

And here's what happened according to the server:

{ "@timestamp":"2019-01-27T14:37:22+00:00","connection":"524 (192.168.16.2:35854)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.001","status":"200" }
{ "@timestamp":"2019-01-27T14:37:22+00:00","connection":"525 (192.168.16.2:35856)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.001","status":"200" }
{ "@timestamp":"2019-01-27T14:37:22+00:00","connection":"526 (192.168.16.2:35858)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.001","status":"200" }
{ "@timestamp":"2019-01-27T14:37:22+00:00","connection":"527 (192.168.16.2:35860)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }
{ "@timestamp":"2019-01-27T14:37:22+00:00","connection":"528 (192.168.16.2:35862)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }
{ "@timestamp":"2019-01-27T14:37:33+00:00","connection":"528 (192.168.16.2:35862)","connection_requests_count":"2","request":"GET /sleep/1 HTTP/1.1","request_time":"1.001","status":"200" }
{ "@timestamp":"2019-01-27T14:37:33+00:00","connection":"529 (192.168.16.2:35864)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.001","status":"200" }
{ "@timestamp":"2019-01-27T14:37:33+00:00","connection":"530 (192.168.16.2:35866)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.001","status":"200" }
{ "@timestamp":"2019-01-27T14:37:33+00:00","connection":"531 (192.168.16.2:35868)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }
{ "@timestamp":"2019-01-27T14:37:33+00:00","connection":"532 (192.168.16.2:35870)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }

All 10 requests were made by curb, but only 1 reused a connection. When I removed multi.max_connects = 1, the multi kept the initial 5 FDs open and reused them for the second batch of easy handles:

{ "@timestamp":"2019-01-27T14:46:12+00:00","connection":"538 (192.168.16.2:35914)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.001","status":"200" }
{ "@timestamp":"2019-01-27T14:46:12+00:00","connection":"539 (192.168.16.2:35916)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }
{ "@timestamp":"2019-01-27T14:46:12+00:00","connection":"540 (192.168.16.2:35918)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }
{ "@timestamp":"2019-01-27T14:46:12+00:00","connection":"541 (192.168.16.2:35920)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }
{ "@timestamp":"2019-01-27T14:46:12+00:00","connection":"542 (192.168.16.2:35922)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }
{ "@timestamp":"2019-01-27T14:46:23+00:00","connection":"538 (192.168.16.2:35914)","connection_requests_count":"2","request":"GET /sleep/1 HTTP/1.1","request_time":"1.002","status":"200" }
{ "@timestamp":"2019-01-27T14:46:23+00:00","connection":"539 (192.168.16.2:35916)","connection_requests_count":"2","request":"GET /sleep/1 HTTP/1.1","request_time":"1.002","status":"200" }
{ "@timestamp":"2019-01-27T14:46:23+00:00","connection":"540 (192.168.16.2:35918)","connection_requests_count":"2","request":"GET /sleep/1 HTTP/1.1","request_time":"1.002","status":"200" }
{ "@timestamp":"2019-01-27T14:46:23+00:00","connection":"541 (192.168.16.2:35920)","connection_requests_count":"2","request":"GET /sleep/1 HTTP/1.1","request_time":"1.002","status":"200" }
{ "@timestamp":"2019-01-27T14:46:23+00:00","connection":"542 (192.168.16.2:35922)","connection_requests_count":"2","request":"GET /sleep/1 HTTP/1.1","request_time":"1.002","status":"200" }

This is somewhat tangential to the FD leak we're discussing here, but I wanted to make sure we're not introducing another regression.

Regarding the FD leak, I think the best fix is to expose curl_multi_cleanup to Ruby-land and allow the user to call it at any time to release FDs. Something along the lines of:

URL = 'http://google.com'
multi = Curl::Multi.new
multi.add(Curl::Easy.new(URL))
multi.perform
# connections & FDs remain open and will be reused by next handle
multi.add(Curl::Easy.new(URL))
multi.perform
multi.close
# all connections & FDs are closed
multi.add(Curl::Easy.new(URL))
# => Curl::Err::MultiBadHandle: CURLError: Invalid multi handle
multi.perform
# => Curl::Err::MultiBadHandle: CURLError: Invalid multi handle

Each multi manages its own pool of connections and won't reuse FDs from another multi, so leaving sockets hanging until GC runs feels quite wasteful. We could also provide an option to autoclose a handle when perform returns (maybe Curl::Multi.autoclose = true); it could be convenient to set it once (e.g. in a Rails initializer) and never worry about it again.
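
A minimal sketch of how that proposed option could look from the caller's side (Curl::Multi.autoclose is the API being proposed here, not something released at the time of this comment; the URL is a placeholder):

require 'curb'

Curl::Multi.autoclose = true # proposed global switch, set once (e.g. in a Rails initializer)

multi = Curl::Multi.new
multi.add(Curl::Easy.new('http://example.com/')) # placeholder URL
multi.perform
# with autoclose enabled, connections and FDs would be released as soon as perform returns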

I think it should be a fairly simple and safe fix, so let me know whether it's feasible.

taf2 commented 5 years ago

@robuye great analysis. I like the idea of Curl::Multi.autoclose = true and exposing multi.close. Maybe something like the following: https://github.com/taf2/curb/tree/autoclose_close

I agree we need some regression tests for connection keep-alive and connection leaking.
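
A rough, Linux-only sketch of what such a leak regression test might look like (counting entries in /proc/self/fd and hitting a hypothetical local server are assumptions, not the project's actual test suite):

require 'curb'
require 'test/unit'

class TestFdLeak < Test::Unit::TestCase
  def open_fd_count
    Dir.children('/proc/self/fd').size # Linux-specific count of open file descriptors
  end

  def test_close_releases_fds
    before = open_fd_count
    multi = Curl::Multi.new
    multi.add(Curl::Easy.new('http://127.0.0.1:9292/')) # hypothetical local test server
    multi.perform
    multi.close # the proposed API: release connections without waiting for GC
    assert_operator open_fd_count, :<=, before + 1
  end
end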

robuye commented 5 years ago

Thanks! Your implementation is better than what I came up with. Cool idea with re-initializing @requests.

I ran my gist on your branch and it seems to be working exactly as expected. Nice work :+1:

pts-owentran commented 5 years ago

Great work, Rob and Todd. Thank you both so much for investing the time to hunt this down. Happy to test whenever the fix is in. I was looking at the behavior from a year back, but it's still odd that I couldn't reproduce the issue at will. It must have been some GC timing issue combined with the way my app uses the Parallel gem to create these connections in threads.

taf2 commented 5 years ago

@pts-owentran thanks, we've pushed 0.9.8 to RubyGems. If you have a chance to run this, let me know. I think with the updates to include a close method and autoclose, we're actually close to a 1.x release 😄