I could test it only on Ubuntu 16.04 on EC2, and I see leaked FDs there. That's on 0.9.7 only, but it's probably irrelevant. Same on Docker centos-ruby (2.6).
I will see if I can debug this further over the weekend.
Yeah, I must not be testing this correctly then... Here's a gist I created to help me run the loop and also monitor netstat... maybe I'm not filtering netstat correctly? Or maybe it's something about how I'm running it...
https://gist.github.com/taf2/683ca2f9cec226de44c8f992b1ca5cc2
Okay, typical me! I was running 0.9.2 🤯
Anyway, the test script in the gist should still be helpful in fixing this.
Hah, that's happened to me too often. I noticed that calling GC.start closes all the FDs, if that's of any help.
We can use this as a regression test... we should also figure out how to get netstat working on BSD systems, for macOS users.
When I modify the test script to call GC.start after m.perform, the connection count stays in the 2-3 range... so when GC runs, connections are cleaned up. The trade-off here is that we can either wait and let GC close the connections, or keep the connections around because the end user may want to reuse them...
I'm thinking we might need an explicit option to either keep the connection open so it can be reused, or close it right away...
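For reference, here's roughly the shape of that loop with the GC.start call added (simplified; the URL and handle counts are placeholders, not the exact gist contents):

require 'curb'

100.times do
  m = Curl::Multi.new
  2.times { m.add(Curl::Easy.new("http://127.0.0.1/")) } # assumed local endpoint
  m.perform
  GC.start # finalizers run curl_multi_cleanup on the unreferenced multis, closing their FDs
end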
Yeah, I was thinking something along those lines, but I don't understand why this changed as of 0.9.4, and I'm not sure the connections are actually reused as they should be :thinking:
Restoring these lines from e3126a fixes the problem:
curl_multi_cleanup(rbcm->handle); /* libcurl closes all cached connections/FDs here */
rbcm->handle = NULL;              /* the handle must not be used after cleanup */
The FDs go down to zero over ~30 seconds. That delay is interesting; it looks like a timeout is kicking in?
Bear in mind this isn't a real fix, because if it stops curb from reusing connections, that's a pretty serious regression.
Actually, looking at the description of this commit, it seems to be exactly what you were looking for, @taf2: closing FDs without waiting for GC.
Calling it at the end of perform seems reasonable, because at that point all the transfers have stopped.
The commit you added included more changes, though, and perhaps those would prevent curb from reusing connections.
The test code we were using creates 100 multi handles, each with 2 easy handles, so it doesn't show whether connections are reused; a high number of FDs is also expected there. I will try to come up with another test to see whether we reuse connections or not.
@robuye yeah, I think the reuse case is when doing something simple like:
res = Curl.get("https://www.google.com/")
...
res = Curl.get("https://www.google.com/") # should re-use existing connection ?
I think this would create separate multis, each with its own pool, so they wouldn't be able to reuse connections; at least that's my understanding of libcurl.
I was thinking we'd need multiple requests in a single multi, all to the same host (ideally over HTTPS, where we pay a massive overhead per connection), and slow enough to respond that the pool fills up and reuse kicks in, as in the sketch below.
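Roughly something like this (a sketch only; the slow endpoint URL is a placeholder):

require 'curb'

multi = Curl::Multi.new
5.times do
  easy = Curl::Easy.new("https://example.com/sleep/1") # assumed slow endpoint
  easy.on_complete { |e| puts "completed: #{e.response_code}" }
  multi.add(easy)
end
multi.perform # finished connections stay in this multi's cache, ready for reuse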
I think I have figured it out, and I also have some food for thought.
So libcurl will try to hold a connection open for as long as possible, and it will reuse cached connections when it can. See this thread from the libcurl mailing list; it goes into detail about this behaviour, and Daniel explains it much better than I could.
In Ruby's case we keep the FDs open until the multi handle is GCed. GC calls curl_multi_cleanup(rbcm->handle), and then libcurl closes all the FDs. Curb hooks this into the multi finalizer, which ties the FDs' lifetime to GC runs.
Note that once curl_multi_cleanup is called, the multi handle cannot be reused. Hooking it into the finalizer saves curb a lot of trouble because it always works regardless of how the handle is used; it just doesn't behave well in long-running, rarely-GCed processes.
The code I used to test this behaviour is significantly longer, so I will just link a gist here. It demonstrates how connections are reused and shows some interesting results, so give it a shot.
I have enabled multi.max_connects = 1 there, which causes the multi handle to keep at most 1 connection in its cache, so this alone can mitigate leaked FDs. It can't be 0, so at least 1 FD per multi will still leak (until GC kicks in). This may or may not work for you, @pts-owentran & @morhekil, depending on how you use the multi handle (note that curb always uses a multi handle, even when you use the easy interface, and there is no way around this).
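To illustrate, the mitigation is a single setter on the handle (a sketch; the URL is a placeholder):

require 'curb'

multi = Curl::Multi.new
multi.max_connects = 1 # keep at most 1 connection in this multi's cache
multi.add(Curl::Easy.new("http://127.0.0.1/")) # assumed local endpoint
multi.perform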
I pointed curb to my local server to minimize any network latency and get better insight into connections. Here's the script output:
System FDs: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
These FDs will be ignored.
Additional FDs open before start (0): []
Starting multi
Transfer 1 completed. Open FDs (2): [12, 13]
FD 12; socket: 192.168.16.2:35860 => 192.168.16.3:80
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80
Transfer 2 completed. Open FDs (2): [12, 13]
FD 12; socket: 192.168.16.2:35860 => 192.168.16.3:80
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80
Transfer 3 completed. Open FDs (2): [12, 13]
FD 12; socket: 192.168.16.2:35860 => 192.168.16.3:80
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80
Transfer 4 completed. Open FDs (1): [13]
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80
Transfer 5 completed. Open FDs (1): [13]
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80
All transfers completed
Open FDs after multi completed (1): [13]
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80
Sleeping 10 seconds...
..........
Open FDs after sleep (1): [13]
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80
Add 5 new transfers
FDs open before start (1): [13]
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80
Starting multi
Transfer 1 completed. Open FDs (1): [12]
FD 12; socket: 192.168.16.2:35870 => 192.168.16.3:80
Transfer 2 completed. Open FDs (1): [12]
FD 12; socket: 192.168.16.2:35870 => 192.168.16.3:80
Transfer 3 completed. Open FDs (1): [12]
FD 12; socket: 192.168.16.2:35870 => 192.168.16.3:80
Transfer 4 completed. Open FDs (1): [12]
FD 12; socket: 192.168.16.2:35870 => 192.168.16.3:80
Transfer 5 completed. Open FDs (1): [12]
FD 12; socket: 192.168.16.2:35870 => 192.168.16.3:80
All transfers completed
Open FDs after multi completed (1): [12]
FD 12; socket: 192.168.16.2:35870 => 192.168.16.3:80
And here's what happened according to the server:
{ "@timestamp":"2019-01-27T14:37:22+00:00","connection":"524 (192.168.16.2:35854)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.001","status":"200" }
{ "@timestamp":"2019-01-27T14:37:22+00:00","connection":"525 (192.168.16.2:35856)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.001","status":"200" }
{ "@timestamp":"2019-01-27T14:37:22+00:00","connection":"526 (192.168.16.2:35858)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.001","status":"200" }
{ "@timestamp":"2019-01-27T14:37:22+00:00","connection":"527 (192.168.16.2:35860)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }
{ "@timestamp":"2019-01-27T14:37:22+00:00","connection":"528 (192.168.16.2:35862)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }
{ "@timestamp":"2019-01-27T14:37:33+00:00","connection":"528 (192.168.16.2:35862)","connection_requests_count":"2","request":"GET /sleep/1 HTTP/1.1","request_time":"1.001","status":"200" }
{ "@timestamp":"2019-01-27T14:37:33+00:00","connection":"529 (192.168.16.2:35864)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.001","status":"200" }
{ "@timestamp":"2019-01-27T14:37:33+00:00","connection":"530 (192.168.16.2:35866)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.001","status":"200" }
{ "@timestamp":"2019-01-27T14:37:33+00:00","connection":"531 (192.168.16.2:35868)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }
{ "@timestamp":"2019-01-27T14:37:33+00:00","connection":"532 (192.168.16.2:35870)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }
All 10 requests were made by curb, but only 1 reused a connection. When I removed multi.max_connects = 1, the multi kept the initial 5 FDs open and reused them for the second batch of easy handles:
{ "@timestamp":"2019-01-27T14:46:12+00:00","connection":"538 (192.168.16.2:35914)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.001","status":"200" }
{ "@timestamp":"2019-01-27T14:46:12+00:00","connection":"539 (192.168.16.2:35916)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }
{ "@timestamp":"2019-01-27T14:46:12+00:00","connection":"540 (192.168.16.2:35918)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }
{ "@timestamp":"2019-01-27T14:46:12+00:00","connection":"541 (192.168.16.2:35920)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }
{ "@timestamp":"2019-01-27T14:46:12+00:00","connection":"542 (192.168.16.2:35922)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }
{ "@timestamp":"2019-01-27T14:46:23+00:00","connection":"538 (192.168.16.2:35914)","connection_requests_count":"2","request":"GET /sleep/1 HTTP/1.1","request_time":"1.002","status":"200" }
{ "@timestamp":"2019-01-27T14:46:23+00:00","connection":"539 (192.168.16.2:35916)","connection_requests_count":"2","request":"GET /sleep/1 HTTP/1.1","request_time":"1.002","status":"200" }
{ "@timestamp":"2019-01-27T14:46:23+00:00","connection":"540 (192.168.16.2:35918)","connection_requests_count":"2","request":"GET /sleep/1 HTTP/1.1","request_time":"1.002","status":"200" }
{ "@timestamp":"2019-01-27T14:46:23+00:00","connection":"541 (192.168.16.2:35920)","connection_requests_count":"2","request":"GET /sleep/1 HTTP/1.1","request_time":"1.002","status":"200" }
{ "@timestamp":"2019-01-27T14:46:23+00:00","connection":"542 (192.168.16.2:35922)","connection_requests_count":"2","request":"GET /sleep/1 HTTP/1.1","request_time":"1.002","status":"200" }
This is somewhat tangential to the FD leak we're discussing here, but I wanted to make sure we're not introducing another regression.
Regarding the FD leak, I think the best fix is to expose curl_multi_cleanup to Ruby-land and allow the user to call it at any time to release the FDs. Something along these lines:
URL = 'http://google.com'
multi = Curl::Multi.new
multi.add(Curl::Easy.new(URL))
multi.perform
# connections & FDs remain open and will be reused by next handle
multi.add(Curl::Easy.new(URL))
multi.perform
multi.close
# all connections & FDs are closed
multi.add(Curl::Easy.new(URL))
# => Curl::Err::MultiBadHandle: CURLError: Invalid multi handle
multi.perform
# => Curl::Err::MultiBadHandle: CURLError: Invalid multi handle
Each multi manages its own pool of connections and won't reuse FDs from another multi, so leaving sockets hanging until GC runs feels quite wasteful. We could also provide an option to autoclose the handle when perform returns (maybe Curl::Multi.autoclose = true); it could be convenient to set it once (e.g. in a Rails initializer) and never worry about it again.
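Usage could then be as simple as this (a sketch of the proposal; autoclose doesn't exist yet at this point):

# e.g. in config/initializers/curb.rb
Curl::Multi.autoclose = true # every multi would release its connections/FDs when perform returns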
I think it should be a fairly simple and safe fix, so let me know if it's feasible.
@robuye great analysis. I like the idea of Curl::Multi.autoclose = true and of exposing multi.close. Maybe something like the following: https://github.com/taf2/curb/tree/autoclose_close
I agree we need some regression tests for connection keep-alive and connection leaking.
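For the leak side, a rough sketch of what such a test could look like (hypothetical: it assumes a local test server, Linux's /proc for FD counting, and the multi.close proposed above):

require 'curb'

before = Dir.children("/proc/self/fd").size # Linux-only, approximate FD count
multi  = Curl::Multi.new
multi.add(Curl::Easy.new("http://127.0.0.1/")) # assumed local test server
multi.perform
multi.close # proposed API: releases connections/FDs immediately
after = Dir.children("/proc/self/fd").size
raise "leaked #{after - before} FD(s)" if after > before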
Thanks! Your implementation is better than what I came up with. Cool idea with re-initializing @requests.
I ran my gist on your branch and it seems to be working exactly as expected. Nice work :+1:
Great work, Rob and Todd. Thank you both so much for investing the time to hunt this down. Happy to test whenever the fix is in. I was looking at the behavior from a year back, but it's still odd that I couldn't reproduce the issue at will. It must have been some GC timing issue, combined with the way my app uses the Parallel gem to create these connections in threads.
@pts-owentran thanks, we've pushed 0.9.8 to RubyGems. If you have a chance to run it, let me know. I think with the updates to include a close method and autoclose, we're actually close to a 1.x release 😄
This code leaks connections.
After some time I am getting errors from Sidekiq:
Error fetching job: Too many open files - getaddrinfo
$ netstat -an | awk '/tcp/ {print $6}' | sort | uniq -c
$ lsof -p 6245 | grep CLOSE_WAIT | wc -l
A part of the lsof output: