from accept(2): `[EMFILE] The per-process descriptor table is full` -- does this mean there is still some file descriptor leakage?
and, what does `ulimit -n` say on the system?
```
$ ulimit -n
1024
```
Yeah, there wasn't that much load on the system. It would certainly be good to just log the refused connection and not bomb out. How about a background logging thread that records the number of active file descriptors? Then I can keep an eye on it to make sure it's not increasing steadily over time.
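A minimal sketch of such a background logger with Lwt (the `log` and `count_fds` helpers are hypothetical):

```ocaml
open Lwt.Infix

(* Background thread: wake up once a minute and log how many file
   descriptors are currently in use. *)
let rec fd_logger log count_fds () =
  Lwt_unix.sleep 60. >>= fun () ->
  log (Printf.sprintf "fds: active %d" (count_fds ())) ;
  fd_logger log count_fds ()

(* started alongside the server: Lwt.async (fd_logger log count_fds) *)
```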
Can you try 9d54c98c2da52e752d3282a32b087b1afcb4822a with `--logffds`, please? In the meantime I rewrote the failure paths (and tested locally with 100 concurrent connections, no fd leaks).
(I also wanted to ask the kernel/OS how many fds my application currently consumes, but there doesn't seem to be any POSIX API for this (on Linux, people use something in /proc); hints appreciated)...
POSIX systems are supposed to allocate the next available fd, so you could create an fd and then convert it to an int...
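A sketch of that trick in OCaml (though, as noted further down, the lowest-fd value only tracks the count while there are no holes in the fd space):

```ocaml
(* Estimate open fds: open(2) returns the lowest free descriptor, so a
   freshly opened fd's integer value approximates the number already in
   use.  On Unix, OCaml's Unix.file_descr is the raw int underneath,
   hence the Obj.magic cast. *)
let estimate_fds () =
  let fd = Unix.openfile "/dev/null" [ Unix.O_RDONLY ] 0 in
  let n : int = Obj.magic fd in
  Unix.close fd ;
  n
```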
@dsheets thx, done in 001eae9573566a70fc1ad334f894bd9d641eeb10
Hmm... this is specified at http://pubs.opengroup.org/stage7tc1/functions/open.html (search "lowest") but not http://pubs.opengroup.org/stage7tc1/functions/socket.html. I found a similar discrepancy in the man pages. I don't know what happens in practice for other fd calls (but I believe and guess and hope that they have the same behavior).
Looks stable so far:

```
[17:05:08] fds: count 50, next 7, active 6, open 6, closed 16, aborted 0
[17:05:11] x.x.x.x:25281: connection closed read 408 bytes, wrote 65340 bytes
[17:06:08] fds: count 51, next 6, active 5, open 5, closed 2, aborted 0
```
@dsheets thx again, changed to `open`
@avsm cool. I have not guarded the `accept` call in tlstunnel (because if the process is out of file descriptors, there's nothing reasonable for me to do anyway)
@hannesm if `accept` fails, just log and continue. Otherwise the entire process dies (including the other 1024 connections and all future attempts!)
It might result in the process spinning if `accept` continuously fails, but that's still better than just bombing out.
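A minimal sketch of such a guarded accept loop with Lwt (`log` and `handle` are hypothetical):

```ocaml
open Lwt.Infix

(* Keep accepting even when accept fails (e.g. with EMFILE): log the
   exception and loop, instead of letting it kill the listener. *)
let rec accept_loop log handle socket =
  Lwt.catch
    (fun () ->
       Lwt_unix.accept socket >|= fun (fd, addr) ->
       Lwt.async (fun () -> handle fd addr))
    (fun exn ->
       log ("accept failed: " ^ Printexc.to_string exn) ;
       Lwt.return_unit) >>= fun () ->
  accept_loop log handle socket
```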
@avsm done
After a few hours:

```
[17:48:09] fds: count 635, next 6, active 90, open 90, closed 10, aborted 0
[17:49:09] fds: count 635, next 6, active 90, open 90, closed 0, aborted 0
```
Those 90 active fds look too high... there's no apparent traffic going on at the moment (apart from the odd connection).
I was wrong earlier and hadn't thought through your use case... the next fd stuff only works if you never close fds and thus don't make holes in the fd-space. Sorry for the distraction.
does `sockstat` (or `netstat`) tell whether there are actually 90 connections to 443 or the backend service? it might be sensible to pass through a timeout for the sockets!?
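A sketch of what such a per-socket timeout could look like with Lwt (`handle` is hypothetical):

```ocaml
(* Bound how long a connection may take: Lwt_unix.with_timeout cancels
   the handler and raises Lwt_unix.Timeout after [secs] seconds, at
   which point we make sure the socket really gets closed. *)
let handle_with_timeout secs handle fd =
  Lwt.catch
    (fun () -> Lwt_unix.with_timeout secs (fun () -> handle fd))
    (function
      | Lwt_unix.Timeout -> Lwt_unix.close fd
      | exn -> Lwt.fail exn)
```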
Yep, looks like there are roughly the same number of connections, all stuck in a `CLOSE_WAIT` state. So a socket isn't being fully closed somewhere.
does this number increase over time? from `Lwt_unix`:

```ocaml
| Closed
    (** The file descriptor has been closed by [Lwt_unix.close]. It must
        not be used for any operation. *)
```
unclear what to do with those sockets, apart from waiting for the timeout!?
and are those `CLOSE_WAIT` connections between front and back, or front and hosts on the internet?
Btw -- the easiest way to check the number of open fds on Linux (and Linux only) would be to count the number of entries in the directory `/proc/self/fd`.
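A minimal sketch of that in OCaml:

```ocaml
(* Linux only: /proc/self/fd has one entry per open descriptor.
   Sys.readdir holds the directory open while listing it, so subtract
   one for that transient fd. *)
let open_fd_count () =
  Array.length (Sys.readdir "/proc/self/fd") - 1
```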
@hannesm they are from the server to the client; the internal ones seem to close fine, e.g.

```
tcp 0 0 128.232.97.50:443 x.x.x.x:39365 CLOSE_WAIT
```

where `x.x.x.x` is a public Internet client.
ok, then there is a missing `close` somewhere in the TLS part of the code (either here or in `tls_lwt`)...
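One way to rule out such a leak, sketched here with Lwt (`handler` is hypothetical), is to finalize the descriptor so it gets closed no matter how the handler exits:

```ocaml
(* Lwt.finalize runs the close whether the handler returns or raises;
   a close that itself fails is swallowed. *)
let with_close handler fd =
  Lwt.finalize
    (fun () -> handler fd)
    (fun () ->
       Lwt.catch (fun () -> Lwt_unix.close fd) (fun _ -> Lwt.return_unit))
```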
lovely, no leaks at all anymore.

```
[08:43:46] fds: count 6467, active 1, open 1, closed 6, aborted 0
```
Leaving this open in case there's an upstream fix needed in `tls_lwt`, but I'm happy now!
`tls_lwt` changed its semantics slightly; I have a branch https://github.com/hannesm/tlstunnel/tree/tls-nocrypto-master which adjusts to those changes (will merge once we have releases)