hannesm / tlstunnel-lwt

TLS tunnel -- an alternative to stud / stunnel
BSD 2-Clause "Simplified" License
83 stars 7 forks source link

accept leaks exception #15

Closed avsm closed 9 years ago

avsm commented 9 years ago
tlstunnel: internal error, uncaught exception:
           Unix.Unix_error(Unix.EMFILE, "accept", "")
           Raised at file "src/core/lwt.ml", line 788, characters 22-23
           Called from file "src/unix/lwt_main.ml", line 34, characters 8-18
hannesm commented 9 years ago

from accept (2): [EMFILE] The per-process descriptor table is full -- does this mean there is still some file descriptor leakage?

hannesm commented 9 years ago

and, what does ulimit -n say on the system?

avsm commented 9 years ago
ulimit -n
1024

On 25 Apr 2015, at 13:48, Hannes Mehnert notifications@github.com wrote:

and, what does ulimit -n say on the system?

— Reply to this email directly or view it on GitHub.

avsm commented 9 years ago

Yeah there wasn't that much load on the system. It would certainly be good to just log the refused connection and not bomb out. How about a background logging thread that records the number of active file descriptors? Then I can keep an eye to make sure it's not increasing steadily over time.

hannesm commented 9 years ago

can you try 9d54c98c2da52e752d3282a32b087b1afcb4822a with --logffds, please? in the meantime I rewrote the failure paths (and tested locally with 100 concurrent connections, no fd leaks)

hannesm commented 9 years ago

(I wanted to also ask the kernel/OS how many fds my application currently consumes, but there doesn't seem to be any POSIX API for this (on linux, people use sth in /proc); hints appreciated)...

dsheets commented 9 years ago

POSIX systems are supposed to allocate the next available fd so you could create an fd and then convert it to an int...

hannesm commented 9 years ago

@dsheets thx, done in 001eae9573566a70fc1ad334f894bd9d641eeb10

dsheets commented 9 years ago

Hmm... this is specified at http://pubs.opengroup.org/stage7tc1/functions/open.html (search "lowest") but not http://pubs.opengroup.org/stage7tc1/functions/socket.html. I found a similar discrepancy in the man pages. I don't know what happens in practice for other fd calls (but I believe and guess and hope that they have the same behavior).

avsm commented 9 years ago

Looks stable so far;

[17:05:08] fds: count 50, next 7, active 6, open 6, closed 16, aborted 0
[17:05:11] x.x.x.x:25281: connection closed read 408 bytes, wrote 65340 bytes
[17:06:08] fds: count 51, next 6, active 5, open 5, closed 2, aborted 0
hannesm commented 9 years ago

@dsheets thx again, changed to open @avsm cool. I have not guarded the accept call in tlstunnel (because if the process is out of file descriptors, there's nothing reasonable to do for me anyways)

avsm commented 9 years ago

@hannesm if accept fails, just log and continue. Otherwise the entire process dies (including the other 1024 connections and all future attempts!)

avsm commented 9 years ago

it might result in the process spinning if accept continuously fails, but that's still better than just bombing out

hannesm commented 9 years ago

@avsm done

avsm commented 9 years ago

After a few hours:

[17:48:09] fds: count 635, next 6, active 90, open 90, closed 10, aborted 0
[17:49:09] fds: count 635, next 6, active 90, open 90, closed 0, aborted 0

Those active 90 look too high... there's no apparent traffic going on at the moment (apart form the odd connection)

dsheets commented 9 years ago

I was wrong earlier and hadn't thought through your use case... the next fd stuff only works if you never close fds and thus don't make holes in the fd-space. Sorry for the distraction.

hannesm commented 9 years ago

does sockstat (or netstat) tell whether there are actually 90 connections to 443 or the backend service? it might be sensible to pass through a timeout for the sockets!?

avsm commented 9 years ago

Yep, looks like there are roughly the same number of connections all stuck in a CLOSE_WAIT state. So a socket isn't being fully closed somewhere.

hannesm commented 9 years ago

does this number increase over time? from Lwt_unix: | Closed (* The file descriptor has been closed by val Lwt_unix.close. It must not be used for any operation. *)

unclear what to do with those sockets, apart from waiting for the timeout!?

hannesm commented 9 years ago

and are those CLOSE_WAIT connections between front and back, or front and hosts on the internet?

pqwy commented 9 years ago

Btw -- the easiest way to check the number of open fds on Linux (and Linux only) would be to count the number of entries in the directory /proc/self/fd.

avsm commented 9 years ago

@hannesm they are from the server to the client; the internal ones seem to close fine. e.g.

tcp        0      0 128.232.97.50:443       x.x.x.x:39365     CLOSE_WAIT 

Where x.x.x.x is a public Internet client.

hannesm commented 9 years ago

ok, then there is somewhere a missing close in the TLS part of the code (either here or in tls_lwt)...

avsm commented 9 years ago

lovely, no leaks at all anymore.

[08:43:46] fds: count 6467, active 1, open 1, closed 6, aborted 0

Leaving this open in case there's an upstream fix needed in tls_lwt, but I'm happy now!

hannesm commented 9 years ago

tls_lwt changed its semantics slightly, I have a branch https://github.com/hannesm/tlstunnel/tree/tls-nocrypto-master which adjusts to those changes (will merge once we have releases)