Interestingly, the naive code that causes this infinite loop was copied directly from Pony: https://github.com/ponylang/ponyc/blob/725fac75381c35da5803c83bd00fbd552261f099/packages/net/tcp_listener.pony#L209-L222
Our version of that same code is here: https://github.com/savi-lang/savi/blob/739b8abb2ab393218778066db775ed1124a74cfe/packages/src/TCP/Listener.savi#L78-L85
So I'd be surprised if the Pony server didn't have a similar bug/flaw in it. Unless there is some other way of addressing this case in the Pony server that I'm not noticing.
I convinced myself that Pony has the same bug, so I opened https://github.com/ponylang/ponyc/issues/4052 to discuss with the rest of the Pony team and decide whether this fix should be handled within the runtime C function (`pony_os_accept`) or in the Savi/Pony code that calls it.
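Whichever layer ends up owning the fix, the key change is the same: stop treating every `accept` failure as spurious. Below is a minimal sketch in plain C (not the actual Savi/Pony code, and not `pony_os_accept` itself; the `accept_result_t` enum and `try_accept` helper are hypothetical names) of the kind of errno classification that layer would need:

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>

// Hypothetical outcome classification for one accept() attempt.
typedef enum {
  ACCEPT_OK,          // got a new connection fd
  ACCEPT_RETRY,       // spurious failure; safe to try again
  ACCEPT_WOULD_BLOCK, // no pending connection; wait for the next readable event
  ACCEPT_FATAL        // resource exhaustion or real error; do NOT spin on retry
} accept_result_t;

// Sketch of the decision the listener needs to make instead of treating
// every failure as spurious.
static accept_result_t try_accept(int listen_fd, int* out_fd) {
  int fd = accept(listen_fd, NULL, NULL);
  if (fd >= 0) {
    *out_fd = fd;
    return ACCEPT_OK;
  }

  switch (errno) {
    case EINTR:          // interrupted by a signal: genuinely spurious
    case ECONNABORTED:   // the pending connection went away: skip it
      return ACCEPT_RETRY;

    case EAGAIN:         // nothing to accept right now (non-blocking socket)
#if EWOULDBLOCK != EAGAIN
    case EWOULDBLOCK:
#endif
      return ACCEPT_WOULD_BLOCK;

    case EMFILE:         // per-process fd limit reached ("too many open files")
    case ENFILE:         // system-wide fd limit reached
    default:
      // Retrying immediately cannot succeed; surface the error so the
      // listener can back off, close idle children, or report failure.
      return ACCEPT_FATAL;
  }
}
```

The important distinction is that `EINTR`/`ECONNABORTED` are genuinely transient and `EAGAIN` just means "wait for the next event", while `EMFILE`/`ENFILE` will keep failing until descriptors are released, so retrying in a tight loop can never succeed.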
One thing I noticed while refactoring the current code, which may explain how we get to the "too many open files" condition in the first place:
I think the code that uses `io_deferred_action` to let the listener know when its accepted child has been closed is not having the desired effect of reaching the `TCP.Listen.Engine` to clean up the accepted child socket, because as currently written the `IO.Action.ClosedChild` will only reach the top-level `IO.Actor` - the engine(s) inside won't be able to react to it.
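To make that concrete, here is a purely illustrative C model (none of these types exist in Savi; the nested `engine` struct just stands in for `IO.Actor` wrapping `TCP.Listen.Engine`) contrasting the normal inside-out reaction path with the current deferred path, which stops at the outer actor and never runs the listener engine's cleanup:

```c
#include <stddef.h>

// Hypothetical stand-ins: an IO actor wraps a chain of nested engines.
typedef enum { IO_ACTION_CLOSED_CHILD } io_action_t;

typedef struct engine {
  const char* name;
  struct engine* inner;                       // next engine, nested inside
  void (*react)(struct engine*, io_action_t); // per-layer reaction hook
} engine_t;

// Normal reaction path: the action is handed to every layer, innermost first.
static void react_chain(engine_t* outer, io_action_t action) {
  if (outer == NULL) return;
  react_chain(outer->inner, action); // inner engines react first (inside out)
  outer->react(outer, action);       // then the outer layer reacts
}

// Current deferred-action path (the problem): only the top-level actor sees
// the action, so the inner listener engine's "clean up the accepted child
// socket" reaction never runs.
static void deferred_action_today(engine_t* outer, io_action_t action) {
  outer->react(outer, action); // top level only; outer->inner is skipped
}
```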
I likely need to come up with a different design for deferred actions that lets the inner engine(s) react to the deferred action from the inside out, like the normal reaction process. There are two ways to look at going about this:

1. Use `pony_asio_event_send` to send a "true" event which gets reacted to in the normal way (but how do we attach the specific `IO.Action`? do we even need to attach an action? or can we repurpose some existing field as the action code?) - see the sketch after this list.
2. Separate the categorization of the `IO.Action` from the code that reacts to it, so that each engine gets a chance to categorize first in the normal flow, and in the deferred flow we skip the categorization step.

It's also possible these two ideas could/should be used in conjunction with one another to achieve the right design. Or perhaps there is another way of doing it.
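To illustrate the first idea, assuming `pony_asio_event_send` keeps its current runtime shape (an event pointer, a `uint32_t` flags value, and a `uint32_t` arg), the `arg` field is one candidate for carrying the action code. Everything else below (the action-code enum, the flags placeholder, and the `notify_closed_child` helper) is hypothetical, sketched only to show the shape of the approach:

```c
#include <stdint.h>

// Declaration mirroring the Pony runtime entry point (libponyrt exposes it
// with this shape); repeated here only so the sketch is self-contained.
typedef struct asio_event_t asio_event_t;
extern void pony_asio_event_send(asio_event_t* ev, uint32_t flags, uint32_t arg);

// Hypothetical action codes: instead of attaching a full IO.Action object,
// the deferred action is encoded as a small integer carried in `arg`.
enum {
  IO_ACTION_NONE         = 0,
  IO_ACTION_CLOSED_CHILD = 1, // "an accepted child socket has been closed"
};

// Hypothetical helper: the closing child notifies the listener by sending a
// "true" ASIO event, so the listener reacts through its normal event path
// and every inner engine gets a chance to handle the action inside out.
static void notify_closed_child(asio_event_t* listener_event) {
  // The flags value here is a placeholder: a real design would have to pick
  // (or add) a flag that the Savi reaction code recognizes as "deferred
  // action delivery" rather than ordinary read/write readiness.
  uint32_t placeholder_flags = 0;
  pony_asio_event_send(listener_event, placeholder_flags, IO_ACTION_CLOSED_CHILD);
}
```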
Those are just my current thoughts about the issue.
On Linux, when using the `examples/http` server and testing it with a large number of simultaneous connections, I commonly see one thread get stuck in an infinite loop, and the server fails to accept more connections.

After stepping through the issue in `lldb`, I can see that the `TCPListener` actor gets stuck in an infinite loop when it tries to run `pony_os_accept` to accept a connection and that call fails with `EMFILE` ("too many open files") as the `errno`. The `TCPListener` is currently written to naively assume that any failure can be considered spurious, and it tries again to accept a connection. In this failure mode, trying again to accept a new connection will not help - it will always fail. Hence, it gets stuck in an infinite retry loop.