savi-lang / TCP

TCP networking implementation for the Savi standard library.
BSD 3-Clause "New" or "Revised" License

HTTPServer infinite loop on "too many open files" #2

Open · jemc opened this issue 2 years ago

jemc commented 2 years ago

On Linux, when using the examples/http server and testing it with a large number of simultaneous connections, I commonly see one thread get stuck in an infinite loop, and the server fails to accept more connections.

After stepping through the issue in lldb, I can see that the TCPListener actor gets stuck in an infinite loop when its call to pony_os_accept fails with EMFILE ("too many open files") as the errno. The TCPListener is currently written to naively assume that any accept failure is spurious, so it simply tries again. In this failure mode, retrying does not help - the call will keep failing with EMFILE - so the actor spins in an infinite retry loop.
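For illustration, here is a minimal sketch (in C, not the actual pony_os_accept source) of how an accept wrapper could distinguish fd-exhaustion errors like EMFILE/ENFILE, which will not resolve by retrying immediately, from genuinely transient errors like EINTR or ECONNABORTED. The names try_accept and accept_result_t and its variants are hypothetical, introduced only for this sketch.

```c
#include <errno.h>
#include <sys/socket.h>

// Hypothetical result codes for the caller (Savi/Pony side) to act on.
typedef enum {
  ACCEPT_OK,          // got a new connection fd
  ACCEPT_WOULD_BLOCK, // no pending connection; wait for the next event
  ACCEPT_RETRY,       // other failure; the caller decides whether to retry
  ACCEPT_FD_LIMIT     // EMFILE/ENFILE; retrying immediately cannot succeed
} accept_result_t;

// Sketch of an accept wrapper that reports *why* accept failed,
// instead of treating every failure as spurious.
static accept_result_t try_accept(int listen_fd, int *out_fd) {
  for (;;) {
    int fd = accept(listen_fd, NULL, NULL);
    if (fd >= 0) {
      *out_fd = fd;
      return ACCEPT_OK;
    }
    switch (errno) {
      case EINTR:
      case ECONNABORTED:
        continue;            // transient: retry the accept call itself
      case EAGAIN:
#if defined(EWOULDBLOCK) && EWOULDBLOCK != EAGAIN
      case EWOULDBLOCK:
#endif
        return ACCEPT_WOULD_BLOCK;
      case EMFILE:           // per-process fd limit reached
      case ENFILE:           // system-wide fd limit reached
        return ACCEPT_FD_LIMIT;
      default:
        return ACCEPT_RETRY;
    }
  }
}
```

With a distinct result like ACCEPT_FD_LIMIT, the listener could stop retrying and instead back off or pause accepting until a file descriptor is freed, rather than spinning.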

jemc commented 2 years ago

Interestingly, the naive code that causes this infinite loop was copied directly from Pony: https://github.com/ponylang/ponyc/blob/725fac75381c35da5803c83bd00fbd552261f099/packages/net/tcp_listener.pony#L209-L222

Our version of that same code is here: https://github.com/savi-lang/savi/blob/739b8abb2ab393218778066db775ed1124a74cfe/packages/src/TCP/Listener.savi#L78-L85

So I'd be surprised if the Pony server didn't have a similar bug/flaw in it. Unless there is some other way of addressing this case in the Pony server that I'm not noticing.

jemc commented 2 years ago

I convinced myself that Pony has the same bug, so I opened https://github.com/ponylang/ponyc/issues/4052 to discuss with the rest of the Pony team and decide whether this fix should be handled within the runtime C function (pony_os_accept) or in the Savi/Pony code that calls it.

jemc commented 1 year ago

One thing I noticed while refactoring the current code, which may explain how we get to the "too many open files" condition in the first place:

I think the code that uses io_deferred_action to let the listener know when an accepted child has been closed is not having the desired effect. It is supposed to reach the TCP.Listen.Engine so it can clean up the accepted child socket, but as currently written the IO.Action.ClosedChild only reaches the top-level IO.Actor - the engine(s) inside it never get a chance to react.
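As a rough illustration of that propagation gap (in C, with hypothetical names like listen_engine_t and actor_on_closed_child - this is not Savi's actual IO API): the deferred "child closed" notification currently stops at the outer actor, while the listener engine is the piece that actually tracks the accepted children and needs to release them. Forwarding the notification inward is the missing step.

```c
#include <stdio.h>

// Hypothetical stand-in for a listener engine that tracks accepted children.
typedef struct {
  int open_children; // count of accepted child sockets still open
} listen_engine_t;

// Hypothetical stand-in for the top-level IO actor that owns the engine.
typedef struct {
  listen_engine_t engine;
} io_actor_t;

// What the engine needs to do when a child closes: release its accounting
// so the listener can keep accepting without exhausting file descriptors.
static void engine_on_closed_child(listen_engine_t *engine) {
  engine->open_children--;
  printf("engine: child closed, %d still open\n", engine->open_children);
}

// The deferred-action handler on the outer actor. If it only updates the
// actor itself (as described above), the engine never learns the child
// closed; forwarding the notification inward fixes that.
static void actor_on_closed_child(io_actor_t *actor) {
  engine_on_closed_child(&actor->engine); // inside-out reaction
}

int main(void) {
  io_actor_t actor = { .engine = { .open_children = 3 } };
  actor_on_closed_child(&actor); // simulate one deferred ClosedChild action
  return 0;
}
```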

I likely need to come up with a different design for deferred actions that lets the inner engine(s) react to the deferred action from the inside out, like the normal reaction process. There are two ways I can see to go about this:

It's also possible these two ideas could/should be used in conjunction with one another to achieve the right design. Or perhaps there is another way of doing it.

Those are just my current thoughts about the issue.