Unexpected error in accept: :emfile

woylie commented 1 year ago

We're getting a lot of these errors:

RuntimeError: Unexpected error in accept: :emfile
  File "lib/thousand_island/acceptor.ex", line 27, in ThousandIsland.Acceptor.accept/5
  File "lib/task/supervised.ex", line 89, in Task.Supervised.invoke_mfa/2
  File "proc_lib.erl", line 240, in :proc_lib.init_p_do_apply/3

Paired with:

** (exit) exited in: GenServer.call(:undefined, :acceptor_info, 5000)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
  File "lib/gen_server.ex", line 1027, in GenServer.call/3
  File "lib/thousand_island/acceptor.ex", line 10, in ThousandIsland.Acceptor.run/1
  File "lib/task/supervised.ex", line 89, in Task.Supervised.invoke_mfa/2
  File "proc_lib.erl", line 240, in :proc_lib.init_p_do_apply/3

This happens in a Phoenix application with thousand island 0.6.5 and bandit 0.7.4.

Any indication what could cause this?

moogle19 commented 1 year ago

:emfile means "Too many open files" (see here). It looks like you have more connections than available file descriptors. You could try to increase the file descriptor limit for your server.

woylie commented 1 year ago

We increased the file descriptor limit before.

For more context, we saw the memory usage continuously increasing over time, without resources ever being freed. After changing the adapter back to Cowboy, we don't see this happening anymore, and we don't see any errors indicating too many connections.

moogle19 commented 1 year ago

Could you pin down which resources wouldn't get freed? I would assume Ports / Sockets, but it would be nice to have further informations.

mtrudel commented 1 year ago

I'm extremely interested / perplexed by this!

My first hunch here is that there's some aspect of your situation that's causing connections to not be closed off once you're done with them. My initial guess would be that there may be a bug in our HTTP/1 keepalive logic that's causing us to not be as aggressive as Cowboy in closing off lingering keep alive clients.

Some questions to help diagnose:

What mix of endpoints / transports are you using? HTTP/HTTPS? 1.1/2? WebSockets?
What config are you starting Bandit / Cowboy up with?
Do you see any other Bandit / Thousand Island messages in your logs (specifically timeout messages)?
Do the errors happen over time, all at once, after a certain number of connections? Anything we can do to try and repro?
What sort of connection rate (distinct connections / second, requests per connection, etc) does the system in question serve?
Are you able to share methods & charts that demonstrate 'memory usage continuously increasing over time'? Do you happen to have a metric available for total number of VM processes as well? Can you share that?

(on the off chance that you're able to share aspects of the project causing this, it would help immensely in diagnosing this).

Looking forward to hearing more!

woylie commented 1 year ago

* What mix of endpoints / transports are you using? HTTP/HTTPS? 1.1/2? WebSockets?

The endpoint serves HTTP 1.1 and WebSockets (LiveView) behind an Envoy proxy that terminates TLS.

* What config are you starting Bandit / Cowboy up with?

Only defaults.

* Do you see any other Bandit / Thousand Island messages in your logs (specifically timeout messages)?
* Do the errors happen over time, all at once, after a certain number of connections? Anything we can do to try and repro?
* What sort of connection rate (distinct connections / second, requests per connection, etc) does the system in question serve?
* Are you able to share methods & charts that demonstrate 'memory usage continuously increasing over time'? Do you happen to have a metric available for total number of VM processes as well? Can you share that?

I'll have to get back to you for that. We're pretty busy at the moment, and I'll be out of office from next week, so it may take a while. The only thing I can give you right now is this 24h error distribution chart:

From top to bottom:

GenServer.stop(:undefined, :normal, :infinity) ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started (no further stacktrace)
** (exit) exited in: GenServer.call(:undefined, :acceptor_info, 5000)
Unexpected error in accept: :emfile

We also saw a couple of ** (exit) "Not a valid WebSocket upgrade request") errors.

mtrudel commented 1 year ago

When you're running under Cowboy, do you ever see log messages like the following?

Ranch acceptor reducing accept rate: out of file descriptors

Looking at where this error is happening in Thousand Island, the corresponding implementation in Ranch explicitly handles :emfile errors by waiting for 100ms and retrying an accept call. Ranch and Thousand Island are very similar in structure in this regard, so I would expect that Ranch is seeing the same :emfile error we are, they're just handling it.

If this is the case, I'll add a similar clause to Thousand Island (though I may handle it a tad differently; busy waiting without end like Ranch does is a tad sketch).

woylie commented 1 year ago

I didn't find that message in our logs.

mtrudel commented 1 year ago

I've added connection count limiting to Thousand Island (released as 0.6.6). This doesn't fix the core of the issue (why you're seeing increased memory usage & emfile errors), but it should help mitigate the issue for others via configuration levers, and brings us up to parity with ranch's support for this.

I'm still looking into the underlying issue, but I'm not able to reproduce anything locally. I've run a bunch of long-running soak tests on the HTTP/1, h2, and WebSocket stacks and I'm seeing what I would expect: no process growth and steady long-term memory usage. I still think that there's something in your setup that's causing connections to be held open longer than expected. This would explain everything you're seeing (growing memory use over time, and an eventual exhaustion of open file handles). My biggest suspects here are keepalive handling and/or a possible bug with handling invalid websocket upgrades.

To help debug this further, would you be able to provide:

Any Bandit / Thousand Island messages in your logs (specifically timeout messages) around / prior to this happening
A rough idea of connection rates, connection duration, HTTP/1 vs websocket breakdown / overall load & box size etc, so I know what scale of problem this may be
Can you quantify 'memory growing over time'? How fast does it grow, and by how much?

In the meantime, if you're game to help debugging this, you could try adding [http_1_options: [max_requests: 1]] to your bandit config (this would manifest in your Phoenix app's config.exs (or env sub-config) like so):

config :my_app, MyAppWeb.Endpoint,
  http: [...(port: xxx, other options)... , http_1_options: [max_requests: 1]]
  ...

This will cause Bandit's HTTP/1 stack to close the connection after each request, effectively disabling keepalives. If you could run this for a long enough period of time that you would normally have seen an error and see if it helps matters, it would be a great place to help bisect this error.

Thanks!

ringofhealth commented 1 year ago

okay I ran this config

this is my config.exs

config :myapp, AppWeb.Endpoint,
  adapter: Bandit.PhoenixAdapter,
  url: [host: "localhost"],

my dev.exs

config :myapp, AppWeb.Endpoint,
  http: [ip: {127, 0, 0, 1}, port: 4000],
  http_1_options: [max_requests: 1],

with that http_1_option on, and it was good until I swap out the config

after I commented that option out, and reran the app, the VM blew up within 5-10 min with

                     []},
                 {'Elixir.Kernel',inspect,2,
                     [{file,"lib/kernel.ex"},{line,2254}]},
                 {'Elixir.Logger.Handler',do_log,4,
                     [{file,"lib/logger/handler.ex"},{line,141}]},
                 {'Elixir.Logger.Handler',log,2,
                     [{file,"lib/logger/handler.ex"},{line,84}]}]}
[warning] Attempted to log 140761 messages, which is above :discard_threshold
[warning] Attempted to log 0 messages, which is below :discard_threshold

ringofhealth commented 1 year ago

I am using the latest build 1.0.0-pre2, let me know if there is more diagnostic I can run to help you narrow this down

ringofhealth commented 1 year ago

aight something wierd is going on. so I originally ran it with the http_1_option, on then I turn it off and ran into a bunch of emfile error. Then I took out the option swapped out bandit and went back to cowboy.

while I was using other part of the application I ran into the emfile error, which has never happened before. I deleted my entire build dir and recompiled withoht bandit. we will see what happens

it seems that it had some side effect on the erlang installaton in asdf dir

mtrudel commented 1 year ago

@ringofhealth a couple of things that would be good to know:

What mix of endpoints / transports are you using? HTTP/HTTPS? 1.1/2? WebSockets?
What sort of clients are they? Are they direct from a browser or are they being forwarded through a reverse proxy? Which one if so?
Roughly how many connections per second are you handling? How long / how many connections does it take to start seeing emails errors?

mtrudel commented 1 year ago

@ringofhealth I'm confused about your indication of max_requests: 1 solving the problem. Are you saying that:

When you used a default Bandit config (ie: WITHOUT max_requests: 1, you saw the issue occur
When you added max_requests: 1 to your config, it fixed the problem and the issue did not recur?

ringofhealth commented 1 year ago

@ringofhealth a couple of things that would be good to know:

What mix of endpoints / transports are you using? HTTP/HTTPS? 1.1/2? WebSockets?

What sort of clients are they? Are they direct from a browser or are they being forwarded through a reverse proxy? Which one if so?

Roughly how many connections per second are you handling? How long / how many connections does it take to start seeing emails errors?

this was all during local development, so over HTTP only, and with websockets for liveview 0.18.18. I am not sure if its loading HTTP/2, just using default configs
no reverse proxy, just local dev with local chrome
it was local dev env, so just my 1 user was using it

ringofhealth commented 1 year ago

@ringofhealth I'm confused about your indication of max_requests: 1 solving the problem. Are you saying that:

When you used a default Bandit config (ie: WITHOUT max_requests: 1, you saw the issue occur

When you added max_requests: 1 to your config, it fixed the problem and the issue did not recur?

yes I was confused about this too, let me walk you through the timelines

-added bandited, with max_request:1 config on

this seemingly was good for a while, used it for about and hour to 2. Then I decided to try it without max_request: 1 setting

within 5 - 10 min, the error started popping up.
went back to max_config:1
error still popped up after
swapped out the adaptor line for bandit for cowboy
started running some heavy IO bound job for my application locally
EM error popped up for that as well
deleted my deps dir, removed bandit completed from mix and recompiled.
have not tried the jobs yet, but cowboy is stable

mtrudel commented 1 year ago

@ringofhealth those are some very interesting points! I've tried to reproduce this locally and can't seem to (despite working in a Phoenix app for an hour or so, I'm not seeing anything unexpected in any process or file metrics). A few questions:

What OTP/Erlang are you using (copying & pasting the startup banner from an iex session would provide everything I'd need to know)?
What OS are you on? The output of uname -a would be great!
If you're able to, it would be great if you could:
- Add adapter: Bandit.PhoenixAdapter to your config, WITHOUT any special options such as max_requests
- rm -rf _build just to ensure you get any old artifacts out of the way
- Provide screenshots of http://localhost:4000/dev/dashboard at startup, and after a few minutes of development (you'd indicated that the error occurs after 5-10 minutes of working, so maybe grab a screenshot after 2-3 minutes)
- If you're able to, provide a screenshot after the error starts occurring

Thanks so much!

ringofhealth commented 1 year ago

aight will do! Ill start some test session this weekend

mtrudel commented 1 year ago

I think I've figured this out. Point release coming in the next few hours with a fix for testing!

mtrudel commented 1 year ago

I just released Bandit 1.0.0-pre.3 with a fix for this issue (I hope!)

Please update to 1.0.0-pre.3 or newer of Bandit and ensure that the issue no longer occurs. You DO NOT need to use max_requests: 1 any longer.

ringofhealth commented 1 year ago

Thank you for digging into this. It does look like the lib is not throwing out emfile errors.

I do notice this issue on local dev mode

not sure why this is coming up now

[info] GET /live/longpoll
[debug] ** (Phoenix.Router.NoRouteError) no route found for GET /live/longpoll (HelioWeb.Router)
    (helio 0.1.0) lib/phoenix/router.ex:487: HelioWeb.Router.call/2
    (helio 0.1.0) lib/helio_web/endpoint.ex:1: HelioWeb.Endpoint.plug_builder_call/2
    (helio 0.1.0) lib/plug/debugger.ex:136: HelioWeb.Endpoint."call (overridable 3)"/2
    (helio 0.1.0) lib/helio_web/endpoint.ex:1: HelioWeb.Endpoint.call/2
    (phoenix 1.7.2) lib/phoenix/endpoint/sync_code_reload_plug.ex:22: Phoenix.Endpoint.SyncCodeReloadPlug.do_call/4
    (bandit 1.0.0-pre.3) lib/bandit/pipeline.ex:89: Bandit.Pipeline.call_plug/2
    (bandit 1.0.0-pre.3) lib/bandit/pipeline.ex:27: Bandit.Pipeline.run/6
    (bandit 1.0.0-pre.3) lib/bandit/http1/handler.ex:29: Bandit.HTTP1.Handler.handle_data/3
    (bandit 1.0.0-pre.3) lib/bandit/delegating_handler.ex:18: Bandit.DelegatingHandler.handle_data/3
    (bandit 1.0.0-pre.3) lib/thousand_island/handler.ex:355: Bandit.DelegatingHandler.handle_continue/2
    (stdlib 4.1.1) gen_server.erl:1123: :gen_server.try_dispatch/4
    (stdlib 4.1.1) gen_server.erl:865: :gen_server.loop/7
    (stdlib 4.1.1) proc_lib.erl:240: :proc_lib.init_p_do_apply/3

[error] GenServer #PID<0.4812.0> terminating
** (Phoenix.Router.NoRouteError) no route found for GET /live/longpoll (HelioWeb.Router)
    (helio 0.1.0) lib/phoenix/router.ex:487: HelioWeb.Router.call/2
    (helio 0.1.0) lib/helio_web/endpoint.ex:1: HelioWeb.Endpoint.plug_builder_call/2
    (helio 0.1.0) lib/plug/debugger.ex:136: HelioWeb.Endpoint."call (overridable 3)"/2
    (helio 0.1.0) lib/helio_web/endpoint.ex:1: HelioWeb.Endpoint.call/2
    (phoenix 1.7.2) lib/phoenix/endpoint/sync_code_reload_plug.ex:22: Phoenix.Endpoint.SyncCodeReloadPlug.do_call/4
    (bandit 1.0.0-pre.3) lib/bandit/pipeline.ex:89: Bandit.Pipeline.call_plug/2
    (bandit 1.0.0-pre.3) lib/bandit/pipeline.ex:27: Bandit.Pipeline.run/6
    (bandit 1.0.0-pre.3) lib/bandit/http1/handler.ex:29: Bandit.HTTP1.Handler.handle_data/3
    (bandit 1.0.0-pre.3) lib/bandit/delegating_handler.ex:18: Bandit.DelegatingHandler.handle_data/3
    (bandit 1.0.0-pre.3) lib/thousand_island/handler.ex:355: Bandit.DelegatingHandler.handle_continue/2
    (stdlib 4.1.1) gen_server.erl:1123: :gen_server.try_dispatch/4
    (stdlib 4.1.1) gen_server.erl:865: :gen_server.loop/7
    (stdlib 4.1.1) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
Last message: {:continue, :handle_connection}
State: {%ThousandIsland.Socket{socket: #Port<0.117>, transport_module: ThousandIsland.Transports.TCP, read_timeout: 60000, span: %ThousandIsland.Telemetry{span_name: :connection, telemetry_span_context: #Reference<0.494451594.3792437251.136067>, start_time: -576460569945550400, start_metadata: %{parent_telemetry_span_context: #Reference<0.494451594.3792437249.137336>, remote_address: {127, 0, 0, 1}, remote_port: 57111, telemetry_span_context: #Reference<0.494451594.3792437251.136067>}}}, %{handler_module: Bandit.InitialHandler, http_1_enabled: true, http_2_enabled: true, opts: %{http_1: [], http_2: [], websocket: []}, plug: {Phoenix.Endpoint.SyncCodeReloadPlug, {HelioWeb.Endpoint, []}}, websocket_enabled: true}}

mtrudel commented 1 year ago

Yep; that's being worked upstream at https://github.com/phoenixframework/phoenix/issues/5445

In the meantime it's great to hear we've solved the original issue here; thanks for your help and input!

I'll wait a few days for @woylie to chime in before closing this issue.

ringofhealth commented 1 year ago

ah I see, thanks for clarifying. Going to deploy this to staging env and let it soak for a few days, will report back status

woylie commented 1 year ago

Thank you for digging into this. I just updated Bandit to 1.0.0-pre.3, and we'll be monitoring this. I'll let you know if this occurs again, and if it doesn't occur again, you'll hear from me in a week.

woylie commented 1 year ago

Sorry, forgot about this. We didn't encounter that issue again since switching back to Bandit.

mtrudel commented 1 year ago

Great to hear! Thanks again for the original issue!

mtrudel commented 1 year ago

@ringofhealth the leaking NoRouteError upstream issue has been merged - this should be fixed in the next Phoenix release.

mtrudel / thousand_island

Unexpected error in accept: :emfile #52