Closed — woylie closed this issue 1 year ago
`:emfile` means "Too many open files" (see here).
It looks like you have more connections than available file descriptors. You could try to increase the file descriptor limit for your server.
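For reference, on a Linux/Unix host the per-process limit can be inspected (and raised) from a shell; the value below is illustrative, not a recommendation:

```shell
# Show the current soft and hard limits on open file descriptors
# for this shell session.
ulimit -Sn
ulimit -Hn

# To raise the soft limit for the current session (up to the hard limit):
#   ulimit -n 65536
# For a persistent limit on systemd-managed servers, set LimitNOFILE=
# in the service unit file instead.
```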
We increased the file descriptor limit before.
For more context, we saw the memory usage continuously increasing over time, without resources ever being freed. After changing the adapter back to Cowboy, we don't see this happening anymore, and we don't see any errors indicating too many connections.
Could you pin down which resources wouldn't get freed? I would assume ports / sockets, but it would be nice to have further information.
I'm extremely interested / perplexed by this!
My first hunch here is that there's some aspect of your situation that's causing connections to not be closed off once you're done with them. My initial guess would be that there may be a bug in our HTTP/1 keepalive logic that's causing us to not be as aggressive as Cowboy in closing off lingering keep alive clients.
Some questions to help diagnose:
(On the off chance that you're able to share aspects of the project causing this, it would help immensely in diagnosing.)
Looking forward to hearing more!
* What mix of endpoints / transports are you using? HTTP/HTTPS? 1.1/2? WebSockets?
The endpoint serves HTTP 1.1 and WebSockets (LiveView) behind an Envoy proxy that terminates TLS.
* What config are you starting Bandit / Cowboy up with?
Only defaults.
* Do you see any other Bandit / Thousand Island messages in your logs (specifically timeout messages)?
* Do the errors happen over time, all at once, or after a certain number of connections? Anything we can do to try and repro?
* What sort of connection rate (distinct connections / second, requests per connection, etc.) does the system in question serve?
* Are you able to share methods & charts that demonstrate "memory usage continuously increasing over time"? Do you happen to have a metric available for the total number of VM processes as well? Can you share that?
I'll have to get back to you for that. We're pretty busy at the moment, and I'll be out of office from next week, so it may take a while. The only thing I can give you right now is this 24h error distribution chart:
From top to bottom:
We also saw a couple of `** (exit) "Not a valid WebSocket upgrade request"` errors.
When you're running under Cowboy, do you ever see log messages like the following?
Ranch acceptor reducing accept rate: out of file descriptors
Looking at where this error is happening in Thousand Island, the corresponding implementation in Ranch explicitly handles `:emfile` errors by waiting for 100ms and retrying the accept call. Ranch and Thousand Island are very similar in structure in this regard, so I would expect that Ranch is seeing the same `:emfile` error we are; they're just handling it.
If this is the case, I'll add a similar clause to Thousand Island (though I may handle it a tad differently; busy-waiting without end like Ranch does is a bit sketchy).
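The Ranch-style handling described above can be sketched as follows. This is a hypothetical illustration, not Thousand Island's or Ranch's actual code, and `handle_connection/1` is a stub standing in for spawning a real handler process:

```elixir
defmodule AcceptorSketch do
  # Sketch of an acceptor loop that backs off on :emfile instead of
  # crashing: when the OS is out of file descriptors, wait 100ms and
  # retry the accept call (the behavior Ranch implements).
  def accept_loop(listener_socket) do
    case :gen_tcp.accept(listener_socket) do
      {:ok, socket} ->
        # Hand the socket off to a connection handler, then keep accepting.
        handle_connection(socket)
        accept_loop(listener_socket)

      {:error, :emfile} ->
        # Out of file descriptors: back off briefly and retry.
        Process.sleep(100)
        accept_loop(listener_socket)

      {:error, reason} ->
        {:stop, reason}
    end
  end

  # Stub: a real server would spawn a supervised process per connection.
  defp handle_connection(socket), do: :gen_tcp.close(socket)
end
```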
I didn't find that message in our logs.
I've added connection count limiting to Thousand Island (released as 0.6.6). This doesn't fix the core of the issue (why you're seeing increased memory usage and `:emfile` errors), but it should help mitigate the issue for others via configuration levers, and brings us up to parity with Ranch's support for this.
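For anyone wanting to pull that configuration lever, a hedged sketch of what the endpoint config might look like. The option names here (`thousand_island_options`, `num_connections`) and the limit value are assumptions based on this thread; check the Thousand Island 0.6.6 changelog and docs for the exact names:

```elixir
# Illustrative only — option names are assumptions, not verified
# against the released API.
config :my_app, MyAppWeb.Endpoint,
  http: [
    port: 4000,
    thousand_island_options: [num_connections: 16_384]
  ]
```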
I'm still looking into the underlying issue, but I'm not able to reproduce anything locally. I've run a bunch of long-running soak tests on the HTTP/1, h2, and WebSocket stacks and I'm seeing what I would expect: no process growth and steady long-term memory usage. I still think that there's something in your setup that's causing connections to be held open longer than expected. This would explain everything you're seeing (growing memory use over time, and an eventual exhaustion of open file handles). My biggest suspects here are keepalive handling and/or a possible bug with handling invalid websocket upgrades.
To help debug this further, would you be able to provide:
In the meantime, if you're game to help debug this, you could try adding `http_1_options: [max_requests: 1]` to your Bandit config. This would manifest in your Phoenix app's config.exs (or env sub-config) like so:

```elixir
config :my_app, MyAppWeb.Endpoint,
  http: [...(port: xxx, other options)..., http_1_options: [max_requests: 1]],
  ...
```
This will cause Bandit's HTTP/1 stack to close the connection after each request, effectively disabling keepalives. If you could run this for a long enough period of time that you would normally have seen an error and see if it helps matters, it would be a great place to help bisect this error.
Thanks!
Okay, I ran this config. This is my config.exs:
```elixir
config :myapp, AppWeb.Endpoint,
  adapter: Bandit.PhoenixAdapter,
  url: [host: "localhost"],
```
and my dev.exs:
```elixir
config :myapp, AppWeb.Endpoint,
  http: [ip: {127, 0, 0, 1}, port: 4000],
  http_1_options: [max_requests: 1],
```
With that http_1_options setting on, it was good until I swapped out the config. After I commented that option out and reran the app, the VM blew up within 5-10 minutes with:
```
[]},
 {'Elixir.Kernel',inspect,2,
     [{file,"lib/kernel.ex"},{line,2254}]},
 {'Elixir.Logger.Handler',do_log,4,
     [{file,"lib/logger/handler.ex"},{line,141}]},
 {'Elixir.Logger.Handler',log,2,
     [{file,"lib/logger/handler.ex"},{line,84}]}]}
[warning] Attempted to log 140761 messages, which is above :discard_threshold
[warning] Attempted to log 0 messages, which is below :discard_threshold
```
I am using the latest build, 1.0.0-pre.2. Let me know if there are more diagnostics I can run to help you narrow this down.
Alright, something weird is going on. I originally ran it with the http_1_options setting on, then I turned it off and ran into a bunch of emfile errors. Then I took out the option, swapped out Bandit, and went back to Cowboy.

While I was using another part of the application, I ran into the emfile error, which had never happened before. I deleted my entire build dir and recompiled without Bandit; we will see what happens.

It seems that it had some side effect on the Erlang installation in the asdf dir.
@ringofhealth a couple of things that would be good to know:
- What mix of endpoints / transports are you using? HTTP/HTTPS? 1.1/2? WebSockets?
- What sort of clients are they? Are they direct from a browser or are they being forwarded through a reverse proxy? Which one if so?
- Roughly how many connections per second are you handling? How long / how many connections does it take to start seeing emfile errors?
@ringofhealth I'm confused about your indication of `max_requests: 1` solving the problem. Are you saying that:
- When you used a default Bandit config (i.e., WITHOUT `max_requests: 1`), you saw the issue occur
- When you added `max_requests: 1` to your config, it fixed the problem and the issue did not recur?
Yes, I was confused about this too. Let me walk you through the timeline:
- added Bandit, with the max_requests: 1 config on
- this seemingly was good for a while; I used it for about an hour to two. Then I decided to try it without the max_requests: 1 setting
- swapped out the adapter line from Bandit to Cowboy

@ringofhealth those are some very interesting points! I've tried to reproduce this locally and can't seem to (despite working in a Phoenix app for an hour or so, I'm not seeing anything unexpected in any process or file metrics). A few questions:
- What OTP/Erlang are you using? (Copying & pasting the startup banner from an iex session would provide everything I'd need to know.)
- What OS are you on? The output of `uname -a` would be great!
If you're able to, it would be great if you could:
- add `adapter: Bandit.PhoenixAdapter` to your config, WITHOUT any special options such as max_requests
- `rm -rf _build`, just to ensure you get any old artifacts out of the way
- grab a screenshot of http://localhost:4000/dev/dashboard at startup, and after a few minutes of development (you'd indicated that the error occurs after 5-10 minutes of working, so maybe grab a screenshot after 2-3 minutes)

Thanks so much!
Alright, will do! I'll start some test sessions this weekend.
I think I've figured this out. Point release coming in the next few hours with a fix for testing!
I just released Bandit 1.0.0-pre.3 with a fix for this issue (I hope!)
Please update to 1.0.0-pre.3 or newer of Bandit and ensure that the issue no longer occurs. You DO NOT need to use `max_requests: 1` any longer.
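For completeness, updating is just a dependency bump. A minimal sketch of the relevant mix.exs entry (everything other than the `:bandit` package name and the version from this thread is a placeholder):

```elixir
# In mix.exs — require the fixed prerelease or anything newer.
defp deps do
  [
    {:bandit, ">= 1.0.0-pre.3"}
  ]
end
```

followed by `mix deps.update bandit`.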
Thank you for digging into this. It does look like the lib is no longer throwing emfile errors.
I do notice this issue in local dev mode; not sure why this is coming up now:
```
[info] GET /live/longpoll
[debug] ** (Phoenix.Router.NoRouteError) no route found for GET /live/longpoll (HelioWeb.Router)
    (helio 0.1.0) lib/phoenix/router.ex:487: HelioWeb.Router.call/2
    (helio 0.1.0) lib/helio_web/endpoint.ex:1: HelioWeb.Endpoint.plug_builder_call/2
    (helio 0.1.0) lib/plug/debugger.ex:136: HelioWeb.Endpoint."call (overridable 3)"/2
    (helio 0.1.0) lib/helio_web/endpoint.ex:1: HelioWeb.Endpoint.call/2
    (phoenix 1.7.2) lib/phoenix/endpoint/sync_code_reload_plug.ex:22: Phoenix.Endpoint.SyncCodeReloadPlug.do_call/4
    (bandit 1.0.0-pre.3) lib/bandit/pipeline.ex:89: Bandit.Pipeline.call_plug/2
    (bandit 1.0.0-pre.3) lib/bandit/pipeline.ex:27: Bandit.Pipeline.run/6
    (bandit 1.0.0-pre.3) lib/bandit/http1/handler.ex:29: Bandit.HTTP1.Handler.handle_data/3
    (bandit 1.0.0-pre.3) lib/bandit/delegating_handler.ex:18: Bandit.DelegatingHandler.handle_data/3
    (bandit 1.0.0-pre.3) lib/thousand_island/handler.ex:355: Bandit.DelegatingHandler.handle_continue/2
    (stdlib 4.1.1) gen_server.erl:1123: :gen_server.try_dispatch/4
    (stdlib 4.1.1) gen_server.erl:865: :gen_server.loop/7
    (stdlib 4.1.1) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
[error] GenServer #PID<0.4812.0> terminating
** (Phoenix.Router.NoRouteError) no route found for GET /live/longpoll (HelioWeb.Router)
    (helio 0.1.0) lib/phoenix/router.ex:487: HelioWeb.Router.call/2
    (helio 0.1.0) lib/helio_web/endpoint.ex:1: HelioWeb.Endpoint.plug_builder_call/2
    (helio 0.1.0) lib/plug/debugger.ex:136: HelioWeb.Endpoint."call (overridable 3)"/2
    (helio 0.1.0) lib/helio_web/endpoint.ex:1: HelioWeb.Endpoint.call/2
    (phoenix 1.7.2) lib/phoenix/endpoint/sync_code_reload_plug.ex:22: Phoenix.Endpoint.SyncCodeReloadPlug.do_call/4
    (bandit 1.0.0-pre.3) lib/bandit/pipeline.ex:89: Bandit.Pipeline.call_plug/2
    (bandit 1.0.0-pre.3) lib/bandit/pipeline.ex:27: Bandit.Pipeline.run/6
    (bandit 1.0.0-pre.3) lib/bandit/http1/handler.ex:29: Bandit.HTTP1.Handler.handle_data/3
    (bandit 1.0.0-pre.3) lib/bandit/delegating_handler.ex:18: Bandit.DelegatingHandler.handle_data/3
    (bandit 1.0.0-pre.3) lib/thousand_island/handler.ex:355: Bandit.DelegatingHandler.handle_continue/2
    (stdlib 4.1.1) gen_server.erl:1123: :gen_server.try_dispatch/4
    (stdlib 4.1.1) gen_server.erl:865: :gen_server.loop/7
    (stdlib 4.1.1) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
Last message: {:continue, :handle_connection}
State: {%ThousandIsland.Socket{socket: #Port<0.117>, transport_module: ThousandIsland.Transports.TCP, read_timeout: 60000, span: %ThousandIsland.Telemetry{span_name: :connection, telemetry_span_context: #Reference<0.494451594.3792437251.136067>, start_time: -576460569945550400, start_metadata: %{parent_telemetry_span_context: #Reference<0.494451594.3792437249.137336>, remote_address: {127, 0, 0, 1}, remote_port: 57111, telemetry_span_context: #Reference<0.494451594.3792437251.136067>}}}, %{handler_module: Bandit.InitialHandler, http_1_enabled: true, http_2_enabled: true, opts: %{http_1: [], http_2: [], websocket: []}, plug: {Phoenix.Endpoint.SyncCodeReloadPlug, {HelioWeb.Endpoint, []}}, websocket_enabled: true}}
```
Yep; that's being worked upstream at https://github.com/phoenixframework/phoenix/issues/5445
In the meantime it's great to hear we've solved the original issue here; thanks for your help and input!
I'll wait a few days for @woylie to chime in before closing this issue.
ah I see, thanks for clarifying. Going to deploy this to staging env and let it soak for a few days, will report back status
Thank you for digging into this. I just updated Bandit to 1.0.0-pre.3, and we'll be monitoring this. I'll let you know if this occurs again, and if it doesn't occur again, you'll hear from me in a week.
Sorry, forgot about this. We haven't encountered that issue again since switching back to Bandit.
Great to hear! Thanks again for the original issue!
@ringofhealth the leaking NoRouteError upstream issue has been merged - this should be fixed in the next Phoenix release.
We're getting a lot of these errors:
Paired with:
This happens in a Phoenix application with Thousand Island 0.6.5 and Bandit 0.7.4.
Any indication what could cause this?