newrelic / elixir_agent

New Relic's Open Source Elixir Agent
https://hex.pm/packages/new_relic_agent
Apache License 2.0
254 stars 97 forks source link

Handler {:new_relic, :plug} has failed and has been detached. #447

Open camfeghali opened 1 week ago

camfeghali commented 1 week ago

The bug New Relic Plug handler fails to process certain telemetry events due to a :badkey error. This results in the handler being detached and further telemetry collection disrupted. I am using Bandit v1.5

Logs Both logs happen at the same timestamp

** (Bandit.HTTPError) Too many headers
Log: [error] Handler {:new_relic, :plug} has failed and has been detached. Class=:error
Reason: {:badkey, :conn,
%{
  telemetry_span_context: #Reference<0.1763085910.242745346.37981>,
  connection_telemetry_span_context: #Reference<0.1763085910.242745346.7102>
}}

Stacktrace:
[
  {NewRelic.Telemetry.Plug, :get_headers, 2,
   [file: "lib/new_relic/telemetry/plug.ex", line: 272]},
  {NewRelic.Telemetry.Plug, :handle_event, 4,
   [file: "lib/new_relic/telemetry/plug.ex", line: 102]},
  {:telemetry, :"execute/3-fun-0", 4,
   [file: "/app/deps/telemetry/src/telemetry.erl", line: 167]},
  {:lists, :foreach_1, 2, [file: "lists.erl", line: 1686]},
  {Bandit.Telemetry, :start_span, 3, [file: "lib/bandit/telemetry.ex", line: 168]},
  {Bandit.Pipeline, :run, 4, [file: "lib/bandit/pipeline.ex", line: 53]},
  {Bandit.HTTP1.Handler, :handle_data, 3,
   [file: "lib/bandit/http1/handler.ex", line: 12]},
  {Bandit.DelegatingHandler, :handle_data, 3,
   [file: "lib/bandit/delegating_handler.ex", line: 18]}
]

Environment

A assumption we're making is that the NR agent is a supervisor, and that its children should restart if they fail.

Any help would be much appreciated :)

binaryseed commented 3 days ago

It looks like Bandit is raising an error when reading headers because it's reached it's maximum.

I believe that eventually somewhere in your app you'd encounter the same error, it's just blowing up here first because we try to instrument things as early as possible, so we're the first to try and read headers.

It looks like you can configure Bandit to allow more headers with max_header_count

https://hexdocs.pm/bandit/Bandit.html#t:http_1_options/0

We can also rescue this particular error so this blows up elsewhere, instead of inside the agent

binaryseed commented 3 days ago

Hmm, actually, I think my interpretation is slightly off. Bandit does raise, but it looks like actually it's executing the telemetry callback without the expected conn inside the meta

binaryseed commented 3 days ago

Yeah, I think that's right

https://github.com/mtrudel/bandit/blob/main/lib/bandit/telemetry.ex#L24C4-L26C17

Not present in cases where error is also set and the nature of error is such that Bandit was unable to successfully build the conn

camfeghali commented 2 days ago

Hey @binaryseed thanks for getting back to me!

I enabled more logs, and saw that a (Bandit.HTTPError) schemeURI is not supported is also thrown. Could it be because someone is trying to connect using something that is not tcp ? I currently allow ingress on ports from 22 to 4000, which is causing Bandit to throw this error ?

binaryseed commented 2 days ago

I'm not sure about that, might want to investigate inside bandit to see what that's about