Exceptions get logged from Firefox client, but not from cURL

DHager commented 5 months ago

This might be a "undefined behavior, you're using it wrong", but it's sufficiently weird that I want to check. We've been using cowboy2 for a bit and started to use bandit for some endpoints, and ran into an issue where a server-side exception sometimes does not get any error logging... depending on HTTP client.

For example, suppose I drop manual ArgumentError into my controller, and load up the page in Firefox. STDOUT shows a [debug] and an [error], which is in this context what I would expect:

10:56:26.244 application=phoenix request_id=F9WQNBLv85i-0ZwAAAFF line=248 file=lib/phoenix/logger.ex ip=127.0.0.1 [debug] Converted error ArgumentError to 500 response
10:56:26.246 request_id=F9WQNBLv85i-0ZwAAAFF line=1391 file=gen_server.erl ip=127.0.0.1 [error] GenServer #PID<0.1663.0> terminating
** (ArgumentError) Forced Example Exception
    (app 0.2.18) lib/foo/api/internal/controllers/bar_controller.ex:25: Acme.Foo.API.Internal.BarController.index/2
    (app 0.2.18) lib/foo/api/internal/controllers/bar_controller.ex:1: Acme.Foo.API.Internal.BarController.action/2
    (app 0.2.18) lib/foo/api/internal/controllers/bar_controller.ex:1: Acme.Foo.API.Internal.BarController.phoenix_controller_pipeline/2
    (phoenix 1.7.3) lib/phoenix/router.ex:430: Phoenix.Router.__call__/5
    (app 0.2.18) lib/foo/api/internal/endpoint.ex:1: Acme.Foo.API.Internal.Endpoint.plug_builder_call/2
    (app 0.2.18) lib/foo/api/internal/endpoint.ex:1: Acme.Foo.API.Internal.Endpoint.call/2
    (bandit 1.4.1) lib/bandit/pipeline.ex:150: Bandit.Pipeline.call_plug!/2
    (bandit 1.4.1) lib/bandit/pipeline.ex:36: Bandit.Pipeline.run/4
    (bandit 1.4.1) lib/bandit/http2/stream_process.ex:28: Bandit.HTTP2.StreamProcess.handle_continue/2
    (stdlib 5.2) gen_server.erl:1085: :gen_server.try_handle_continue/3
    (stdlib 5.2) gen_server.erl:995: :gen_server.loop/7
    (stdlib 5.2) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
Last message: {:continue, :start_stream}
State: {%Bandit.HTTP2.Stream{connection_pid: #PID<0.1662.0>, stream_id: 15, state: :idle, recv_window_size: 65535, send_window_size: 131072, bytes_remaining: nil, transport_info: %Bandit.TransportInfo{secure?: true, sockname: {{127, 0, 0, 1}, 4004}, peername: {{127, 0, 0, 1}, 53344}, peercert: nil}, read_timeout: 15000}, {Acme.Foo.API.Internal.Endpoint, []}, %ThousandIsland.Telemetry{span_name: :connection, telemetry_span_context: #Reference<0.907864227.2392326150.249420>, start_time: -576460721075489802, start_metadata: %{remote_port: 53344, remote_address: {127, 0, 0, 1}, telemetry_span_context: #Reference<0.907864227.2392326150.249420>, parent_telemetry_span_context: #Reference<0.907864227.2392326150.249211>}}, %{http: [], websocket: [], http_1: [], http_2: []}}

However when I use "Copy as cURL" and run that (even including Firefox user-agent headers) the [debug] is still there, and the client gets the expected result, but there is no [error] anymore. This difference exists even if I strip down the curl command to the basics.

I did find this 1.4.0 changelog entry (from #286) which may be relevant:

We no longer log if processes that are linked to an HTTP/2 stream process terminate unexpectedly. This has always been unspecified behaviour so is not considered a breaking change

Questions:

What causes this difference? Perhaps something about how the different clients negotiate HTTP/2 that affects genserver lifetimes?
For anyone who was relying on "unspecified behavior", what route should take care of unexpected exceptions? Is the assumption that consumers must implement their own Logger.error calls inside a Phoenix ErrorView module?

mtrudel commented 5 months ago

Assuming they're both coming via http/2 it could be any number of things. Since you're raising this as an error from your controller, the changes in 1.5.0 are probably relevant to your interests (it looks like you're using 1.4.2 above).

I'd be interested in seeing a repro of this! There's no end of interesting edge cases coming from Firefox (not 'wrong', just 'different').

DHager commented 5 months ago

Since you're raising this as an error from your controller, the changes in 1.5.0 are probably relevant to your interests (it looks like you're using 1.4.2 above).

At the moment I'm more concerned about logging unpredictable and arbitrary exceptions: If an undreamt-of CthulhuException rises from the stack-abyss to topple the production endpoints of man, we should at least get an alert e-mail about it. :p

mtrudel commented 5 months ago

CthulhuExceptions should be being logged in all cases (1.4.2 and 1.5.0+) but considering that the main focus of the 1.5.0 release was around logging of exceptions (and, if there's a bug fix to come from this, it'll be against the current version), it'd be much easier to establish a repro on the common ground of the current release.

DHager commented 5 months ago

I'll keep poking at it, but it does seem to happen with 1.5.2 as well.

* bandit 1.5.2 (Hex package) (mix)
* hpax 0.1.2 (Hex package) (mix)
* plug 1.16.0 (Hex package) (mix)
* telemetry 1.2.1 (Hex package) (rebar3)
* thousand_island 1.3.5 (Hex package) (mix)

mtrudel commented 5 months ago

Minimal repro case is raising an ArgumentError inside a phoenix controller and accessing it via Firefox / curl? How are you realizing http/2 connections? HTTPS? Proxy server?

DHager commented 5 months ago

I think this may be a race-condition where the process serving the request is getting terminated--silently!--after it sends data to the client but before it can finish logging properly. Perhaps there's some kind of bug in some code somewhere that supervises things and determines how long to spend on requests?

The code where it happens (on my box)

In my declining sanity, I've resorted to brute force of putting in IO.puts/1 calls inside Bandit.Pipeline.handle_error/5, like so:

defp handle_error(error, stacktrace, transport, span, opts) do
    #IEx.pry()
    IO.puts(">>>HANDLE_ERROR_2")
    Bandit.Telemetry.span_exception(span, :exit, error, stacktrace)
    IO.puts(">>>HANDLE_ERROR_2B")
    status = error |> Plug.Exception.status() |> Plug.Conn.Status.code()
    IO.puts(">>>HANDLE_ERROR_2C")
    if status in Keyword.get(opts.http, :log_exceptions_with_status_codes, 500..599) do
      IO.puts(">>>HANDLE_ERROR_ERRSTATUS_1")
      Logger.error(Exception.format(:error, error, stacktrace))
      IO.puts(">>>HANDLE_ERROR_ERRSTATUS_2")
      Bandit.HTTPTransport.send_on_error(transport, error)
      IO.puts(">>>HANDLE_ERROR_ERRSTATUS_3")
      {:error, error}
    else
      IO.puts(">>>HANDLE_ERROR_OTHERSTATUS_1")
      Bandit.HTTPTransport.send_on_error(transport, error)
      IO.puts(">>>HANDLE_ERROR_OTHERSTATUS_2")
      {:ok, transport}
    end
  end

Naturally, this requires some extra steps like mix deps.compile bandit to put the local changes into effect.

What gets printed

When making requests with Firefox or Chrome, this handler function completes entirely, and I also get the expected error logging. However with cURL, I get varying amounts of output! For example, sometimes it's as short as:

>>>HANDLE_ERROR_2
>>>HANDLE_ERROR_2B

While other times it gets a bit further:

>>>HANDLE_ERROR_2
>>>HANDLE_ERROR_2B
>>>HANDLE_ERROR_2C
>>>HANDLE_ERROR_ERRSTATUS_1

Possibly if I try it enough times, it'll eventually get further and log the error?

Interactive with `IEx.pry/1`

Additional evidence for "the process is being killed" comes from using an IEx.pry/1 call at the start of the method and starting up with iex -S mix phx.server.

With Firefox/Chrome triggering, everything works as expected, and I am able to confirm the prompt and dig around and them respawn to continue execution.

However in the buggy cause triggered with curl, the process is dead before I can do anything:

Request to pry #PID<0.2218.0> at Bandit.Pipeline.handle_error/5 (deps/bandit/lib/bandit/pipeline.ex:229)
  226:   end
  227: 
  228:   defp handle_error(error, stacktrace, transport, span, opts) do
  229:     IEx.pry()
  230:     IO.puts(">>>HANDLE_ERROR_2")
  231:     Bandit.Telemetry.span_exception(span, :exit, error, stacktrace)
  232:     IO.puts(">>>HANDLE_ERROR_2B")
Allow? [Yn] y

Interactive Elixir (1.15.7) - press Ctrl+C to exit (type h() ENTER for help)
** (EXIT from #PID<0.2218.0>) shell process exited with reason: no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started

Other

Minimal repro case is raising an ArgumentError inside a phoenix controller and accessing it via Firefox / curl?

Yes. Or at least, I haven't identified any other critical conditions yet. That said, calling curl multiple times shows variability in how much works and how much fails.

How are you realizing http/2 connections? HTTPS? Proxy server?

In this case it's all on localhost, no proxies or anything. Our application has multiple endpoints on different ports, and for the newer ones being developed we're using bandit instead of cowboy2.

mtrudel commented 5 months ago

As discussed with @alisinabh, this is likely due to the fact that curl shuts down the HTTP2 connection as soon as its one request is complete, where Firefox will keep the connection open. Because of the way that the Bandit HTTP/2 process model works, the parent connection process being terminated causes the child request process to terminate before it has a chance to complete its logging.

I'll look into getting the connection process join on any living children for some amount of time before shutting them down.

DHager commented 5 months ago

Sounds good, just adding another data-point to confirm: If I call curl with additional URLs on the same invocation, then the issue only shows up if the problem-URL is the last one listed. I assume this is because curl is keeping its connections while it works on the URLs, since a later URL may benefit from a connection established previously.

That said, this behavior could potentially occur with other tools or libraries.

mtrudel commented 5 months ago

Thanks for that - that just about confirms my hypothesis.

There's nothing 'wrong' here from a safety or a code correctness perspective, this is just a classic race condition. Should be pretty easy to mitigate.

Does the condition happen 100% of the time with curl?

DHager commented 5 months ago

Does the condition happen 100% of the time with curl?

To invert the saying: "It never works on my machine." However someone with a slow network or slow client-computer might not trigger the issue, since it gets very close to the logging statement before termination.

I can repro the problem with Chrome by:

Inside Pipeline.handle_error/5 and immediately before the Logger.error/1 line, add :thread.sleep(5000). This widens the time-window to a comfortable 5 seconds. (Recompile dependencies, relaunch server, etc.)
Open up the browser with only one window/tab.
Navigate to the problematic URL.
Immediately (i.e. within 5 seconds) close the browser, indirectly forcing it to terminate the connection sooner than it would normally choose to.
Wait out the remaining time, and observe that there's no log output on the server.

Conversely, skipping step 4 will cause the log entry to appear on that 5-second delay.

adonig commented 3 months ago

I'm not sure whether this is related. I run Phoenix 1.7.14 and get an error when I access my endpoint with a browser:

[info] Sent 200 in 516ms
[error] ** (Bandit.HTTPError) closed

This doesn't happen when I call it with CURL.

mtrudel commented 2 weeks ago

I have a fix for this that allows HTTP/2 stream processes to continue running while the underlying connection is being / has been terminated, but considering it in light of #399 / #406 / #408 / others, I think it's actually moving in the wrong direction. We should be moving in the direction of terminating stream processes as soon as possible after the client disconnects, not leaving them open longer (the inability to do this effectively on the HTTP/1 side is the root of all of those issues above). Fundamentally, this race condition is a side effect of this termination pattern which I think is a 'good' thing even if it's somewhat non-deterministic by nature.

Closing for hygiene. If you think there's something more to this, or have any other issues, please feel free to reopen as needed!

DHager commented 1 week ago

@mtrudel My primary concern is avoiding this scenario:

Someone makes a web-service with bandit.
One day something goes seriously wrong and it partly or entirely stops working.
Due to this race-condition, there no errors are logged, so they are not notified, and after customer complaints pile up they have no hint at what's breaking.

mtrudel / bandit