Open DHager opened 3 months ago
Assuming they're both coming via http/2 it could be any number of things. Since you're raising this as an error from your controller, the changes in 1.5.0 are probably relevant to your interests (it looks like you're using 1.4.2 above).
I'd be interested in seeing a repro of this! There's no end of interesting edge cases coming from Firefox (not 'wrong', just 'different').
Since you're raising this as an error from your controller, the changes in 1.5.0 are probably relevant to your interests (it looks like you're using 1.4.2 above).
At the moment I'm more concerned about logging unpredictable and arbitrary exceptions: If an undreamt-of CthulhuException
rises from the stack-abyss to topple the production endpoints of man, we should at least get an alert e-mail about it. :p
CthulhuException
s should be being logged in all cases (1.4.2 and 1.5.0+) but considering that the main focus of the 1.5.0 release was around logging of exceptions (and, if there's a bug fix to come from this, it'll be against the current version), it'd be much easier to establish a repro on the common ground of the current release.
I'll keep poking at it, but it does seem to happen with 1.5.2 as well.
* bandit 1.5.2 (Hex package) (mix)
* hpax 0.1.2 (Hex package) (mix)
* plug 1.16.0 (Hex package) (mix)
* telemetry 1.2.1 (Hex package) (rebar3)
* thousand_island 1.3.5 (Hex package) (mix)
Minimal repro case is raising an ArgumentError
inside a phoenix controller and accessing it via Firefox / curl? How are you realizing http/2 connections? HTTPS? Proxy server?
I think this may be a race-condition where the process serving the request is getting terminated--silently!--after it sends data to the client but before it can finish logging properly. Perhaps there's some kind of bug in some code somewhere that supervises things and determines how long to spend on requests?
In my declining sanity, I've resorted to brute force of putting in IO.puts/1
calls inside Bandit.Pipeline.handle_error/5
, like so:
defp handle_error(error, stacktrace, transport, span, opts) do
#IEx.pry()
IO.puts(">>>HANDLE_ERROR_2")
Bandit.Telemetry.span_exception(span, :exit, error, stacktrace)
IO.puts(">>>HANDLE_ERROR_2B")
status = error |> Plug.Exception.status() |> Plug.Conn.Status.code()
IO.puts(">>>HANDLE_ERROR_2C")
if status in Keyword.get(opts.http, :log_exceptions_with_status_codes, 500..599) do
IO.puts(">>>HANDLE_ERROR_ERRSTATUS_1")
Logger.error(Exception.format(:error, error, stacktrace))
IO.puts(">>>HANDLE_ERROR_ERRSTATUS_2")
Bandit.HTTPTransport.send_on_error(transport, error)
IO.puts(">>>HANDLE_ERROR_ERRSTATUS_3")
{:error, error}
else
IO.puts(">>>HANDLE_ERROR_OTHERSTATUS_1")
Bandit.HTTPTransport.send_on_error(transport, error)
IO.puts(">>>HANDLE_ERROR_OTHERSTATUS_2")
{:ok, transport}
end
end
Naturally, this requires some extra steps like mix deps.compile bandit
to put the local changes into effect.
When making requests with Firefox or Chrome, this handler function completes entirely, and I also get the expected error logging. However with cURL, I get varying amounts of output! For example, sometimes it's as short as:
>>>HANDLE_ERROR_2
>>>HANDLE_ERROR_2B
While other times it gets a bit further:
>>>HANDLE_ERROR_2
>>>HANDLE_ERROR_2B
>>>HANDLE_ERROR_2C
>>>HANDLE_ERROR_ERRSTATUS_1
Possibly if I try it enough times, it'll eventually get further and log the error?
IEx.pry/1
Additional evidence for "the process is being killed" comes from using an IEx.pry/1
call at the start of the method and starting up with iex -S mix phx.server
.
With Firefox/Chrome triggering, everything works as expected, and I am able to confirm the prompt and dig around and them respawn
to continue execution.
However in the buggy cause triggered with curl, the process is dead before I can do anything:
Request to pry #PID<0.2218.0> at Bandit.Pipeline.handle_error/5 (deps/bandit/lib/bandit/pipeline.ex:229)
226: end
227:
228: defp handle_error(error, stacktrace, transport, span, opts) do
229: IEx.pry()
230: IO.puts(">>>HANDLE_ERROR_2")
231: Bandit.Telemetry.span_exception(span, :exit, error, stacktrace)
232: IO.puts(">>>HANDLE_ERROR_2B")
Allow? [Yn] y
Interactive Elixir (1.15.7) - press Ctrl+C to exit (type h() ENTER for help)
** (EXIT from #PID<0.2218.0>) shell process exited with reason: no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
Minimal repro case is raising an ArgumentError inside a phoenix controller and accessing it via Firefox / curl?
Yes. Or at least, I haven't identified any other critical conditions yet. That said, calling curl multiple times shows variability in how much works and how much fails.
How are you realizing http/2 connections? HTTPS? Proxy server?
In this case it's all on localhost, no proxies or anything. Our application has multiple endpoints on different ports, and for the newer ones being developed we're using bandit instead of cowboy2.
As discussed with @alisinabh, this is likely due to the fact that curl shuts down the HTTP2 connection as soon as its one request is complete, where Firefox will keep the connection open. Because of the way that the Bandit HTTP/2 process model works, the parent connection process being terminated causes the child request process to terminate before it has a chance to complete its logging.
I'll look into getting the connection process join on any living children for some amount of time before shutting them down.
Sounds good, just adding another data-point to confirm: If I call curl
with additional URLs on the same invocation, then the issue only shows up if the problem-URL is the last one listed. I assume this is because curl is keeping its connections while it works on the URLs, since a later URL may benefit from a connection established previously.
That said, this behavior could potentially occur with other tools or libraries.
Thanks for that - that just about confirms my hypothesis.
There's nothing 'wrong' here from a safety or a code correctness perspective, this is just a classic race condition. Should be pretty easy to mitigate.
Does the condition happen 100% of the time with curl?
Does the condition happen 100% of the time with curl?
To invert the saying: "It never works on my machine." However someone with a slow network or slow client-computer might not trigger the issue, since it gets very close to the logging statement before termination.
I can repro the problem with Chrome by:
Inside Pipeline.handle_error/5
and immediately before the Logger.error/1
line, add :thread.sleep(5000)
. This widens the time-window to a comfortable 5 seconds. (Recompile dependencies, relaunch server, etc.)
Open up the browser with only one window/tab.
Navigate to the problematic URL.
Immediately (i.e. within 5 seconds) close the browser, indirectly forcing it to terminate the connection sooner than it would normally choose to.
Wait out the remaining time, and observe that there's no log output on the server.
Conversely, skipping step 4 will cause the log entry to appear on that 5-second delay.
I'm not sure whether this is related. I run Phoenix 1.7.14 and get an error when I access my endpoint with a browser:
[info] Sent 200 in 516ms
[error] ** (Bandit.HTTPError) closed
This doesn't happen when I call it with CURL.
This might be a "undefined behavior, you're using it wrong", but it's sufficiently weird that I want to check. We've been using
cowboy2
for a bit and started to usebandit
for some endpoints, and ran into an issue where a server-side exception sometimes does not get any error logging... depending on HTTP client.For example, suppose I drop manual
ArgumentError
into my controller, and load up the page in Firefox. STDOUT shows a[debug]
and an[error]
, which is in this context what I would expect:However when I use "Copy as cURL" and run that (even including Firefox user-agent headers) the
[debug]
is still there, and the client gets the expected result, but there is no[error]
anymore. This difference exists even if I strip down thecurl
command to the basics.I did find this 1.4.0 changelog entry (from #286) which may be relevant:
Questions:
Logger.error
calls inside a PhoenixErrorView
module?