mtrudel / bandit

Bandit is a pure Elixir HTTP server for Plug & WebSock applications
MIT License
1.67k stars 80 forks source link

Missing measurements in telemetry events #340

Closed danschultzer closed 4 months ago

danschultzer commented 5 months ago

Hey @mtrudel,

We're looking to update opentelemetry_bandit to support for Bandit 1.4 breaking changes in https://github.com/open-telemetry/opentelemetry-erlang-contrib/pull/311 and measurements are missing from some [:bandit, :request, :stop] telemetry events.

This happened for 404s with a Phoenix app.

Looking up Bandit.Telemetry.stop_span calls there are two places where measurements are not set:

https://github.com/mtrudel/bandit/blob/4f15f029e7aa17f8e7f98d55b0e94c684dee0971/lib/bandit/pipeline.ex#L54 https://github.com/mtrudel/bandit/blob/4f15f029e7aa17f8e7f98d55b0e94c684dee0971/lib/bandit/pipeline.ex#L74

In't this incorrect as Bandit.Telemetry explicitly states that there will be the above measurements in [:bandit, :request: stop] events?

https://github.com/mtrudel/bandit/blob/4f15f029e7aa17f8e7f98d55b0e94c684dee0971/lib/bandit/telemetry.ex#L30-L59

I haven't verified whether the 404 is getting into the latter exception handling, but I suspect it does.

danschultzer commented 5 months ago

Actually we don't see the exceptions happening in our use case. It's weird, it seems that the conn/conn.adapter doesn't persist with the measurement, even though send_resp gets called so it should have been set. It happens for the 404, but haven't had the time to grok the whole code flow. Bottomline, when stop_span gets called in the Bandit.Pipeline the measurement is lost.

mtrudel commented 5 months ago

The error case[^1] telemetry events won't have many / most of the measurements, this is ~more or less by design since it simplifies the mainline code so much. That being said, there should be some there; this may well be a bug.

[^1]: Error case meaning anything that Bandit returns internally (stream errors, semantic errors, etc). Any response that your plug sets on its own is considered 'normal' by Bandit, regardless of its response code.

danschultzer commented 5 months ago

Yeah, it makes sense there is not really any measurements for errors since it means it halted somewhere halfway so what are those measurements even good for. But shouldn't that be a [:bandit, :request, :exception] instead of [:bandit, :request, :stop] event then?

In any case that wouldn't address the issue we see in the otel library though, that's a 404 that is handled by Phoenix (no exception raised to Bandit), but somehow the measurements get lost. I'll see if I can make a reproducable example for you when I have some free time.

mtrudel commented 4 months ago

I'd love a repro for this if possible @danschultzer !

mtrudel commented 4 months ago

Looking into the pipeline code in more detail now that the post-1.4 world has settled down a bit and I think the only open question may be whether protocol errors should be exposed as a stop event or an exception event here:

https://github.com/mtrudel/bandit/blob/661596f157e208e921ea5b578d62ca32cd44934d/lib/bandit/pipeline.ex#L209

I'm open to either option, and am happy to take guidance from the telemetry folks on this

mtrudel commented 4 months ago

Closing due to inactivity. As I mentioned above, I'm happy to hew telemetry for these cases based on whatever otel folks want. Once / if someone has a concrete proposal just open a ticket up and I'll do my best to make it happen