aselder closed this issue 4 years ago
Yep, looks like you are correct to me. I think we need to hook into the [:phoenix, :endpoint, :start] and [:phoenix, :endpoint, :stop] events, and use those to start traces instead of spans. I've been too busy to do more than lightly manage things here. If you have some time to update the 1.5 integration to that effect, that would be awesome.
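For whoever picks it up, a minimal sketch of what that hook-in could look like, assuming a Spandex tracer module generated with use Spandex.Tracer (I'm calling it MyApp.Tracer here; the handler module and names are illustrative, not the actual integration code):

```elixir
defmodule MyApp.EndpointTracing do
  # Minimal sketch: start/finish a trace from the endpoint's telemetry events
  # instead of starting spans from the router.
  def attach do
    :telemetry.attach_many(
      "my-app-endpoint-tracing",
      [
        [:phoenix, :endpoint, :start],
        [:phoenix, :endpoint, :stop]
      ],
      &__MODULE__.handle_event/4,
      %{tracer: MyApp.Tracer}
    )
  end

  # Open the trace (and its top-level span) when the request enters the endpoint.
  def handle_event([:phoenix, :endpoint, :start], _measurements, _metadata, %{tracer: tracer}) do
    tracer.start_trace("request")
  end

  # Close the whole trace when the endpoint finishes the request.
  def handle_event([:phoenix, :endpoint, :stop], _measurements, _metadata, %{tracer: tracer}) do
    tracer.finish_trace()
  end
end
```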
@zachdaniel I'll give it a shot. Probably won't be until this weekend though.
No rush at all. Thank you for your time!
derp, i got all rushed on the router side of things and didn't hook into the endpoint's telemetry. you want to start/stop traces with the [:phoenix, :endpoint, :start] and [:phoenix, :endpoint, :stop] events that come out via the Phoenix installer's

# handle these events from your endpoint.ex
plug Plug.Telemetry, event_prefix: [:phoenix, :endpoint]

from your endpoint.
and more specifically, the missing code is the behavior from the 3 plugs: StartTrace, AddContext, FinishTrace
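roughly, StartTrace maps to the start handler sketched above, and the stop handler needs to pick up what AddContext/FinishTrace were doing, something like this (a sketch only; the option names follow Spandex's span options and `tracer` is whatever tracer the handler was attached with, not the library's actual code):

```elixir
# Sketch of the AddContext + FinishTrace behavior moved into the
# [:phoenix, :endpoint, :stop] handler.
def handle_event([:phoenix, :endpoint, :stop], _measurements, %{conn: conn}, %{tracer: tracer}) do
  # AddContext: tag the top span with request details before closing anything.
  tracer.update_top_span(
    resource: "#{conn.method} #{conn.request_path}",
    http: [method: conn.method, status_code: conn.status, url: conn.request_path],
    type: :web
  )

  # FinishTrace: close the whole trace once the metadata is in place.
  tracer.finish_trace()
end
```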
@aselder thanks for the kick in the butt, i'm fixing this and will have a PR by day's end
In the meantime, what can a new library user (me!) do to get this working on Phx 1.5?
@bessey you could manually add the start/end trace calls to your endpoint's telemetry. we still don't know if the endpoint's "stop" event is called when phx's "exception" event happens though
@bessey actually, if you could take the branch from #31 for a spin in your system and see if it works or not, we can iterate on whether error traces need to close in phx exception vs endpoint stop. either way the code on that branch is as good as it'll be for now, i think
Once we get confirmation on that, I'll merge the fix.
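If anyone wants to settle the stop-vs-exception question empirically, a throwaway probe like this makes it easy to see which events actually fire for a request that raises (just a logging handler; the [:phoenix, :router_dispatch, :exception] event name is my assumption for Phoenix's exception event, adjust to whatever your Phoenix version documents):

```elixir
# Throwaway probe: log which telemetry events fire for a crashing request,
# to see whether [:phoenix, :endpoint, :stop] still runs after an exception.
:telemetry.attach_many(
  "debug-endpoint-events",
  [
    [:phoenix, :endpoint, :start],
    [:phoenix, :endpoint, :stop],
    [:phoenix, :router_dispatch, :exception]
  ],
  fn event, _measurements, _metadata, _config ->
    IO.inspect(event, label: "telemetry event")
  end,
  nil
)
```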
Just for others' reference, 1.0.1 is out and fixes this: https://github.com/spandex-project/spandex_phoenix/blob/master/CHANGELOG.md#v101-2020-07-12
Thanks @novaugust / @zachdaniel! Sorry I wasn't around on the weekend to support
@bessey n/w, weekends are for fun (i finally managed to make this a work priority, and flexed a weekday to work the weekend ;))
btw, possible issues that could still be in there:
It looks like filter_traces is just not hooked up: https://github.com/spandex-project/spandex_phoenix/blob/b4f1a46cb4367e2f9a4c27551759acc597fc4252/lib/spandex_phoenix/telemetry.ex#L138
That's not an option that start_trace accepts or anything, so that function will need to be run there and checked before we start the trace.
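for whoever wires that up, roughly what "run there and checked" could look like in the start handler (a sketch; filter_traces is the user-supplied conn -> boolean option, and `state` stands for whatever config the handler was attached with):

```elixir
# Sketch: honor the filter_traces option before opening a trace at all.
def handle_event([:phoenix, :endpoint, :start], _measurements, %{conn: conn}, state) do
  if state.filter_traces.(conn) do
    state.tracer.start_trace("request")
  end

  :ok
end
```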
As for errors not being reported, nothing jumps out at me.
> It looks like filter_traces is just not hooked up:

Wow. Yeah. That would explain it. Too many revisions, accidentally dropped it at some point. Sigh. PR incoming
Before I close this though, someone should confirm that errors are still being reported.
@aselder / @novaugust
confirmed bug in error handling code, PR is up. still testing if it's fixed or not
The bug in the error-handling code above is fixed, but I still don't see error traces. At the same time, I no longer see the handlers getting unattached, so it's presumably just some flaw in the logic. Will keep thinking on it as time goes on. Others' contributions are welcome.
I tried to upgrade the spandex_phoenix integration for a 1.5.4 project today and found that error traces don't seem to be handled correctly; specifically, the top span (`request`) does not have any of the relevant tags set (and no resource name either), and non-2xx response codes don't seem to be treated as errors anymore.

I think the top span might be missing tags because the `exception` handler currently calls `tracer.finish_trace` before `finish_trace` and `customize_metadata` can be called...
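If that ordering theory is right, the fix would presumably be to record the error and the metadata while the span is still open and only then close the trace; a rough sketch (the handler shape and metadata keys here are guesses, only the tracer calls are Spandex's actual API):

```elixir
# Suspected correct ordering for the exception path: mark the error and apply
# the request metadata first, close the trace last.
# Metadata keys differ between Phoenix versions, so they're fetched defensively.
def handle_exception(_event, _measurements, exception_metadata, %{tracer: tracer}) do
  reason = exception_metadata[:reason] || exception_metadata[:error]
  stacktrace = exception_metadata[:stacktrace] || []

  # 1. mark the (still open) top span as errored
  tracer.span_error(reason, stacktrace)

  # 2. apply the usual metadata (resource name, http tags, ...) to the top span
  if conn = exception_metadata[:conn] do
    tracer.update_top_span(resource: "#{conn.method} #{conn.request_path}")
  end

  # 3. only now close the whole trace
  tracer.finish_trace()
end
```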
I'm so strapped for time preparing for elixirconf that I really don't have time to look into this, but anyone who wants to is welcome. I will review/advise.
Another possible issue I noticed after rolling this out to our services (with the change in #37) is that the reported latency changed by up to 1ms, without any other changes introduced. This is definitely just a span-level reporting issue, since the latency reported at our ingress level was not affected at all.
I'm currently not sure where this is coming from; one potential source could be timing issues, e.g. use of the "current time" instead of the measurements from the respective telemetry events?
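For reference on that timing theory: the endpoint's `:stop` event already carries the measured duration in native units, so span timing derived from the event's measurements should match what the endpoint itself saw, while reading the clock again inside the handler adds whatever time the dispatch took. A small sketch of reading that measurement (standard `:telemetry`/`System` usage; how it would feed into the tracer is left open):

```elixir
# The :stop measurements contain the duration Plug.Telemetry measured itself.
def handle_event([:phoenix, :endpoint, :stop], measurements, _metadata, _config) do
  duration_ms = System.convert_time_unit(measurements.duration, :native, :millisecond)

  # Any extra time spent before reading the clock again in a handler would show
  # up as added span latency, which could explain a ~1ms discrepancy.
  IO.inspect(duration_ms, label: "endpoint duration (ms)")
end
```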
i think #37 fixed the last outstanding problems w/ this issue and it can be closed
Nice! Great work everyone. 🚀 Thank you for your contribution! 🚀
Looking through the code, it looks like `start_trace` is only called from the `StartTrace` plug, and that is only initialized inside of `__using__` in `SpandexPhoenix`. So if you remove `use SpandexPhoenix` like the instructions say, a trace will never be started, and everything downstream will do nothing because there is no current trace id.
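In other words, with the current code the `use` line is what installs those plugs, so an endpoint that should produce traces still needs something like this (tracer name assumed; just a sketch of the workaround, not official instructions):

```elixir
defmodule MyAppWeb.Endpoint do
  use Phoenix.Endpoint, otp_app: :my_app

  # This is what installs the StartTrace / AddContext / FinishTrace plugs,
  # so removing it means no trace is ever started.
  use SpandexPhoenix, tracer: MyApp.Tracer

  plug Plug.Telemetry, event_prefix: [:phoenix, :endpoint]
  # ... rest of the generated endpoint ...
end
```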