spandex-project / spandex_phoenix

Phoenix Instrumentation tracer
MIT License
82 stars 29 forks source link

Phoenix 1.5 integration broken? #29

Closed aselder closed 4 years ago

aselder commented 4 years ago

Looking through the code, it only looks like start_trace is called from the StartTrace plug and that is only initialized inside of __using__ in SpandexPhoenix.

So if you remove use SpandexPhoenix like the instructions say, a trace will never be started and everything down stream will do nothing because there is no current trace id.

zachdaniel commented 4 years ago

Yep, looks like you are correct to me. I think we need to hook into [:phoenix, :endpoint, :start] and [:phoenix, :endpoint, :stop], and use those to start traces instead of spans. I've been too busy to do more than slightly manage things here. If you have some time to update the 1.5 integration to that effect that would be awesome.

aselder commented 4 years ago

@zachdaniel I'll give it a shot. Probably won't be until this weekend though.

zachdaniel commented 4 years ago

No rush at all. Thank you for your time!

novaugust commented 4 years ago

derp, i got all rushed on the router side of things and didn't hook into the endpoint's telemetry. you want to start/stop traces with the [:phoenix, :endpoint, :start : stop] events that comes out via the phoenix-installer

# handle these events from your endpoint.ex
  plug Plug.Telemetry, event_prefix: [:phoenix, :endpoint]

from your endpoint.

novaugust commented 4 years ago

and more specifically, the missing code is the behavior from the 3 plugs: StartTrace, AddContext, FinishTrace

novaugust commented 4 years ago

@aselder thanks for the kick in the butt, i'm fixing this and will have a PR by day's end

bessey commented 4 years ago

In the meantime, what can a new library user (me!) do to get this working on Phx 1.5?

novaugust commented 4 years ago

@bessey you could manually add the start/end trace calls to your endpoint's telemetry. we still don't know if the endpoint's "stop" event is called when phx's "exception" event happens though

novaugust commented 4 years ago

@bessey actually, if you could take the branch from #31 for a spin in your system and see if it works or not, we can iterate on whether error traces need to close in phx exception vs endpoint stop. either way the code on that branch is as good as it'll be for now, i think

zachdaniel commented 4 years ago

Once we get confirmation on that, I'll merge the fix.

bessey commented 4 years ago

Just for others reference, 1.01 is out and fixes this https://github.com/spandex-project/spandex_phoenix/blob/master/CHANGELOG.md#v101-2020-07-12

Thanks @novaugust / @zachdaniel! Sorry I wasn't around on the weekend to support

novaugust commented 4 years ago

@bessey n/w, weekends are for fun (i finally managed to make this a work priority, and flexed a weekday to work the weekend ;))

novaugust commented 4 years ago

btw, possible issues that could still be in there:

  1. is filter traces working? i'm suppose to be filtering all OPTIONS requests yet they're appearing, dangit.
  2. are errors being reported? this is where i was most uncertain about things. currently not seeing any in my own apm
zachdaniel commented 4 years ago

It looks like filter traces is just not hooked up: https://github.com/spandex-project/spandex_phoenix/blob/b4f1a46cb4367e2f9a4c27551759acc597fc4252/lib/spandex_phoenix/telemetry.ex#L138

zachdaniel commented 4 years ago

Thats not an option that start_trace accepts or anything, so that function will need to be run there and checked before we start the trace.

zachdaniel commented 4 years ago

As for errors not being reported, nothing jumps out at me.

novaugust commented 4 years ago

It looks like filter traces is just not hooked up:

Wow. Yeah. That would explain it. Too many revisions, accidentally dropped it at ssome point. Sigh. PR incoming

zachdaniel commented 4 years ago

Before I close this though, someone should confirm that errors are still being reported.

zachdaniel commented 4 years ago

@aselder / @novaugust

novaugust commented 4 years ago

confirmed bug in error handling code, PR is up. still testing if it's fixed or not

novaugust commented 4 years ago

The bug in error code above is fixed, but I still don't see error traces. At the same time, I no longer see handler unattachment, so it's presumably just some flaw in the logic. Will keep thinking on it as time goes on. Others' contributions are welcome

bforchhammer commented 4 years ago

I tried to upgrade the spandex phoenix integration for a 1.5.4 project today and found that error traces don't seem to be handled correctly; specifically the top span (request) does not have any of the relevant tags set (also not the "resource name"), and non-2xx response codes don't seem to be treated as errors anymore.

I think the top span might be missing tags because the exception handler currently calls tracer.finish_trace before finish_trace and customize_metadata can be called...

zachdaniel commented 4 years ago

I'm so strapped for time preparing for elixirconf that I really don't have time to look into this, but anyone who wants to is welcome. I will review/advise.

bforchhammer commented 4 years ago

Another possible issue I noticed after rolling this out to our services (with the change in #37), is that the reported latency changed by up to 1ms, without any other changes introduced. This is definitely just a span-level reporting issue, since the latency reported on our ingress level was not affected at all.

I'm currently not sure where this is coming from; one potential source could be timing issues, and e.g. usage of the "current time" instead of measurements from the respective telemetry events?

novaugust commented 4 years ago

i think #37 fixed the last outstanding problems w/ this issue and it can be closed

zachdaniel commented 4 years ago

Nice! Great work everyone. 🚀 Thank you for your contribution! 🚀