open-telemetry / opentelemetry-erlang-contrib

OpenTelemetry instrumentation for Erlang & Elixir
https://opentelemetry.io
Apache License 2.0
153 stars 112 forks source link

Thousand Island instrumentation #136

Open bryannaegele opened 1 year ago

bryannaegele commented 1 year ago

With Bandit now having full support for being a drop-in cowboy replacement for Phoenix, we'll need an instrumentation library for Thousand Island from its telemetry.

mtrudel commented 1 year ago

Note that thousand island's telemetry model is being overhauled to better accommodate exactly this scenario. See provisional work at https://github.com/mtrudel/thousand_island/pull/27 (which itself is largely blocked by knowing what the golden path is for exactly this integration).

tsloughter commented 1 year ago

I'm not sure how instrumentation of the socket level can actually be correlated with the application level.

Any span created in ThousandIsland will be ignored when the HTTP request's propagated context is extracted.

We'll want to create metrics from these events, of course, but I don't think they can be used for spans.

mtrudel commented 1 year ago

My really early take on this at the HTTP layer is up at https://github.com/mtrudel/bandit/tree/telemetry; generally I was planning to

  1. Have Bandit reach into the underlying ThousandIsland.Socket to get the connection span to use as a parent:
  2. Have Bandit create spans / events as appropriate for each HTTP request & the various phases of a request (header & body reading, response writing, etc).
  3. Put the relevant span into the Plug.Conn struct for a request (likely as a private) and somehow get phoenix to nest its spans within this if present.

Truthfully, I'm not at all confident I'm not overbuilding this. The time lookups alone are pretty expensive to be doing so many of them. Happy to take guidance / advice from folks more versed in telemetry than I.

tsloughter commented 1 year ago

Have Bandit reach into the underlying ThousandIsland.Socket to get the connection span to use as a parent:

Yea, that is an issue since Bandit, or Plug/Phoenix, needs to create a span from a parent that is received in request headers. It'll have its own trace id, so can't have a parent/child relationship with anything from ThousandIsland.

The only option I know of for such a relationship would be that Bandit could add the ThousandIsland span as a linked span if it is what starts the span from the propagated headers.

mtrudel commented 1 year ago

Interesting! Alright, so if I understand correctly:

If that's the case, then I think the only open questions that remain for me are those at https://github.com/mtrudel/thousand_island/pull/27#issuecomment-1340068486. Pending clarity on them, I should be able to land sensible telemetry support in Bandit / TI in short order, which would unblock this. In light of that, do you have any advice for those questions, specifically how I can make your life here as easy as possible?

bryannaegele commented 1 year ago

Hey, Mat. If you have a look at how cowboy/phoenix telemetry/otel integration works, we need something that kicks off and initiates at the socket level and then end at that same level so we get accurate total times. For cowboy + phoenix (unreleased) we're just continuing to add info the a span started in otel cowboy. The same would be true with bandit/thousand island.

I only briefly looked at what was available telemetry wise several weeks ago and only saw thousand island events which is why I started this issue. If it belongs in bandit, totally cool. The key thing we'd all want here is to start and stop a span as closely to or in the socket request handler as possible. For cowboy, there's a connection handler pool where the span gets started and then we're reaching into that process to get pull that trace context into the phoenix process and just extend it.

Basically need the equivalent of otel cowboy for thousand island/bandit

tsloughter commented 1 year ago

Important to note it is started somewhere that has the headers, https://github.com/open-telemetry/opentelemetry-erlang-contrib/blob/main/instrumentation/opentelemetry_cowboy/src/opentelemetry_cowboy.erl#L32 which is why i assume it needs to be in Bandit.

SteffenDE commented 7 months ago

I‘m currently in the process of moving from cowboy to bandit for my phoenix setup. Is there anything that’s blocking this? https://github.com/mtrudel/bandit/blob/main/lib/bandit/telemetry.ex seems like it should include the necessary events.

mtrudel commented 7 months ago

Not on my side. Bandit's telemetry payloads have been solid since 1.0. There are gonna be a few minor changes to metadata soon but nothing substantial

bryannaegele commented 7 months ago

There are some developments here but I have not been able to get a response. I'd like to get things moved forward and build on the work they did since they've used the namespace now rather than having to start from scratch to have a more complete lib.

bryannaegele commented 6 months ago

@mtrudel when looking through the bandit code and the instrumenter here that was written, there's a critical change that we need to get into bandit. Currently, the start event on a request happens prior to the request being read. The full request is not available until the end which makes it too late to extract trace propagation headers and make sampling decisions which must happen at the start of the trace, leaving incomplete context for the rest of the request.

There are a couple of things that we need to change to get parity with the cowboy instrumentation.

  1. The :start telemetry event must happen after the request is read and must provide the entire request as metadata.
  2. You can assign the start time of the request to a variable where the start event currently is and add that to the metadata.
  3. If you want to keep an event firing there, I'd suggest adding an event with :init instead of start.
mtrudel commented 6 months ago

That should be doable! Any other metadata you want in the start event?

bryannaegele commented 6 months ago

I think that should be all that's needed. I feel like that object looked pretty thorough when I perused it.

As a reference point, cowboy shows the bare minimum we'd like to see. https://github.com/open-telemetry/opentelemetry-erlang-contrib/blob/main/instrumentation/opentelemetry_cowboy/src/opentelemetry_cowboy.erl#L31-L50

mtrudel commented 6 months ago

This is actually proving to be a bit more of a plumbing change than I was expecting. I'm planning on landing the refactors in https://github.com/mtrudel/bandit/issues/297 et al during my company hack week later on this month; I'll make sure these changes get landed as part of that work

bryannaegele commented 6 months ago

Thanks for the update. I had seen you started that PR and the refactor looked like it was going to make it more difficult than just sliding down to here

mtrudel commented 6 months ago

The main problem is that the conn struct is what I want to expose; I quite specifically don't want to expose any of Bandit's internal req structures in the metadata. But... the conn only gets built inside the Pipeline module so I can't easily get to it until / unless I tear apart Pipeline, which is part of what I'll be doing in 297 et al.

By the end of that work there's going to be a single consolidated pipeline for HTTP request/responses, and HTTP1/2 are going to be thin behaviour implementations that only know about the lowest levels of transport. All the shared HTTP semantics (including telemetry!) will be shared between them, so there will only be one place to change this sort of thing.

Expect an update early in the week of Mar 25

mtrudel commented 5 months ago

Bandit 1.4.1 just went out the door with support for conns in the metadata of HTTP start events. I think that's all you need from me to get this rolling?

Let me know if anything else is needed!