open-telemetry / opentelemetry-erlang-api

Erlang/Elixir OpenTelemetry API
Apache License 2.0
60 stars 13 forks source link

`end_span` should take the value returned by `start_span` (or part of it) #26

Closed hauleth closed 3 years ago

hauleth commented 4 years ago

Currently to be sure that we end the same span as we started we need to do:

_ = ot_tracer:start_span(Tracer, "name"),
Ctx = ot_tracer:current_ctx(Tracer),
% Code
_ = ot_tracer:end_span(Tracer, Ctx)

This is unintuitive. Instead it should be like:

Ctx = ot_tracer:start_span(Tracer, "name"),
% Code
_ = ot_tracer:end_span(Tracer, Ctx)

Alternatively (to have plain access to the span_ctx for setting additional values):

{TraceCtx, SpanCtx} = ot_tracer:start_span(Tracer, "name"),
% Code
_ = ot_tracer:end_span(Tracer, TraceCtx)

This will additionally simplify the ot_trace:with_span/{3, 4} and allow to make it fast and seamless within the ot_trace instead of dispatching it to the SDK:

with_span(Tracer, Name, Opts, Fun) ->
  {TraceCtx, SpanCtx} = ot_tracer:start_span(Tracer, Name, Opts),
  try Fun(SpanCtx)
  after ot_tracer:end_span(Tracer, TraceCtx).
garthk commented 4 years ago

Yeah, this just tripped me up.

explain """
No, wait. Am I using OT properly? Let's start a span:
"""

loudly do
  previous_span_ctx = :ot_tracer.start_span(tracer, "outer parent", %{parent: :undefined})
  span_ctx = :ot_tracer.start_span(tracer, "benchmark1", %{parent: previous_span_ctx})
end

explain """
... and then start a new one, which we'll pretend is for some code that crashed out without
tidying up:
"""

loudly do
  ^span_ctx = :ot_tracer.current_span_ctx(tracer)
  inner_span_ctx = :ot_tracer.start_span(tracer, "benchmark2", %{parent: span_ctx})
end

explain """
Put back our span_ctx, end_span, and we'll be back at our previous_span_ctx, right?
"""

loudly do
  :ot_tracer.set_span(tracer, span_ctx)
  :ot_tracer.end_span(tracer)

  # desired: I got the span I just ended's parent
  # assert :ot_tracer.current_span_ctx(tracer) == previous_span_ctx

  # not desired: I got the previous argument to some hidden set_span:
  assert :ot_tracer.current_span_ctx(tracer) == inner_span_ctx
end

… TL;DR I need to carry around previous_span_ctx as well as span_ctx. Bummer.

tsloughter commented 4 years ago

We have to decide what the purpose of ending a specific span meets and that should decide what path to take.

If start_span returns a span context and a user passes that to end_span what should happen? Is the current context updated to pop that span from the stack and make the element before it the current span?

What happens if you pass a span that isn't the top of the stack? What happens with the current context?

I think it needs to only be possible if the SpanCtx is for an inactive span and I'm not sure how to enforce that. Maybe start_inactive_span should return the TraceCtx and SpanCtx? Unlike start_span that can return just SpanCtx and not have the ability to end the span outside of using the current context.

Also keep in mind that Processors are kept in the tracer, so the tracer is needed to end a span.

tsloughter commented 4 years ago

@garthk looks like your example is what I was talking about in my last comment about "what happens to the current context and it stack".

You had a span_ctx that is part of the context already, so in the stack, next create another span (also on the stack) and then set the span_ctx from before to the current context with set_span which simply puts another copy of it on the stack.

If you wanted to go back to the previous context state then you'd need the full context and to set it to current context.

I think an important disconnect to document is that the span that becomes current when ending a span is not necessarily the parent of the span you are ending, it is the span on the stack.

garthk commented 4 years ago

Consider hookups via :telemetry, on which I've been experimenting hence the code above. Say we've got cowboy → → Plug → Phoenix → controller code → hackney, and want a span for each. To make it harder, but not unrealistic, let's say the maintainers of all the projects want to use :telemetry as their only point of integration for concerns like ours, and at least one of them emits only :start and :stop.

At run-time, an outbound request returns data in an unexpected shape, we get a function clause exception handling it, and someone further up catches it eg. so they can render a 500 page.

What do we do in our event handlers to ensure we end spans correctly? How do we make it easy enough that developers who aren't as involved with this as we are can write and maintain their own such event handlers?

The stacked context feels clever when the only use of set_span is to push further into the code, but I'm getting nervous about popping back out. I can see how I could reach around the tracer abstraction and use the context directly, but that'll tie my code to this particular tracer.

Is there a formal “get full context” and “put full context” API in the Erlang code I missed because it wasn't wrapped in the Elixir code?

tsloughter commented 4 years ago

Yea, this is why I don't like telemetry for tracing. Works fine for metrics or for adding metadata on the currently active span but it just didn't fit well for the way spans are tracked in the default context of otel.

There is a current_ctx function on the tracer that I don't know if it should exist. There isn't, but probably should be if current_ctx remains, a corresponding set_ctx in the tracer.

And I guess you can't usestart_inactive_span because telemetry can't return anything.

So I don't have a good answer and just think this is a clear example of why telemetry events are a bad abstraction for layering on top of spans and context.

I know there was an event based Go implementation at one point and still might be somewhere in either otel or OC. It only sends events to some backend and doesn't update anything in memory about spans.

I think what would work naturally for telemetry is a similar event based model but since your handlers can't set or return variables you don't really have a way of know what span context was created so you can track it.

tsloughter commented 4 years ago

Oh, but if ending spans is the main concern then maybe the monitor process will solve that for you? Where when you start a span in a telemetry handler you can set it to be monitored and ended if that process goes down.

tsloughter commented 4 years ago

I'd be curious to know how what you are doing looks in other language. My guess is that they all have scoping for their thread locals that makes it possible. We just don't have the option in Erlang. They way to have gotten around this would be to go Go's route of making everyone change their function definitions to accept a context, but that isn't realistic for Erlang, Go is the only one that has done this and the only one with a community I think it is even possible to force such a rule on.

tsloughter commented 4 years ago

Realized that even with scopes you'd potentially be in a similar situation.

You are 3 scopes deep and then set the span from the 2nd scope to the current span doesn't lead to breaking out of those existing scopes but instead creates a new one.

garthk commented 4 years ago

I'm a big fan of proof by induction. If you're in a ridiculous situation, one of your priors is wrong: back up and reconsider.

The problem with reconsidering “use :telemetry to hook up tracing” is that it's our only way to integrate with a number of very popular packages that don't, to me, show any signs of enthusiasm for adding anything else. If we can't use :telemetry to hook up tracing, we're going to lock ourselves out of a chunk of the ecosystem. People are going to stick with the choices they already made.

Backing up further… the stack sounded awesome, but it's turning out to make it hard to ship adapters for packages that use :telemetry. OC doesn't have the stack, works fine, and doesn't have these problems. What's the stack buying us, again? Is it worth the cost?

tsloughter commented 4 years ago

OC does have the stack.. I'm not sure how you are using it that you aren't using the stack?

And :telemetry can still be used for metrics but there might be better integrations for libraries to get traces.

tsloughter commented 4 years ago

Are you using telemetry and OC?

Also, how would handle:

telemetry:execute([some, start]),
telemetry:execute([another, start]),

Being able to have the the second span set the first span as its parent?

garthk commented 4 years ago

OC has the stack? Oh, that might explain some weird traces I’ve seen. I’ve been copying the code from ocp:with_child_span/3 basically and adapting to each scenario. Never occurred to me with_span_ctx/1 might not perfectly put back a previous value of current_span_ctx/0.

I’ve been meaning to apply property based testing to this, planning and executing a random call stack including exceptions and checking the trace I get out matches the plan. Might need to.

garthk commented 4 years ago

Back to telemetry (sorry on my phone can’t see it all at once) I’m still using OC’s process dictionary and nesting seems to be working as long as it’s all within the same process but I’m worried about exceptions now. And, yeah, I prefer other integration methods — I just did something fun with defoverridable/1 for Phoenix LiveView for example — it’s just getting the right data often involves being called from someone else’s code.

tsloughter commented 4 years ago

So you aren't using telemetry with OC, right?

garthk commented 4 years ago

With Ecto, yes, but there's no start/stop so I create the span record and call oc_reporter:store_span/1 myself.

My OT telemetry experiment was winding spans up with oc_trace:finish_span/1 and ocp:with_span_ctx/1… an outer span calling ocp:finish_span/0 afterwards will end up with a weird result, no?

garthk commented 4 years ago

I've spent some time with StreamData and ExUnitProperties, and I'm understanding how the code is behaving a little better. I'm still having trouble figuring out the intent behind that behaviour.

The name had me convinced :ot_tracer.set_span/2 is idempotent. Instead, :ot_tracer.start_span/3 followed by :ot_tracer.set_span/2 pushes twice so when I call :ot_tracer.end_span/1 the span context doesn't change.

deps/opentelemetry_api/src/ot_tracer.erl declares end_span/2 takes a tracer context:

-spec end_span(opentelemetry:tracer(), ot_tracer:tracer_ctx()) -> boolean() | {error, term()}.

… but deps/opentelemetry/src/ot_tracer_default.erl declares it takes a span context:

-spec end_span(opentelemetry:tracer(), opentelemetry:span_ctx()) -> boolean() | {error, term()}.

I'm still confused, from the perspective of a package maintainer or app developer, of why I can (mostly) control the span context but I can only read the tracer context.

Combined, it's very tricky: I've got an end_span/1 that ends the current span according to the context and restores its previous context, which is great unless someone already pushed more than they popped. I can't put the correct context back myself so end_span/1 does what I need it to do, and I'm not entirely clear on what end_span/2 ends.

tsloughter commented 4 years ago

That end_span in the SDK is not to be used by end users at all. It is used by the sweeper and monitor process that ends spans outside of their context.

tsloughter commented 4 years ago

We could consider a name change if it is too confusing in the Erlang/Elixir setting to use set_span but instead use something like add_span. Maybe it helps to think of it like a scope, which is how other languages are able to implement it but we aren't, here is an attempt at pseudo code:

try Span1 = tracer.start_span("span-1") {
  // span-1 is the current active span
  try tracer.start_span("span-2") {
    // span-2 is the current active span
    try tracer.with_span(Span1) {
      // span-1 is current active span
    } // ends span-1

    // span-2 is again the current active span 
    ...
  } // ends span-2
} // tries to end span-1 but it was already ended

The only option I can think of to have a similar effect in Erlang is to require a function and not have any support for starting or setting the current span without passing a function to be called, but it would be the same result when ending a span after having used set_span

ot_tracer:start_span("span-1",
                     fun() ->
                             %% span-1 is current active span
                             Span1 = ot_tracer:current_span_ctx(),
                             ot_tracer:start_span("span-2",
                                                  fun() ->
                                                          %% span-2 is current active span
                                                          ot_tracer:with_span(Span1,
                                                                             fun() ->
                                                                                     %% span-1 is current active span
                                                                             end) %% span-1 is ended
                                                          %% span-2 is active again
                                                  end) %% span-2 is ended
                             %% span-1 is active again
                     end). %% attempts to end span-1 but finds it is already ended