spandex-project / spandex

A platform agnostic tracing library
MIT License
335 stars 53 forks source link

Pass Trace between process boundaries #102

Closed redpublic closed 4 years ago

redpublic commented 5 years ago

What's is the proper way to pass Trace between Elixir processes? I don't see public methods in Tracer interface how to obtain current Trace and Spandex module uses strategy.get_trace for start_span/update_span. Either I'm missing it or maybe you have some plans to implement it in future.

zachdaniel commented 5 years ago

You can use current_trace and current_span to get the ongoing trace/span, and then you pass them manually across process boundaries. There, you can continue it with continue_trace_from_span or continue_trace.

GregMefford commented 5 years ago

The Spandex API around this is a bit rough (I hope we can make it smoother as we transition to a converged OpenTelemetry API), but I described an example of wrapping around Elixir's Task.async in a recent presentation I gave at Code BEAM STO

Here's the meat of the slide in question:

image

redpublic commented 5 years ago

@zachdaniel I don't see current_trace in the sources, but it's probably possible to implement it in a similar way as current_span

redpublic commented 5 years ago

@GregMefford yeah, I thought about it, but in my case I am starting new process via Supervisor.start_child/2 and also wouldn't like to create new span, only pass Trace for later usage.

redpublic commented 5 years ago

Also I see that there is no easy way to start/finish span in a some child process and get updated Trace struct in parent process. The only option is to change public API of those child processes and pass Trace struct in call/cast. In Tapper lib such kind of functionality is implemented via GenServer for every trace. Here I've seen talks regarding ETS strategy which could help with passing Trace struct between process boundaries, but GenServer implementation could be more efficient as an ETS strategy could become a bottleneck at some point. Any thoughts?

zachdaniel commented 5 years ago

Ah, you are correct there is only current_span and current_trace_*id*. Are you using spandex_datadog? If so, the idea is that child process don't need to update the parent trace. They just send the spans that were generated on their end when they are done. Datadog stitches them together for you later.

zachdaniel commented 5 years ago

So it is only problematic if you need to update some metadata about the parent span based on something happening in the spawned process.

redpublic commented 5 years ago

I see that sending only occurs on finish_trace. For finish_span there is only update Trace according configured strategy, pdict by default

zachdaniel commented 5 years ago

Ah, yeah this is a flaw in our documentation. You should use finish_trace in the child process. It won't technically finish the entire trace, just the version of the trace that exists in the child process. When the parent process finishes its trace with finish_trace, it will be sent to Datadog and Datadog will use the parent span id from the child trace to tie the two together.

zachdaniel commented 5 years ago

Reopening because this, at the very least, highlights some flaws in our documentation. I've been so insanely busy recently, so I really haven't had much chance to work on this stuff.

redpublic commented 5 years ago

I'm trying to understand either this is a problem with documentation or finish_span should have slightly different implementation?

zachdaniel commented 5 years ago

What would we change about finish_span? We can't send the trace from a call to finish_span because they could be intending to start a new span. Starting a trace starts a top level span (a Datadog requirement), so it is not possible to know that the trace is complete without a call to finish_trace

GregMefford commented 5 years ago

To clarify an assumption: we send all the spans in a given trace from a particular Erlang process when finish_trace is called. So when you cross a process boundary, you need to start a new trace that has the same ID as the parent trace (which is what continue_trace does). When the child process finishes its work, it calls finish_trace, which sends up all the spans from that process, but the parent process could have already finished or will finish after the child process, independently. When the parent finishes, it calls its own finish_trace to send up the spans that its process was responsible for.

Datadog stitches things all back together, as if it had been a trace across several micro-services.

dmccraw commented 4 years ago

Thanks for this explanation. After many hours of trying to get new spans created inside a thread I was now understand why your examples use continue_trace and finish_trace instead of the span functions.

GregMefford commented 4 years ago

Ah, glad it was helpful and sorry for the confusion and frustration you had to go through! 😅

We are experimenting with some more convenient options for automatic propagation in OpenTelemetry using seq_trace, which will hopefully eliminates some of the confusion around this in the future. 🚀

dmccraw commented 4 years ago

I was able to get it working using code that looks like this

  def traced_async(module, function_name, args) do
    caseTracer.current_context() do
      {:ok, tracer_ctx} ->
        Task.async(fn ->
          Tracer.continue_trace("#{module}.#{function_name}", tracer_ctx)

          try do
            apply(module, function_name, args)
          after
           Tracer.finish_trace()
          end
        end)

      {:error, _reason} ->
        Task.async(fn ->
          apply(module, function_name, args)
        end)
    end
  end

which generates beautiful flame graphs that look like this: Screen Shot 2019-12-19 at 9 50 21 AM

Thanks so much for all your hard work on Spandex. It has made the argument for adopting Elixir at my company so much easier because of the observability it provides!

zachdaniel commented 4 years ago

That looks like the right call :)