Closed redpublic closed 4 years ago
You can use current_trace
and current_span
to get the ongoing trace/span, and then you pass them manually across process boundaries. There, you can continue it with continue_trace_from_span
or continue_trace
.
The Spandex API around this is a bit rough (I hope we can make it smoother as we transition to a converged OpenTelemetry
API), but I described an example of wrapping around Elixir's Task.async
in a recent presentation I gave at Code BEAM STO
Here's the meat of the slide in question:
@zachdaniel
I don't see current_trace
in the sources, but it's probably possible to implement it in a similar way as current_span
@GregMefford yeah, I thought about it, but in my case I am starting new process via Supervisor.start_child/2
and also wouldn't like to create new span, only pass Trace
for later usage.
Also I see that there is no easy way to start/finish span in a some child process and get updated Trace
struct in parent process. The only option is to change public API of those child processes and pass Trace
struct in call/cast. In Tapper lib such kind of functionality is implemented via GenServer
for every trace. Here I've seen talks regarding ETS strategy which could help with passing Trace
struct between process boundaries, but GenServer
implementation could be more efficient as an ETS strategy could become a bottleneck at some point. Any thoughts?
Ah, you are correct there is only current_span
and current_trace_*id*
. Are you using spandex_datadog
? If so, the idea is that child process don't need to update the parent trace. They just send the spans that were generated on their end when they are done. Datadog stitches them together for you later.
So it is only problematic if you need to update some metadata about the parent span based on something happening in the spawned process.
I see that sending only occurs on finish_trace. For finish_span there is only update Trace
according configured strategy, pdict by default
Ah, yeah this is a flaw in our documentation. You should use finish_trace
in the child process. It won't technically finish the entire trace, just the version of the trace that exists in the child process. When the parent process finishes its trace with finish_trace
, it will be sent to Datadog and Datadog will use the parent span id from the child trace to tie the two together.
Reopening because this, at the very least, highlights some flaws in our documentation. I've been so insanely busy recently, so I really haven't had much chance to work on this stuff.
I'm trying to understand either this is a problem with documentation or finish_span
should have slightly different implementation?
What would we change about finish_span
? We can't send the trace from a call to finish_span
because they could be intending to start a new span. Starting a trace starts a top level span (a Datadog requirement), so it is not possible to know that the trace is complete without a call to finish_trace
To clarify an assumption: we send all the spans in a given trace from a particular Erlang process when finish_trace
is called. So when you cross a process boundary, you need to start a new trace that has the same ID as the parent trace (which is what continue_trace
does). When the child process finishes its work, it calls finish_trace
, which sends up all the spans from that process, but the parent process could have already finished or will finish after the child process, independently. When the parent finishes, it calls its own finish_trace
to send up the spans that its process was responsible for.
Datadog stitches things all back together, as if it had been a trace across several micro-services.
Thanks for this explanation. After many hours of trying to get new spans created inside a thread I was now understand why your examples use continue_trace
and finish_trace
instead of the span functions.
Ah, glad it was helpful and sorry for the confusion and frustration you had to go through! 😅
We are experimenting with some more convenient options for automatic propagation in OpenTelemetry using seq_trace
, which will hopefully eliminates some of the confusion around this in the future. 🚀
I was able to get it working using code that looks like this
def traced_async(module, function_name, args) do
caseTracer.current_context() do
{:ok, tracer_ctx} ->
Task.async(fn ->
Tracer.continue_trace("#{module}.#{function_name}", tracer_ctx)
try do
apply(module, function_name, args)
after
Tracer.finish_trace()
end
end)
{:error, _reason} ->
Task.async(fn ->
apply(module, function_name, args)
end)
end
end
which generates beautiful flame graphs that look like this:
Thanks so much for all your hard work on Spandex. It has made the argument for adopting Elixir at my company so much easier because of the observability it provides!
That looks like the right call :)
What's is the proper way to pass Trace between Elixir processes? I don't see public methods in Tracer interface how to obtain current Trace and Spandex module uses strategy.get_trace for start_span/update_span. Either I'm missing it or maybe you have some plans to implement it in future.