spandex-project / spandex

A platform agnostic tracing library
MIT License

clarity around spans with spawned processes #35

Closed jeffrom closed 6 years ago

jeffrom commented 6 years ago

Hello,

Sorry for the general nature of this question. I'm having some trouble understanding how traces work in the context of many child processes. I understand that I can use Spandex.get_current_span and Spandex.continue_trace to copy the relevant data into the child process's dictionary, but I'm not sure whether the continued trace should have a new name. My high-level problem is that we're running a large GraphQL API (absinthe) where some of the field resolvers run asynchronously via wrapper functions and some do not, and we'd like to use Spandex to make a flame graph where each span represents one resolver. Do you have any suggestions or thoughts about this? Does it make sense to add functions that just copy the process dictionary into a child process without requiring a name?

zachdaniel commented 6 years ago

Hello @jeffrom :). Quick note to start: I've just reverted a bunch of changes from master that were attempting to add a Zipkin adapter, as those changes were unstable and only used in a release candidate. They should not be used, but they are available in the add-zipkin-support branch. So the latest stable (non-RC) release on hex should line up with the code here.

For your issue: I'm pretty certain that you would not want to copy the process dictionary from one process to another, even if you were only copying the Spandex trace part. That would result in double-sending a significant number of spans. Keep in mind that Spandex.continue_trace just starts a new span with the provided name; it doesn't actually start a new trace. So if the desired behavior is one span per resolver, then just calling Spandex.continue_trace("desired_resolver_span_name", trace_id, span_id) would work fine. Datadog supports sending two lists of spans to their API where they all share the same trace_id; they will be correlated into the same trace in the end. (You may see some small discrepancies in the UI until they have been correlated.)
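To make that concrete, here is a minimal sketch of the approach described above for an async resolver wrapper. This assumes `Spandex.get_current_span/0` returns a struct containing `trace_id` and `span_id` fields, and that `Spandex.continue_trace/3` takes a name plus those two ids, as discussed in this thread; the function and span names below are illustrative, not part of the Spandex API.

```elixir
# Hypothetical wrapper: capture the trace context in the parent process,
# then continue the trace inside the spawned task so the child's span is
# correlated with the parent trace (rather than copying the process
# dictionary wholesale, which could double-send spans).
def resolve_async(resolver_fun) do
  # Read ids in the PARENT process, before spawning.
  %{trace_id: trace_id, span_id: span_id} = Spandex.get_current_span()

  Task.async(fn ->
    # In the CHILD process: start a new span tied to the parent trace.
    Spandex.continue_trace("my_resolver", trace_id, span_id)
    resolver_fun.()
  end)
end
```

The key point is that only the two ids cross the process boundary; each child then opens its own span, which Datadog later correlates into a single trace.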

A few other things to keep in mind: it may not be a sustainable strategy to send that much trace data to Datadog. I'm very happy to accept performance improvements to the library in its current state. No real benchmarking has been done to determine the upper bounds on performance, so make sure that you run load tests on your API before sending a span per resolver to Datadog. I would not be surprised if Spandex ended up being a bottleneck if you end up sending hundreds of spans per API request. It all depends on your traffic and how many resolvers you have.

Probably don't use Spandex.Logger or Spandex.Task. They have proven to be less than useful, and just took up space in our traces where we would rather have included explicit trace/span data. I'd like to remove them in future releases of Spandex.

Please, if there are changes you'd like to see in the library, bring them up and we will work on them together.

jeffrom commented 6 years ago

Great, that answers all my questions, and some I forgot to ask! Thanks for that, and for your great work on this library.