carlosalberto commented 6 years ago

Multithreading Span usage.

Current State: Draft Author: carlosalberto

In the OpenTracing specification, there is no mention of Span behavior under multithreaded applications, nor clarification of any thread-safety requirement. This proposal is thus intended to be explicit about this specific scenario.

Background

Different tracing systems have historically taken different decisions regarding this, with many of them making Spans (and their operations) thread-safe. Yet, it has been shown lately that a few frameworks have decided to not make Spans thread-safe, and instead provide their own ways to handle this scenario. This could in the near future create confusion among both users and tracing authors, as well as incompatibilities among implementations.

Specification Changes

The Span interface adds one or more paragraph detailing the expected behavior of the objects under multithreading scenarios, specially when a single operation may take place through different threads (maybe using a thread pool). Span objects should then either:

Being thread-safe in all its operations (logs, tags, baggage).
Guaranteed to work fine and correctly under the described scenarios.

Use Cases

Application using a thread pool

Multithreaded applications using a thread pool, with the user creating a Span for a given operation, which would be executed as a series of callbacks, each one running on a different, unknown thread.

For many frameworks, this could be approached with the Span being passed between the threads/callbacks during its execution, to add logs, set tags or modify its baggage.

Risk Assessment

The following risks have been identified:

This change may affect tracing systems that have taken for granted that there is no need to provide any form of thread-safety, as they would need to update or change its own API or code.
The clarification may stay too vague or be insufficient in the long term, regarding the frameworks that decide to not provide thread-safety. Thus a proper, clear addition needs to happen for this case.

yurishkuro commented 6 years ago

@wu-sheng It was my understanding that the skywalking agent code is not using OpenTracing, it is coded directly against skyWalking spans. The only part that is OpenTracing compatible is if users interact with the spans directly from the application, via Tracer. Would that not provide sufficient indication that you need to expose a thread-safe version of the span if accessed via OT API, while the internal API used by the agent can be single-threaded?

wu-sheng commented 6 years ago

It was my understanding that the skywalking agent code is not using OpenTracing, it is coded directly against skyWalking spans. The only part that is OpenTracing compatible is if users interact with the spans directly from the application, via Tracer. Would that not provide sufficient indication that you need to expose a thread-safe version of the span if accessed via OT API, while the internal API used by the agent can be single-threaded?

@yurishkuro Yes, for the bridge, we can do that. But, my concern is more about that. You could just consider this as the disadvantage of auto-instrumentation agents.

Java program also runs in a single thread(exclude distributed), except for thread pool or job assignment modules, so nearly all Java trace agent(.NET likely too), store context in Thread(ThreadLocal for Java), and also maintain an active span stack in context for help the plugin(instrumentation) developer easier to process parent/child relationship. New span's parent is also the last active span.

Such as: A->B->C, when C created, B is its parent span as auto and default. If you finish C, then create D, then D is also B's child too. All these are automatic, that is core of auto instrumentation for many languages.

So you can find out, when span(like C) can cross thread, the stack structure fails, because, you can't guarantee the C has been finished before D created, but you may expect that.

That is a deep conflict, and also why I argue the across thread span all the time.

Although, I think you understand, SkyWalking always did instrumentation in auto way, so we don't relied on OT instrumentation very much. If we consider a limited support, such as providing a way to create some span manually, then SkyWalking can say, we only support in that way.

yurishkuro commented 6 years ago

ScopeManager also keeps spans in thread local in a stack-like manner.

@carlosalberto I'm curious, with the async frameworks using ScopeManager, has anyone looked at what happens with that stack? In Jaeger's previous context manager impl we had a clear() method that would always be called by the server endpoint, but ScopeManager doesn't have it.

felixbarny commented 6 years ago

Could you elaborate on the "Application using a thread pool" use case? Why don't you have one Span per thread/callback?

Ensuring visibility across threads is, of course, important. Especially when the spans are reported asynchronously in a different thread than in the one they have been recorded. But that can be assured using cheaper mechanisms, like memory barriers. Not quite sure why requiring full thread safety is needed. Did you do some benchmarks to get a feel for the costs of this? This should go into the risk assessment IMHO.

wu-sheng commented 6 years ago

Why don't you have one Span per thread/callback?

SkyWalking did this too.

Did you do some benchmarks to get a feel for the costs of this?

Please consider the benchmark scenario should be collecting more than 6000 calls per second, each call has at least 15 spans. That is at least what SkyWalking need and promise to support in product use.

That is also my performance concern comes.

carlosalberto commented 6 years ago

Why don't you have one Span per thread/callback?

Sure, this is the route taking by some applications or tracing designs (a new Span per thread) - but some other times you will end up creating a Span in a thread, pass it to next thread (maybe through a callback), one at a time, and finish it when the last point is reached.

So the point of this PR is precisely to get this discussion going, and get some feedback ;)

Please consider the benchmark scenario should be collecting more than 6000 calls per second, each call has at least 15 spans.

I know other tracing systems (Jaeger, LightStep for a start) do thread-safe Spans. Wondering about the performance scenarios (if any, and if so, which ones, etc).

carlosalberto commented 6 years ago

Hey @yurishkuro

with the async frameworks using ScopeManager, has anyone looked at what happens with that stack? In Jaeger's previous context manager impl we had a clear() method that would always be called by the server endpoint

Oh, there was some basic experimentation with Akka (from my side at least), which ended up me being re-introducing the AutoFinishScopeManager intoopentracing-util. That being said, maybe it's time to re-check it (the clear() method would make lots of sense to me, btw).

yurishkuro commented 6 years ago

@carlosalberto would be useful to run Akka etc. examples under profiler to check for memory leaks, because as far as I can tell a naive way of starting non-auto-finish span/scope might leave the scope on the stack forever.

carlosalberto commented 6 years ago

Hey all!

We would like to have a Cross-Language group call on July 13th (next week) and discuss this item as part of it. Please let me know if you guys can make it. I'm specially interested, if possible, in having @wu-sheng and @felixbarny if possible ;)

We will also be probably discussing it over Gitter this week, prior to the call ;)

tedsuo commented 6 years ago

@felixbarny I agree that "Thread safety" should be clarified.

In this case, it should just mean "safety," in the sense that it exceptions will not occur, memory will not be corrupted, and functionality will continue to work. There is no need to guarantee any ordering of operations between threads, etc. In practice there could be more efficient safety guarantees than locks.

opentracing / specification

Add clarification for Span under multithreading scenarios. #120

Multithreading Span usage.

Background

Specification Changes

Use Cases

Application using a thread pool

Risk Assessment