Unclear lifecycle of spans

felixbarny commented 5 years ago

The way the ScopeManager is currently defined makes the lifecycle of Span objects unclear. This complicates span object recycling aka object pooling.

Recycling objects is a way to reduce the allocation overhead by maintaining a pool of pre-allocated Span instances. Instead of instantiating a new span every time a span is started, a span is taken out of the object pool. When the Span is finished and has been reported out of process, the span instance is reset and put back into the object pool. Using this technique, it is possible to add almost no allocation overhead to the monitored application which means less GC overhead caused by the tracer.

IMO, the lifecycle of a Span should be very straightforward. When finish() is called, no other interactions should be allowed so it is safe to recycle and reuse the whole span instance.

There are two APIs which break this assumption.

Span.context

Quote from The spec of Span.finish:

With the exception of the method to retrieve the Span's SpanContext, none of the below may be called after the Span is finished.

IMO, the SpanContext should be retrieved before the span is finished if it is intended to be used after the span is finished. The implementations are supposed to return an object which is not tied to the span’s lifecycle, for example making a copy of the internal span context.

ScopeManager

Even in the current version of the spec, it “forbids” calls after finish():

With the exception of the method to retrieve the Span's SpanContext, none of the below may be called after the Span is finished.

Scoping a span in a different thread than the one it is finished from, indirectly breaks this requirement. In the second thread, there is no way to know whether the activeSpan() has already finished, as it has no control over the lifecycle of that span. So calling tracer.activeSpan().addTag("foo", "bar") is never legal if the active span is finished in a different thread. The Span reference returned by tracer.activeSpan() might already be recycled and used to record an entirely different operation.

Example scenario:

Thread A	Thread B
Span span1 = startSpan()
Scope scopeSpan1 = scopeManager.activate(span1)
Executor.sumbit(Runnable)	Runnable.run
	Scope scopeSpan1 = scopeManager.activate(span1)
scopeSpan1.close()
span1.finish()
	tracer.activeSpan().addTag("foo", "bar")	Is this span finished or not? The caller has no way to know that. Worst case is that it now represents an entirely different operation.
	Span span2 = startSpan()
	span2.finish()
	scopeSpan1.finish()

For Java, the damage is probably not as big unless the tracer implements object pooling. I don’t have a lot of experience with C++, but don’t you have to exactly know the lifecycle of spans in order to be able to deallocate the span’s memory?

Suggestion

Change

With the exception of the method to retrieve the Span's SpanContext, none of the below may be called after the Span is finished.

To

After the Span is finished no other interactions are allowed, including activating the Span via the scope manager and retrieving the Span's SpanContext.

Add method Scope activate(SpanContext spanContext) to ScopeManager
- Scope.span() returns null in this case.
Add method SpanContext spanContext() to Scope
- In case Scope#activate(Span) has been called, acts as a shortcut for Scope.span().context()

For SpanBuilders which did not call SpanBuilder#ignoreActiveSpan, an implicit child_of reference will be created for ScopeManager.active().spanContext() contains a SpanContext. If ScopeManager.active().span() is not null, this is the preferred way to set the reference.

//cc @adriancole @raphw

felixbarny commented 5 years ago

Even as it's current state, the spec indirectly forbids scoping a span where you don't have control over the lifecycle of the span. Because calls to a finished span are not allowed/undefined (=may blow up the application). But the API makes no restrictions for scopeManager.active(), which makes a promise to that calling methods like setTag on the returned span instance is safe.

Which means that code like this is not legal: https://github.com/opentracing-contrib/java-concurrent/blob/83d9aaa6ea4a8764fefc3809cadff539366aabd8/src/main/java/io/opentracing/contrib/concurrent/TracedRunnable.java#L24

//cc @pavolloffay

felixbarny commented 5 years ago

I have implemented a proof of concept for this: https://github.com/opentracing/opentracing-java/pull/319

tedsuo commented 5 years ago

Hi @felixbarny. I agree that object pooling (and low overhead/GC-avoidance in general) is a worthy goal to pursue. But I'm concerned about safety. Right now, the intention in the spec is to say that calls after finish have an undefined behavior in the sense that they may be inoperative. But that's a bit different from saying calls after finish are unsafe. Generally speaking, instrumentation code should be safe.

For races occurring between threads, I think it's fine to say that ordering is undefined – the user should not be interacting with spans in multiple threads simultaneously, and definitely should not be finishing the same span in multiple threads. The pseudocode in the table you posted would be an example of this. But the fallout for poorly written instrumentation should be poor data, not an unsafe program.

My concern is a little different from whether calling SpanContext after finish should be allowed, btw. I'm just wondering if object pooling and other usecases for making these spec changes would result in unsafe instrumentation code (as opposed to code that records bad data).

felixbarny commented 5 years ago

But the fallout for poorly written instrumentation should be poor data, not an unsafe program.

Not sure how you define unsafe in this context but the worst that can happen when calling methods after finish on a pooled span object is that for example 5 min later (when the pool wraps) a span contains a tag which was supposed to be set on a different span.

and definitely should not be finishing the same span in multiple threads

ATM, I think this is not defined anywhere. Why is it allowed at all to call finish multiple times?

sjoerdtalsma commented 5 years ago

ATM, I think this is not defined anywhere. Why is it allowed at all to call finish multiple times?

Undefined != forbidden. The spec says calls after finish have undefined behaviour, not calls after finish will throw exceptions (or break your program in other ways)

felixbarny commented 5 years ago

No, finish() is defined to behave as a noop when executed multiple times:

With the exception of calls to {@link #context}, this should be the last call made to the span instance. Future calls to {@link #finish} are defined as noops, and future calls to methods other than {@link #context} lead to undefined behavior.

That is in contrast to other span methods, like addTag. And I'm wondering why that is different 🤔.

I think it would be great to not only have the semanic_specification.md but also a separate file which explains the rationale behind the decisions.

The spec says calls after finish have undefined behaviour, not calls after finish will throw exceptions (or break your program in other ways)

Throwing an exception is arguably a valid thing if the behavior is undefined. But of course that would be a silly thing to do in a tracer implementation. Again - object pooling does not break the application in any way it could just lead to bad data.

opentracing / opentracing-java