opentracing / specification

A place to document (and discuss) the OpenTracing specification. 🛑 This project is DEPRECATED! https://github.com/opentracing/specification/issues/163
http://opentracing.io/spec
Apache License 2.0
1.18k stars 182 forks source link

Scope Manager should use continuation passing #126

Open pauldraper opened 6 years ago

pauldraper commented 6 years ago

https://github.com/opentracing/specification/blob/master/rfc/scope_manager.md

Problems

1. Scope Manager is hard to understand and easy to misuse

  1. Must I always close an activated scope?
  2. If the thread exits, is the scope deactivated (and the span closed)?
  3. What happens if I close a scope twice? (Spec answers: undefined)
  4. Can I close a scope from a different thread than I created it?
  5. Are previous scopes automatically restored? If so, what do overlapping scope calls mean?
scope1 = scope_manager.activate(span, false)
scope2 = scope_manager.activate(span, false)
scope1.close()
scope2.close()

APIs should be easy to understand and hard to misuse. This is neither. opentracing-java discussion has been a barrage of variations on this idea with edge cases, competing priorities, awkward APIs, unintuitive descriptions, and general confusion. When the conversation is muddled, it's a sign the best answer lies along a different direction.

2. Scope Manager causes memory leaks where there were none before

Last I checked, Java auto-reactivates the last active span. This was a problem with opentracing-java and caused crashes at Lucid Software in a couple messy parts of the code.

void main(ExecutorService executor) {
  executor.submit(new Runnable {
    public void run() {
      Thread.sleep(1);
      main(executor);
    }
  })
}

// okay
main(Executors.newSingleTheadExecutor());
// memory leak
main(tracingExecutor(Executors.newSingleTheadExecutor()));

Unbounded async stacks are not a new problem, but (for Java, at least), this is an entirely avoidable problem for Scope Manager.

3. In continuation-based languages, it extremely hard to understand

I've spent the past several days on implementations of the RFC for JavaScript, with both Node.js async_hooks and Zone.js.

  1. How long does the scope last, and when should it be closed?
const scope = scopeManager.activate(span, false);
const promise = shouldHaveActiveSpan1();
shouldHaveActiveSpan2();
// A
await promise;
// B
shouldNotHaveActiveSpan();

i. Does the scope last until A? Does it automatically end, or do I have to close it?

ii. Or does it last until through B and needs to be closed there?

iii. Or the scope last until A and another second scope is implicitly created at B and that second scope needs to be closed?

I believe (iii) is the "sensible" solution .

const scope = scopeManager.activate(span, false);
const promise = shouldHaveActiveSpan1();
shouldHaveActiveSpan2();
  // A
promise.then(() => {
  // B
  shouldNotHaveActiveSpan();
});
  1. Solutions require either (1) leaking memory creating Scope Managers (2) an explicit enable/disable step (3) WeakMaps of Scope Managers, which is non-trivial.

  2. Every continuation local storage solution is instead is based on callbacks, where the storage is scope to a callback and it's transitive calls.

run(session => {
  session.set(value);
  // ...
  session.get();
});

https://github.com/othiym23/node-continuation-local-storage https://github.com/angular/zone.js/ https://nodejs.org/api/domain.html

Even the Go hack for thread locals uses a CPS API: https://godoc.org/github.com/jtolds/gls.

Solution

Similar to my proposal Feb 2017, https://github.com/opentracing/specification/issues/23#issuecomment-282151459 in-process propagation should have a continuation-based API.

Proposal

The entire API could be essentially as simple as:




interface ScopeManger {
  active(): Span;
  execute(Span span, f: () => void): void; 
}



That's it.

There's no open pits for users to injure themselves, no complicated questions. It unambiguously shadows and restores stacks. And it lends itself to "obviously correct" implementations.

Downsides:

  1. It typically adds at least one and probably two frames to the stack for each activated span.

  2. Fitting in some imperative boxes become hard or even impossible.

interface Filter {
  after(request: Request, response: Response);
  before(request: Request, response: Response);
}

IDK how often that pattern comes up. FWIW, the Java servlet standard uses CPS.

It would be possible to have imperative exceptions in certain cases, e.g. Java supports a raw set(Span).

But for 90% of the time and most languages, CPS is a really straightforward, robust, and broadly applicable way to augment scoping/flow control.

pauldraper commented 6 years ago

I've added an implementation of CPS-style JS span activation https://github.com/opentracing/opentracing-javascript/pull/113.

For example, the implementation using the Zone TC39 proposal:

const propertyKey = 'opentracing';

const zoneName = 'opentracing';

interface PropertyData {
    owner: ZoneSpanManager;
    span: Span | null;
}

export class ZoneSpanManager implements SpanManager {
    active(): Span | null {
        for (let zone = Zone.current; zone = zone.getZoneWith(propertyKey), zone; zone = zone.parent) {
            const { owner, span } = zone.get('opentracing') as PropertyData;
            if (owner === this) {
                return span;
            }
        }
        return null;
    }

    activate<A>(span: Span | null, f: () => A): A {
        const zone = Zone.current.fork({
            name: zoneName,
            properties: {
                [propertyKey]: {owner: this, span}
            }
        });
        return zone.run(f);
    }
}

I also tried implementing the current API. Didn't wind up nearly as clean and I still had questions about edge cases for the implementation. https://github.com/rivethealth/opentracing-javascript/blob/pauldraper/wip/src/zone_scope_manager.ts

alloy commented 6 years ago

Drive-by comment by a naive reader here: the datadog client has its own implementation https://github.com/DataDog/dd-trace-js/tree/master/src/scope, which may be of interest to you.

(I’m no author of that code, I have only been diving into this code as a user for the past few days and was looking for details on how scoping worked.)

pauldraper commented 6 years ago

@alloy, that is of great interest, thanks!

Until last month, Datadog used a CPS-style CLS library. I didn't know they had replaced it.

The current imperative RFC can be implemented , though that library is a good example of how non-trivial and non-obvious the implementation is in JavaScript.

alloy commented 6 years ago

@rochdev cc ^

rochdev commented 6 years ago

We replaced CLS with a scope manager that is as close to the specification as possible as we planned to bring that work to opentracing-javascript later on.

Basically we had similar thoughts about how this should work, and here were our findings so far:

We used a combination of continuation-local-storage and cls-hooked at first. The idea was to use the former for Node <8 and the latter for Node >=8. This proved to be difficult for a few reasons:

We then ended up using only cls-hooked, but this proved to be inflexible for a few scenarios. It also uses callbacks, which are incompatible with the OpenTracing ScopeManager specification. We decided to implement the specification so that we would be ready when it officially lands in JavaScript, and also wanted to eventually contribute our code to the opentracing-javascript repository.

As far a closing the scope, the approach we took currently is to automatically close the scope only when the entire asynchronous context has ended, including any child contexts. This approach has proven difficult but not impossible to implement. This is effectively ii in the possible approaches mentioned above.

I think it would be possible to actually have the 2 APIs available. There has been talks of a higher level API, and I think this could be a great use case for this. Basically the scope manager would be the low-level construct, and the CPS approach that was proposed by @pauldraper could be the high-level API that abstracts the complexity of the scope manager. This would however come with a few restrictions of how the scope manager is implemented, especially related to when the scope should be closed.

pauldraper commented 6 years ago

As far a closing the scope, the approach we took currently is to automatically close the scope only when the entire asynchronous context has ended, including any child contexts.

I think that's how Java works, with its AutoFinishScope.

My question was really about manually closing the scope. At what point in this code should what scope(s) be manually closed to have the commented effect?

// ES 2017
const scope = scopeManager.activate(span, false);
const promise = f1(); // active
f2(); // active
await promise;
f3(); // active
f4(); // not active
// ES 2016
const scope = scopeManager.activate(span, false);
const promise = f1(); // active
f2(); // active 
promise.then(() => {
  f3(); // active
  f4(); // not active
});

Or in Java

// Java 8
Scope scope = scopeManager.activate(span, false);
CompletableFuture promise = CompletableFuture.supplyAsync(f1, hookedExecutor); // active
f2(); // active
promise.thenApply(result -> {
  f3(); // active
  f4(); // not active
});

I'd say in between f3() and f4(), but can it be closed from another thread? Or is a new scope created and you have to call active() to get a reference to it?

@carlosalberto

carlosalberto commented 6 years ago

Hey @pauldraper

Thanks a lot for the detailed explanation. Sorry for the delayed answer - wanted to gather together some information before providing the proper information ;)

  1. Must I always close an activated scope?
  2. If the thread exits, is the scope deactivated (and the span closed)?
  3. What happens if I close a scope twice? (Spec answers: undefined)
  4. Can I close a scope from a different thread than I created it?
  5. Are previous scopes automatically restored? If so, what do overlapping scope calls mean?

Quick answers on those:

  1. Yes.
  2. No (unless you put hooks in a thread exit event).
  3. Correct, that is undefined (but personally I'd like it to do nothing ;) )
  4. No (this is mentioned briefly in the README of the Java repo)
  5. Part of this is mentioned in #122 (although that is still a work in progress and needs a few iterations ;) )

So my feeling is that we need to do an even better effort at making things clear - we are probably in a state where insiders know pretty well the expected API behavior, thus we have been forgetting to make those details widely public. Hopefully I can put some more love into #122 to solve this (and later, mention that in the README of each language implementation).

Scope Manager causes memory leaks where there were none before

Regarding this, @yurishkuro had previously mentioned the possibility to add a ScopeManager.clear() method to get rid of any leaked Scopes. Sounds it's time to give it a try ;)

In continuation-based languages, it extremely hard to understand

At the moment both the Python and C# implementations take this into account (the former when instrumenting gevent, tornado and similar frameworks, using coroutine-local storage), and the latter using the async/await construct (using a propagated ThreadLocal storage).

# A Scope is created, kept around, automatically restored after run_coroutine_a()
# and finally closed/deactivated.
# For the children, it is expected that Span is automatically propagated and set as active,
# probably with the help of some additional instrumentation code.
with tracer.start_active_span('foo') as scope:
   run_something()
   await run_coroutine_a()
   run_something_else()

For C# it's even simpler:

// A Scope is created and automatically automatically restored after
// the task (coroutine-alike) finishes. `
# AsyncLocal` is used to store the active `Span`, so it gets *automatically* propagated.
using (IScope scope = tracer.BuildSpan("foo").StartActive()) {
  RunSomething();
  await RunATask();
  RunSomethingElse();
}

Fitting in some imperative boxes become hard or even impossible.

My impression, after the iterations on this topic, is that most developers want more detailed, manual control over things (take, for example, the desire to remove automatic finish of Span on Scope.close()).

However, what has worked for other languages may not work for the rest (Go is an example of that too). And my impression, given the nature of Javascript and and the challenges I knew about its context propagation (even before hearing of DDog's effort), is that you may be on the right track for it (and obviously, if somebody thinks somebody else, feel free to jump in and add feedback ;) )

carlosalberto commented 6 years ago

Hey @rochdev

think it would be possible to actually have the 2 APIs available. There has been talks of a higher level API, and I think this could be a great use case for this. Basically the scope manager would be the low-level construct, and the CPS approach that was proposed by @pauldraper could be the high-level API that abstracts the complexity of the scope manager.

Oh, indeed, I was thinking the same - @tedsuo had proposed a high level API sitting on top of what we have now, so it's definitely something worth trying out (eventually, that is).

carlosalberto commented 6 years ago

On the latest @pauldraper comment:

I'd say in between f3() and f4(), but can it be closed from another thread? Or is a new scope created and you have to call active() to get a reference to it?

So with the current API you'd need to get a Scope here yourself, as Scope objects are not thread-safe, and not intended for being passed between threads - but Span is:

// Java 8
Scope scope = scopeManager.activate(span, false);
CompletableFuture promise = CompletableFuture.supplyAsync(f1, hookedExecutor); // active
f2(); // active
promise.thenApply(result -> {
  try (Scope scope = tracer.scopeManager.activate(span, false) { // Simplified, etc.
    f3(); // active
  }
  f4(); // not active
});

(Another reminder we need to work better at the spec and documentation ;) )

yurishkuro commented 6 years ago

@carlosalberto I haven't tried if scope managers are easy to extend for other functionality (e.g. Brave has a bunch of hooks into its scope managers), but one thing that would be super useful as extension is an optional verifier that would check that a scope is never accessed from another thread aside from the one that created it. I think we should make it clear that scopes must always be "scoped" to a single thread.

rochdev commented 6 years ago

So far I have an implementation that we use in our tracer since we went GA. I'm currently reworking it to be simpler and more efficient.

We've had many iterations and at this point are confident that it should be implemented like this:

For example:

const scope = tracer.scopeManager().activate(tracer.startSpan('A'), true)
const promise = Promise.resolve()

setImmediate(() => {
  // active scope is "A"

  tracer.scopeManager().activate(tracer.startSpan('B'))

  promise
    .then(() => {
      // active scope is "B"

      tracer.scopeManager().activate(tracer.startSpan('C'))

      // scope C will be automatically closed here since it wasn't closed explicitly before the end of the context
    })
    .then(() => {
      // active scope is "B"
      // span of scope A will be finished after this context ends since there are no more descendant asynchronous contexts and finishSpanOnClose was set to true
    })

  // scope B will be automatically closed here since it wasn't closed explicitly before the end of the context
})

scope.close()

// there is no active scope at this point

Please let me know if you need clarification or disagree with any of this as we are actively refactoring our scope manager, and would like to start work to reintegrate it in opentracing-javascript afterwards. I'll post a link to the refactor branch as soon as it's in a working state.

cc @yurishkuro @carlosalberto @tedsuo @pauldraper

rochdev commented 6 years ago

When finishSpanOnClose is used, the span will be finished only after all descendant asynchronous context have exited. The idea is to provide a way to handle asynchronous operations that are not using any sort of callback mechanism for providers that don't support FOLLOWS_FROM. It could possibly be configurable to disable this tracking functionality for providers who support FOLLOWS_FROM and don't need it.

An alternative would be to require the user to manually finish the scope in these edge cases. This would make the implementation much simpler and make it a lot easier to support other older Node versions and browsers.

From https://github.com/opentracing/opentracing-javascript/pull/113:

Just about every JS continuation library uses continuation passing style

I would like to point out however that if you look at the code of for example continuation-local-storage, all that ns.run(callback) is doing basically is:

ns.run(callback) is simply sugar syntax on top of this. Of course it's a lot less prone to errors, but the point is that it demonstrates it's possible to have the 2 APIs.

rochdev commented 6 years ago

@pauldraper After discussing this in Gitter, it seems the API doesn't have to be exactly the same as the spec, but the semantics should be the same.

So in theory it should be possible to use an API similar you proposed but with the constructs from the spec:

interface ScopeManager {
    active(): Scope;
    activate(span: Span, f: (scope: Scope) => void): void; 
}

activate() should accept null to make it possible to bind a function without a scope.

Not sure if it makes sense to still expose scope.close() with this API, but maybe there is a use case for closing the scope early.

pauldraper commented 6 years ago

Promise .then() callback will be scoped to when .then() itself was called (similar to Node domains). I chose this mostly because I figured Node maintainers probably had a good reason for going with this. We could try to find more information about this.

Logically speaking, the callback is a child of the execution at the time the callback was created (not whatever execution completed the Promise). Zone.js chooses the same thing. I would think java.concurrent.CompletionStage instrumentation follows the same principle.

it seems the API doesn't have to be exactly the same as the spec, but the semantics should be the same. So in theory it should be possible to use an API similar you proposed but with the constructs from the spec:

If you use CPS, there's no point to having a Scope to determine span activation. The callback provides the scoping already. The raw simplicity is one of the advantages of CPS here.

interface ScopeManger {
  active(): Span;
  execute(Span span, f: () => void): void; 
}

Honest question: Does any one have an actual concrete example where you want to do something other than

scope = scopeManager.activate(span)
... some synchronous work ...
scope.close()

If there is a more complicated API, there should be some very tangible, real-world examples to justify it.


As @rochdev has out, the real difficult of the RFC, CPS or not, is the automatic close. That is, closing the span one every scope descendant has closed.

It complicates things because it requires more extensive monitoring of async tasks. E.g.

function f() {
  ...
}

const scope = scopeManager.active(span, true);
// https://github.com/tc39/proposal-observable
// https://github.com/ReactiveX/rxjs
const observable = ...
observable.subscribe(f);
...
scope.close();

I now have to track when the observable has finished, or the when the subscription has been unsubscribed, so that I know when all child scopes have ended.

And there are common APIs that make that even harder to do like DOM EventTarget which deduplicates listeners.

Question: What is the concrete example where this is useful?

Again, making very complex-to-implement standards needs to have some real, tangible, compensory justification. What is it?

rochdev commented 6 years ago

Logically speaking, the callback is a child of the execution at the time the callback was created (not whatever execution completed the Promise).

Just to be clear this is exactly what I'm proposing (wasn't sure if this was an argument or validation of the point).

If you use CPS, there's no point to having a Scope to determine span activation. The callback provides the scoping already.

Totally agree. I was just trying to stick to the spec constructs but since it's unnecessary it can probably be the span directly. However, the context that lives inside the callback should definitely be called the "scope" as a concept (i.e. in docs).

Does any one have an actual concrete example where you want to do something other than

scope = scopeManager.activate(span)
... some synchronous work ...
scope.close()

Technically, for this specific imperative example I'd say the most common one is async/await:

const scope = scopeManager.activate(span)
await somePromise()
scope.close() // this no longer runs synchronously

We could then consider that this is yet another use case that is better served by using CPS:

scopeManager.activate(span, async span => {
  await somePromise()
  // scope can be closed synchronously in the wrapper without waiting
})

Maybe there are other use cases, but I can't think of one at a glance.

Question: What is the concrete example where this is useful?

Not many to be honest :) And you're absolutely right that it comes with not only additional complexity but also performance implications. In my most recent implementation testing I've removed this tracking.

pauldraper commented 6 years ago

Just to be clear this is exactly what I'm proposing

Sorry for the ambiguity. I was further justifying that.

Technically, for this specific imperative example I'd say the most common one is async/await: We could then consider that this is yet another use case that is better served by using CPS:

Yes. This goes back to some of my original questions, but I believe that you'd want the imperative one to be.

scope = scopeManager.active(span)
promise = somePromise()
scope.close()
await promise

or equivalently

scope = scopeManager.active(span)
somePromise().then(() => ...)
scope.close()

But I don't believe there is a correct way to close the scope without breaking apart lines

scope = scopeManage.active(span)
// scope.close() -- wrong
await somePromise()
// scope.close() -- wrong

Not many to be honest :)

:/

rochdev commented 6 years ago

Simple implementation using CPS: https://github.com/DataDog/dd-trace-js/blob/cps-scope-manager/src/scope/new/scope_manager.js

Sorry if the semantics are a bit weird in a few places as I didn't update the naming from the previous implementation.

pauldraper commented 5 years ago

Part of this is mentioned in #122

Hopefully I can put some more love into #122 to solve this

Meanwhile, more than a year later....

I think it should be obvious by this point that there is an absurd amount of complexity around what is essentially already a solved problem. (https://github.com/angular/zone.js/, https://github.com/royalrover/threadlocal, https://github.com/othiym23/node-continuation-local-storage)

How this works is so trivially obvious:

interface ScopeManger {
  active(): Span;
  execute<T>(Span span, f: () => T): T; 
}

Whereas implementing the monstrosity of Opentracing context manager is so difficult that many months later no one has been able to do it in opentracing-javascript (https://github.com/opentracing/opentracing-javascript/issues/112 https://github.com/aws/aws-xray-sdk-node/issues/60).


I wonder why OpenTracing even tries to pretend to bring multiple languages under one standard; the decisions seem so obvious in their disregard of other languages.

Originally, it started with Go programmers. Go has no thread locals, so OpenTracing ignores context propagation altogether.

OpenTracing subsequently fails over and over again to be adopted into Java or Python stacks because it lacks the feature that APMs have had since day 1.

After lengthy discussion, OpenTracing finally adds context propagation but then gears it very specifically towards a certain concurrency model, and adds all sorts of noise like auto closing.

And so in the end, I'm stuck writing my own thing because everyone is handicapped by a couple special languages that the OpenTracing authors happened to know.

rochdev commented 5 years ago

@pauldraper One of the reasons the scope manager didn't happen in OpenTracing is that most of the effort has shifted to OpenTelemetry as it will replace OpenTracing soon.

While the language bias towards Java has unfortunately not been addressed in OpenTelemetry, the scope manager is at least already merged and is actively being improved.

I would recommend to move any discussion related to the scope manager over there.

carlosalberto commented 5 years ago

Hey @pauldraper

This is possibly a good time to join the OpenTelemetry Javascript effort to provide feedback and insight into the in-process propagation part.

Btw, agree that the ScopeManager concept was not something for all languages ;)