Spans are mixed on concurrent async calls on a Hapi server

scabfos commented 4 years ago

We use zipkin-js to observe our Hapi server and its dependencies and we noticed that when we get multiple concurrent calls, some of the spans are getting assigned to the wrong traces.

We were working on this issue with @jcchavezs and we created a PR with an example to prove with a simple Hapi project that this is in fact happening.

Screen Shot 2020-02-07 at 11 57 56 AM Screen Shot 2020-02-07 at 12 00 32 PM

As you see in the images, when doing multiple concurrent calls, spans are getting assigned to the wrong one.

Steps to reproduce

Clone this PR: https://github.com/openzipkin/zipkin-js-example/pull/28
Start zipkin local server docker run -d -p 9411:9411 openzipkin/zipkin
Run yarn && node index.js
Run ab -c10 -n10 http://127.0.0.1:3000/
Check zipkin
🤷‍♂

If you have any idea or questions you can reach us at @lussn @alexfdz @javigomez Thank you!

jcchavezs commented 4 years ago

Thanks for this @scabfos, as you may know zipkin is a volunteering project and reporting an issue with a breaking example is truly appreciated.

I managed to make it work but I am not sure how to proceed in terms of implementing the fix. The main issue here is that hapi instrumentation works in terms of listeners instead of middleware. If you look at express instrumentation you will see the following:

  return function zipkinExpressMiddleware(req, res, next) {
     ...
     tracer.scoped(() => {
        ...
        next()
      })
  }

This means that everything in next() happens scoped. In hapi case, we subscribe to events onRequest, onPreHandler, onPreResponse but we can't actually wrap with a scope what happens in the handler. If we change the code in the example from:

server.route({
    method: 'GET',
    path: '/',
    config: {
      pre: [
        {
          method: () => {
              return doCall()
          },
          assign: 'users',
        },

into

server.route({
    method: 'GET',
    path: '/',
    config: {
      pre: [
        {
          method: (req) => {
            return tracer.scoped(function() {
              tracer.setId(req._trace_id) // reasigning the traceId in the scope from https://github.com/openzipkin/zipkin-js/blob/master/packages/zipkin-instrumentation-hapi/src/hapiMiddleware.js#L28
              return doCall()
            })
          },
          assign: 'users',
        },

Then everything works as expected.

I am not necessarily an expert in JS but one alternative I see here is that we wrap the whole server and then on server.route we wrap the methods with the scope. This is going to change the way we use instrumentation as we won't use the plugin declaration anymore (it will happen underneath). Explicit wrapping is something for example this library does, hence I am afraid wrapping the whole server is the only alternative I can think of.

I am friendly pinging @hueniverse to see if there is a way to wrap the methods in hapi through plugins cc @adriancole

codefromthecrypt commented 4 years ago

what @jcchavezs said makes sense. We cannot effectively instrument a listener model, unless there is shared state between the listeners.

For example, if it is 100% the case that onX is followed by onY or onZ, and between onX and onY or onZ, user code is executed.. you can use tracer.setId(span) in onX, and revert it in onX or onY.

anuraaga commented 4 years ago

It looks like there is a plugins object where plugins can put request state they need (though this is JavaScript, we could just set any attribute if we really wanted :-) )

https://github.com/hapijs/hapi/blob/master/API.md#-requestplugins

I think this should give the shared state between listeners we need, can't we put the trace context in?

codefromthecrypt commented 4 years ago

one thing to note about the state issue is we are talking about the state around scope. ex which scope to revert to.. ex tracer.setId is similar to java currenttracecontext.newScope

anyway sounds like we have options which is good!

On Sun, Feb 9, 2020, 8:25 PM Anuraag Agrawal notifications@github.com wrote:

It looks like there is a plugins object where plugins can put request state they need (though this is JavaScript, we could just set any attribute if we really wanted :-) )

https://github.com/hapijs/hapi/blob/master/API.md#-requestplugins

I think this should give the shared state between listeners we need, can't we put the trace context in?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/openzipkin/zipkin-js/issues/483?email_source=notifications&email_token=AAAPVV4NKV4FWPOWI3T7GT3RCDJLXA5CNFSM4KRNCIA2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOELHF2LI#issuecomment-583949613, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAPVV3YJVDCJEB2ZAJ3CULRCDJLXANCNFSM4KRNCIAQ .

jcchavezs commented 4 years ago

what jcchavezs said makes sense. We cannot effectively instrument a listener model, unless there is shared state between the listeners. For example, if it is 100% the case that onX is followed by onY or onZ, and between onX and onY or onZ, user code is executed.. you can use tracer.setId(span) in onX, and revert it in onX or onY.

@adriancole we currently do that but id does not work when using promises (this is what I noticed) because you can't wrap the actual handler execution with a scope.

It looks like there is a plugins object where plugins can put request state they need (though this is JavaScript, we could just set any attribute if we really wanted :-) ) https://github.com/hapijs/hapi/blob/master/API.md#-requestplugins I think this should give the shared state between listeners we need, can't we put the trace context in?

@anuraaga we do that already, see https://github.com/openzipkin/zipkin-js/blob/master/packages/zipkin-instrumentation-hapi/src/hapiMiddleware.js#L28

one thing to note about the state issue is we are talking about the state around scope. ex which scope to revert to.. ex tracer.setId is similar to java currenttracecontext.newScope

@adriancole I am not sure about this statement. As I said the problem is that eventhough we set the traceId before and after the handler, the scope inside the handler (when using promises) could be anything so I'd say the only way to solve this is wrap the handler with a scope. Maybe I am wrong but after debugging this issue for several hours those are the conclusions I arrived to.

codefromthecrypt commented 4 years ago

hard for me to drive-by recommend anything. I mainly meant that tracer.setId(req._trace_id) type of thing is the only thing available in callback. scoped is not helpful there.

anuraaga commented 4 years ago

Found that lifecycle methods are called in order here

https://github.com/hapijs/hapi/blob/master/lib/request.js#L455

Indeed, because the result is always awaited, it will be interleaved with other requests.

That being said, it seems that hapi supports multiple server decorators since 17.1.0

https://github.com/hapijs/hapi/commit/307a8d07a79aa512fc9067d2f1ad8db91479e58d

Looking at our instrumentation we wanted to use this but couldn't due to the limit of one. Now that it isn't an issue, can we use that to decorate handlers? Not sure if we have a min version requirement that doesn't work with that though.

https://github.com/openzipkin/zipkin-js/blob/6abcd0406caa8941008c170b2776027b980e5c24/packages/zipkin-instrumentation-hapi/src/hapiMiddleware.js#L32

jcchavezs commented 4 years ago

@anuraaga @adriancole I don't see how the decorate could help us with the wrapping. If the intention is that we can override the server.routes I tried that with no success, one can not decorate existing functions, only add new ones. Maybe I am missing something?

I opened an issue in hapi repo, I hope we can get better insights in there: https://github.com/hapijs/hapi/issues/4049

hueniverse commented 4 years ago

What does it mean to wrap a handler or a pre? What happens when code runs inside the scope() wrapper?

cur3n4 commented 4 years ago

I am encountering this limitation also when trying to use Continuation Local Storage to set some correlation id (similar to zipkin). I can decorate the handler, but in my case it would be better to be able to span this across more than one lifecycle event (i.e. from onRequest to response).

I believe this might have been possible before the toolkit signals were added to the lifecycle methods signature (<=16), as one could have done something like this on the onRequest listener:

namespace.run(() => reply.continue())

cur3n4 commented 4 years ago

By the way we are currently decorating all the handlers in one go as we start the server:

server.table().forEach(route => {
    const matchingRoute = server.match(route.method, route.path);
    matchingRoute.settings.handler = loggingContextDecorator(matchingRoute.settings.handler);
});

And the loggingContextDecorator does something like:

function decorate(handler: Lifecycle.Method) {
    return async (request: Request, ...args) => {

        return ns.runAndReturn(async () => {
            ns.set(SOME_KEY, someValue);

            try {
                return await handler.apply(this, [request, ...args])
            } catch (err) {
                throw err;
            }
        })
    };
}

export function loggingContextDecorator(handler: Lifecycle.Method | object) {
    if (typeof handler  === 'function') {
        return decorate(handler as Lifecycle.Method)
    } else {
        // TODO deal with objects
    }
    return handler;
}

openzipkin / zipkin-js

Spans are mixed on concurrent async calls on a Hapi server #483

Steps to reproduce