googleapis / nodejs-pubsub

Node.js client for Google Cloud Pub/Sub: Ingest event streams from anywhere, at any scale, for simple, reliable, real-time stream analytics.
https://cloud.google.com/pubsub/
Apache License 2.0
516 stars 227 forks source link

OpenTelemetry integration misses the point: we need to propagate the span/context! #1389

Open hermanbanken opened 2 years ago

hermanbanken commented 2 years ago

The code where the Subscriber reads the span from the attributes just creates the span and then throws it away: https://github.com/googleapis/nodejs-pubsub/blob/99107930bbed8e29834acb2347f65efc6a39958f/src/subscriber.ts#L517-L527

A good OpenTelemetry customer would then set the span to the current span in the on("message") handler, so that any subsequent processing is attached to the remote span.

I'll also file a Google support case.

Environment details

Steps to reproduce

  1. Enable OpenTelemetry in both publisher and subscriber using { enableOpenTelemetryTracing: true } options on topic and subscription.
  2. Perform some processing that creates spans, either manually or (for example) do a Datastore Query or Put.

It can be observed that actions are linked to the subscriber span, as can also be seen in this screenshot (spans take <1ms because nothing is done inside):

Screenshot 2021-08-30 at 16 47 23

Relates to #1078.

hermanbanken commented 2 years ago

@sethmaxwl was the scope of the initial OpenTelemetry PR to just link the publish and subscribe side together? I think it missed the point for any consumer application if you want to link the publish to more downstream actions inside the subscriber.

feywind commented 2 years ago

@hermanbanken Thanks for the comments! We are in the middle of trying to harmonize our treatment of OpenTelemetry for Pub/Sub across languages/libraries. The Node one was sort of a pilot project to see how it goes, and we're hoping to make that better. Feel free to add any comments you have (or anyone who wants to comment!) and we'll add it to the discussion.

hermanbanken commented 2 years ago

One remark I have is that it should be documented a little bit more how it is working. For example, the (semantic) key name and the data format of the propagation key could be documented. I believe it was __googPubsubTrace-something and I think it would be nice if that could be not some magic internal constant. Also, why not use the standard w3c trace header name and format?

weyert commented 2 years ago

Good point, I think Aaron suggested the same regarding W3C trace format. Existing propagator should be reusable. I struggled to get it all working with the recent Otel-js. If you want you can raise a PR or I can have a stab at it during my next tech debt day at work

I am guess we should make the span we created the active one? Got to look a bit more into the workings of subscriber class

weyert commented 2 years ago

@hermanbanken Did you hear back from Google Support about your case?

hermanbanken commented 2 years ago

At September 1st they said that they would working on it "next quarter".

Public tracker: https://issuetracker.google.com/issues/183608476

hermanbanken commented 2 years ago

That tracker seems to be inaccessible now... weird.

weyert commented 2 years ago

Going to revisit it this week. Have to solve it for work :)

On Sun, Feb 20, 2022 at 3:43 PM naseemkullah @.***> wrote:

+1

— Reply to this email directly, view it on GitHub https://github.com/googleapis/nodejs-pubsub/issues/1389#issuecomment-1046265027, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAABXCIK5RBIXPARJ4MCLTDU4ED2FANCNFSM5DCBVHCA . You are receiving this because you commented.Message ID: @.***>

naseemkullah commented 2 years ago

I came up with this as a workaround (there is a comment in the event listener about what I think this library should provide):

import {Message, PubSub} from '@google-cloud/pubsub';
import {
  ROOT_CONTEXT,
  SpanContext,
  SpanStatusCode,
  trace,
} from '@opentelemetry/api';

const tracer = trace.getTracer('foo');

const pubsub = new PubSub({projectId: 'abc'});

const enableOpenTelemetryTracing = {
  enableOpenTelemetryTracing: true,
};

export async function subcribe() {
  const subscription = pubsub.subscription(
    'testSubscription',
    enableOpenTelemetryTracing
  );

  subscription
    .on('message', (message: Message) => {
      // it would be convenient if everything in this handler
      // apart from `handleMessage(message)` logic was provided by the GCP pub/sub library
      // when enableOpenTelemetryTracing is true
      const spanContext = getSpanContext(message);

      if (spanContext === undefined) {
        handleMessage(message);
      } else {
        const context = trace.setSpanContext(ROOT_CONTEXT, spanContext);

        tracer.startActiveSpan('handleMessage', {}, context, span => {
          try {
            handleMessage(message);
            span.setStatus({code: SpanStatusCode.OK});
          } catch (err) {
            span.setStatus({
              code: SpanStatusCode.ERROR,
              message: err instanceof Error ? err.message : undefined,
            });
            throw err;
          } finally {
            span.end();
          }
        });
      }
    })
    .on('error', (error: Error) => {
      console.error('Received error:', error);
    });
}

function handleMessage(message: Message) {
  console.log('Received message:', message.data.toString());
  message.ack();
}

function getSpanContext(message: Message): SpanContext | undefined {
  try {
    return JSON.parse(
      message.attributes['googclient_OpenTelemetrySpanContext']
    );
  } catch (error) {
    return undefined;
  }
}

worth noting, while the message acknowledgment and any processing leading up to it are part of the same trace, the /google.pubsub.v1.Subscriber/ModifyAckDeadline span is still disconnected

edit: so apparently "span links" are the proper way to link a messages consumption to its production (as @weyert menitons below), above code snippet should probably not be used

weyert commented 2 years ago

Thank you, that looks useful. I think we might be able to use span links to connect modifyackdeadline and the other cools. But to me having it not linked is already a great improvement

feywind commented 2 years ago

The tracker page is still there for me, internally at least. Weird. Anyway, nothing has happened on this yet, but it's increasing in priority right now. Definitely not forgotten.

weyert commented 2 years ago

Yeah, I hope to have a stab at it soon. So far more important stuff at work than this.

feywind commented 2 years ago

You might not want to put too much into it, just because we've got a revamped design for OT that will be rolling out across all the languages, over time. Of note, it's going to propagate the span and context by way of the W3C spec, as well as doing better about "closing the loop" on the subscriber end. We've got a prototype implementation in Go, and that just needs to be expanded across all the languages.

https://github.com/googleapis/google-cloud-go/pull/5034

If you see anything you want in the Node implementation that's missing from the Go one, please feel free to comment!

weyert commented 2 years ago

Yeah, the changes I wanted to propose via PR would be heavily based on the Go implementation :)

Let me know, if you prefer me to wait, that's fine, happy to wait for a Google implementation if it's coming in short term

GraemeF commented 1 year ago

Great news that this is being fixed! Just spent a day scratching my head as to why my traces were disconnected...

feywind commented 1 year ago

Sorry for the radio silence on this topic, but it's still a major objective for us. None of this is official until it's official (please don't take it as an announcement/promise), but the current thought is that we're planning to shift it to use the more standard mechanism (w3c headers), and to get it piped all the way through. A preview of the Go version produces a really nice graph in Cloud Trace, showing the time spent in each phase of the library (batching, ack leasing, etc). There isn't currently anything that will be traced in the server, but the end to end trace should at least allow you to see how long it spends in the server before being sent out.

feywind commented 1 year ago

Mentioned in the PR, but mentioning here too: there's a suuuuper early preview branch now if anyone wants to play around with what's done. I don't recommend using it in any kind of production scenario or assuming any of the interfaces or trace formats are final.

That PR is here: https://github.com/googleapis/nodejs-pubsub/pull/1659

And the branch in googleapis that anyone can make PRs against if you want: https://github.com/googleapis/nodejs-pubsub/tree/otel-2

I'll be out for a bit but plan to pick it up again in early 2023.

0xdevalias commented 1 year ago

@feywind Just wanted to check in and see how 2023 was treating this work, any updates yet?

feywind commented 1 year ago

I've been posting updates on the PR, but I'm planning to make a preview version soon.

feywind commented 1 year ago

Tiny sample size, I admit, but quick survey for those on this thread and wanting OTel support sooner rather than later: do you consider the trace structures and span names to be part of the public interface? Would it break your code if they changed? Or is it just a human debug tool?

weyert commented 1 year ago

@feywind for me it’s a human debug tool:) I am personally not planning to generate metrics from the traces via like otel-collector

ervin-pactum commented 1 year ago

do you consider the trace structures and span names to be part of the public interface? Would it break your code if they changed? Or is it just a human debug tool?

Mainly it is human debug tool, but there might be some alerting around it, I would expect that changes to span names or changes in amount of spans emitted to be mentioned in changelog/release notes. Mainly because there is some little cost attached to each single span.

0xdevalias commented 1 year ago

Pretty much entirely human debug tool, but mentioning any significant changes in the changelog would be helpful for updating any internal process documentation/etc.

SimenB commented 1 year ago

Same here as others have said - only used for some alerting and saved searches in grafana, "production code" doesn't care

danmichaelo commented 1 year ago

Same here.

feywind commented 1 year ago

Thanks everyone, I appreciate the input. Our main goal for the moment is more of a human debug solution, too, so I think we're all aligned. :) A sort of "soft interface" is an interesting idea, not needing a whole new major as long as updates are mentioned consistently somewhere. That'll be good to take into account.

feywind commented 9 months ago

That #1833 is the version rebased onto main (4.x). I'm planning to get that released and then go back to do any final backporting to 3.x.

feywind commented 3 months ago

Quick update, there are some beta testing instructions over here if anyone is interested.

feywind commented 2 months ago

Has anyone had a chance to try this out? There are a few things that need to be finished up to get this released, but it'll probably be pretty soon.

weyert commented 2 months ago

I have tried it during local development and works pretty well for me!

vesse commented 1 month ago

@feywind I have now been trying the beta in GCP but don't get the pubsub message handler requests to the same trace than from where it was triggered. We use @opentelemetry/auto-instrumentations-node and Pub/Sub push subscriptions with Cloud Run instance being triggered from the messages.

I can see that the incoming request body contains eg.

attributes: {
  googclient_traceparent: "00-4dccfee9342c29e33b8fcf484aa52ea9-6b72c4b7d37f4bd0-01"
}

where the trace ID matches the trace ID of the publisher, but the incoming HTTP request traceparent has different trace ID.

Note: I have very little prior OTEL experience but trying now to set up tracing so that incoming request trace would contain everything the request handling caused. I don't even know if I'm supposed to do something manually or are push subscriptions even supposed to be supported throughout the chain.

feywind commented 1 month ago

@vesse It seems push subscriptions are slightly out of scope for this first pass. You should be able to pull that googclient_traceparent value and use it for further OpenTelemetry tracing.

I'm not sure exactly what is placing a new traceparent in your HTTP headers, but it's something we'll need to figure out for the push support, so thanks for the heads-up :)