googleapis / google-cloud-go

Google Cloud Client Libraries for Go.
https://cloud.google.com/go/docs/reference
Apache License 2.0
3.71k stars 1.27k forks source link

pubsub: native OpenTelemetry support #4665

Closed ianonavy closed 2 weeks ago

ianonavy commented 3 years ago

cc @aabmass

Is your feature request related to a problem? Please describe. As developer using PubSub via the Golang client lib, I would like to propagate OpenTelemetry span contexts via messages so that I can better understand the entire lifecycle of a message.

Describe the solution you'd like I would expect a working solution to look similar to the NodeJS implementation.

Maybe something like this?

client, _ := pubsub.NewClientWithConfig(ctx, PROJECT_ID, pubsub.ClientConfig{
    OpenTelemetryEnabled: true,
})

Then I would expect the result to look something like this when configured with a Cloud Trace exporter:

Screenshot of trace waterfall showing spans across publisher and consumer

Image is from this article by the former intern who implemented the NodeJS and Python versions of this request.

Describe alternatives you've considered OpenTelemetry-OpenCensus Bridge would be great if PubSub were already instrumented to propagate OpenCensus spans through messages like this, but unfortunately the behavior I describe is orthogonal to the existing issue to port the existing tracing from OpenCensus to OpenTelemetry.

hongalex commented 3 years ago

Hiya,

Thanks for filing this issue. OT tracing support is definitely something that's on our roadmap and we're actively planning to support this. At the moment, it's mostly blocked on standardization across Pub/Sub client library languages and waiting on the Go implementation for 1.0. I'll post here for updates.

Also appreciate the detail info. If you have specific asks or features you want, I'd be happy to relay them to the rest of the client library maintainers while we work on this.

ianonavy commented 2 years ago

Hi again. otel-go released 1.0 yesterday! 🎉

Could you please shed some insight on what

standardization across Pub/Sub client library languages

means?

I talked to @aabmass and @weyert on the CNCF Slack about the googclient_OpenTelemetrySpanContext format from the NodeJS (and WIP Python) implementations, and I understand there was some internal discussion about the merits of that format over say something based on traceparent. Is that what you mean by "standardization across [...] languages"? If so, I am happy to opine from a user perspective as well. 😄

hongalex commented 2 years ago

Hey! Yeah we (myself and the Node library maintainer) are in contact with @aabmass. Yeah, that's one of the few things that we're trying to standardize though we're not 100% sure how we're going to fix that for Node since it's already released. We're discussing the merits of making a breaking change/major version bump or just supporting both attributes.

Among the other things is what parts of the publishing/subscribing we want to trace (and what the end result should look like). We kind of have an idea about this from the Node implementation, but it seems like it might have issues. I'll start hacking away at draft PR in Go and ask for feedback there if you're willing to help with that!

weyert commented 2 years ago

Did you had any luck with this?

hongalex commented 2 years ago

Sorry for the lack of an update. Yeah, I've had success with a basic span that encompasses send/receive/consume. I'm trying to make sure the spans look like what you get out of other messaging systems, following semconv, and should have a draft ready by end of this week.

weyert commented 2 years ago

@hongalex That's great! If that's the case I wil try to match semconv usage it in the node-library. Still not sure how to deal with message handling in node

hongalex commented 2 years ago

Could you clarify what the issue with message handling is? I did notice that there is some issue with context propagation in Node (contexts are marshalled with json instead of a propagator that follows W3C tracestate/traceparent conventions).

One thing that I'm still missing right now in Go is linking spans when doing batch receiving](https://github.com/open-telemetry/opentelemetry-specification/blob/v1.1.0/specification/trace/semantic_conventions/messaging.md#batch-receiving). After this, the PR should be ready for feedback.

weyert commented 2 years ago

Could you clarify what the issue with message handling is? I did notice that there is some issue with context propagation in Node (contexts are marshalled with json instead of a propagator that follows W3C tracestate/traceparent conventions).

Yeah, I meant that there is a ticket in the node project were is said the context propagation is not working as expected so that's something I would like to solve. See: https://github.com/googleapis/nodejs-pubsub/issues/1389

Alongside that I want to move towards the same approach you are using with the W3C tracestate/traceparent conventions. For the node-library I am to going to worry about batch receiving yet.

hongalex commented 2 years ago

I finally opened #5034, please take a look/try it out for yourself. It's very much a work in progress still (need to figure out how to pass messaging.destination to the subscribe side, benchmarking message size calculations).

weyert commented 2 years ago

Thanks, I will check it out :)

Mistic92 commented 2 years ago

Hi, what is the current status of that feature?

jjmengze commented 1 year ago

Ho folks , is there any news? Very excited to google support otel

quzhi1 commented 1 year ago

We really need this feature. Right now we don't have a very elegant way of passing trace info from publisher to subscriber.

billinghamj commented 1 year ago

What is the situ here, given https://github.com/googleapis/google-cloud-go/pull/5034#issuecomment-1142720342? If we want to keep a consistent trace across all messages, is GCP going to prevent that?

jjmengze commented 1 year ago

is there any updated?

quzhi1 commented 1 year ago

I wrote a blog about how to enable tracing in Google PubSub. Basically you need to wrap tracing information inside message attribute. This is the link to my blog: https://www.nylas.com/blog/distributed-tracing-with-opentelemetry/. Search for "Tracing PubSub" and you will find my helper functions.

zwolsman commented 1 year ago

Is there any update as of when this will be picked up again/available? 😄

ghost commented 10 months ago

Any update here? Seems this requirement is strong.

mhv666 commented 6 months ago

Same question here, any plans to implement it? 🙇🏽 Thanks

hongalex commented 5 months ago

Thanks for the patience on this. We're still working out a few kinks, but if you would be interested in trying out this, it is available to try out as a beta release. This can be done by running go get cloud.google.com/go/pubsub@v1.37.0-beta.otel.trace.

Official documentation is still pending, but you should be able to see traces once you do the following

  1. Instantiate an exporter + tracer provider (Cloud Trace exporter example)
  2. On Pub/Sub client instantiation, set the client config EnableOpenTelemetryTracing to true.
  3. Publish and subscribe to messages as normal using this new version of the client library.

Each message is given its own trace, from publish to subscribe. Batch RPC spans (publish, ack, modack) will live in their own traces, but durations can be viewed on the message trace via span events.

image

While we are close to launching this feature, we welcome any feedback.

Edit: here are basic publish and subscribe examples.

billinghamj commented 5 months ago

@hongalex Could you clarify how the new support connects with this? https://github.com/googleapis/google-cloud-go/pull/5034#issuecomment-1142720342

Has your thinking changed? Are there best practices or suggested options you have in mind?

hongalex commented 5 months ago

The main change is that we are now relying more heavily on links. The design at the time of the previous comment duplicated batch RPC spans to every single message trace. So if you make a publish RPC call with 10 messages in it, then each message trace will contain a copy of the publish RPC span. Now, that span exists in its own trace, and is linked to each message trace it is affiliated with. In addition, in each message trace, we added span events to say how long the RPC took.

There were some span name changes and attribute changes that were made with updated messaging semantic conventions in mind. This is still an evolving standard so we reserve the right to make breaking changes to things like spans, span names, attribute names, and links.

jameshartig commented 5 months ago

One issue I noticed is that the parent's context is not being used in the ack spans so you send up missing some of the ack spans if you are sampling. We use trace.ParentBased as a sampler so if the parent is sampled all of the children are sampled but the problem is that when startSpan is called for the ack spans it uses a new context rather than using the context from the parent span.

jameshartig commented 5 months ago

I left a few comments on the PR, we ran this in production for a while but quickly noticed memory leaks, I believe from the activeSpans map.

hongalex commented 5 months ago

One issue I noticed is that the parent's context is not being used in the ack spans so you send up missing some of the ack spans if you are sampling

This is unfortunately intentional. The short story is that batched spans (like ack) have multiple parents, so it cannot solely rely on parent/child relationships and why I gave it a new context. Instead, batched spans live in their own trace and are linked to the main message trace. I plan to add a new option to link back from message trace -> batched ops traces.

I left a few comments on the PR, we ran this in production for a while but quickly noticed memory leaks, I believe from the activeSpans map.

Thanks for the comments. Fixed the memory leak issues (and others) and cut another release under cloud.google.com/go/pubsub@v1.37.0-beta.otel.trace2

jameshartig commented 4 months ago

@hongalex thanks for the quick turnaround. I can confirm we aren't seeing the memory leak in the latest release.

tcnghia commented 3 months ago

@hongalex do you have an estimated timeline for the launch of this feature?

Also, do you plan to cut regular releases following the main versions (1.38.0 now)?

Thanks a lot!

hongalex commented 2 weeks ago

Thanks all for your patience. This feature is now released as part of cloud.google.com/go/pubsub@v1.42.0

Please use the issue tracker for any problems you may notice while enabling tracing.

As a reminder, tracing is GA, but span names, attributes, links, and other parts of the trace are subject to change as we align with OpenTelemetry semantic conventions messaging standards.