open-telemetry / opentelemetry-ruby-contrib

Contrib Packages for the OpenTelemetry Ruby API and SDK implementation.
https://opentelemetry.io
Apache License 2.0
79 stars 160 forks source link

bunny instrumentation is not propagating contexts #526

Open mugli opened 1 year ago

mugli commented 1 year ago

Description of the bug

The doc mentions (link):

ℹ The OpenTelemetry Ruby SDK will take care of context propagation as long as your service is leveraging auto-instrumented libraries.

And,

The W3 TraceContext and Baggage propagators are configured by default.

This means we expect the baggage header (among others) to propagate from RabbitMQ publishers to consumers when we use the bunny instrumentation.

However, that is not happening right now. Instead of propagating any contexts (including baggage), the bunny instrumentation drops all contexts in the consumer and starts a new trace.

I got to know from this discussion that bunny instrumentation is trying to follow batch messages receive semantics, that's why it creates new traces and then links them. The initial implementation was also heavily inspired by the ruby_kafa instrumentation.

I don't have an opinion if that's the right way to do this or not. I just checked the AMQP instrumentation in Node.js and found that it doesn't do it like this, it starts a new span from the parent context and propagates context properly.

However, regardless of which message semantic it chooses to follow, I think that not propagating contexts is a bug in the bunny instrumentation.

Additional details:

I believe this is the place where the context is getting lost: https://github.com/open-telemetry/opentelemetry-ruby-contrib/blob/1990634acece4b32f56663a720e18b27a14a2b25/instrumentation/bunny/lib/opentelemetry/instrumentation/bunny/patch_helpers.rb#L35C18-L45

Because here we are creating tracer_receive_headers from scratch instead of taking the values from relevant message headers: https://github.com/open-telemetry/opentelemetry-ruby-contrib/blob/1990634acece4b32f56663a720e18b27a14a2b25/instrumentation/bunny/lib/opentelemetry/instrumentation/bunny/patch_helpers.rb#L47-L64


Share a simplified reproduction if possible

I've created an example repository to reproduce the problems described in the issue: https://github.com/mugli/otel-ctx-prop-amqp-pubsub

mugli commented 1 year ago

Another thing, because it is probably not clear in the original message, I think the context propagation bug is on the consumer side of the bunny instrumentation.

mugli commented 1 year ago

Just fyi, I edited the original post to add an example repository to reproduce the problems described in the issue: https://github.com/mugli/otel-ctx-prop-amqp-pubsub

mugli commented 1 year ago

And since GitHub duscussions and issues do not share references automatically, at least 2 other people mentioned the similar prolem here: https://github.com/open-telemetry/opentelemetry-ruby/discussions/1098

(I mentioned this thread in the original message. It's just so easy to miss that among other things)

mugli commented 1 year ago

"The Bunny instrumentation adheres to the OTel messaging specification and uses Links instead of Child Spans in consumers."

^ I have been thinking about this assertion in the discussion ^. I could not find why this decision was there to follow this specific semantics and not others in the first PR that contributed the gem.

So, from the people who understand the OTEL specs and implementations way better than I, could anyone please share scenarios where not adhering to this particular batch-receiving semantincs would cause me a problem for this particular example I shared here: https://github.com/mugli/otel-ctx-prop-amqp-pubsub. This is a very simple demo, most parts are even directly taken from the basic RabbitMQ tutorials.

If it's not a problem to propagate everything (including the same trace-id), can I create a PR attempting to fix this for the bunny instrumentation? At the same time, it'll allow bunny instrumentation to align with what amqplib instrumentation is doing in the demo. Would you agree to that?

I think the bunny instrumentation right now totally ignores all aspects of context propagation by trying to do it like this. There is more to it than just the traceparent and maintaining the trace-id there. It is already missing tracestate, baggage in the propagation and people can even provide their own custom propagator, those will not work either with this limitation.

It'd be really great if someone can verify that it's alright to do it the way amqplib instrumentation is doing.

bolandoboy commented 1 year ago

Just my 2 cents, as recently I've been surprised and struggling with this as well.

AFAIK bunny doesn't batch messages. Both Queue#subscribe and Subscriber#on_delivery receive a block which in turn receives the payload for a single message.

I would expect the spans created by the reception and processing of that particular message to have the context propagated from the publisher. Prior to migrating to OTEL, in my company we were actually propagating our own "trace id" via headers, as there's a clear 1:1 relationship between the publication and reception of a single message with bunny.

It turns out, to add to the discussion, that we're implementing buffering on top of bunny, so our "client" code actually ends up processing a batch of messages in one go. In that case we definitely have to start a new trace for the batch and link it to all the previous traces of the individual messages, but I don't think that's something that should happen before, at the "bunny level".

bolandoboy commented 1 year ago

One more thing to consider: the Sidekiq instrumentation offers the possibility to propagate contexts or link traces via the propagation_style config option.

Not being a spec expert myself by any means, I interpreted that as "ok, maybe in some systems propagating contexts result in huge traces, so people can choose to "cut the trace" and just use links"... And I thought in the case of bunny that decision had been made for us, although I wondered why not take the Sidekiq approach if possible.

Now, from the discussion, it seems the decision was actually based on semantics. I reiterate that I don't think we're dealing with batch receiving in Bunny. And now I wonder why the choice is offered for Sidekiq, since vanilla Sidekiq doesn't even offer batching, but that's another issue (if it's an issue at all).

In any case, leaving semantics aside, could we leave it to the user to choose the propagation style for Bunny, as it's already implemented for Sidekiq?

robbkidd commented 1 year ago

I'll get oriented on this propagating bunnies issue and see what I can do.

robbkidd commented 1 year ago

An OTEP pertinent to this problem ought to merge into the spec in a few days. (I need to catch up on reading that.)

bolandoboy commented 1 year ago

Thanks @robbkidd

So when the OTEP gets merged, this is going to be the final convention for the scenario we're discussing:

For each message it accounts for, the "Deliver" or "Receive" span SHOULD link to the message's creation context. In addition, if it is possible the creation context MAY be set as a parent of the "Deliver" or "Receive" span.

The current Bunny instrumentation will meet the SHOULD part without any modification, as it's already using links. Hopefully we can consider the MAY part, making the "Deliver" span a child of the corresponding "Publish"/"Create" span...? (As shown in the OTEP examples)

I can see it's not an obvious decision, though. The trade-off is expressed in the OTEP itself:

Keeping single-messages operations in the same trace can greatly improve the user experience.

And:

However, in some instances, adding this parent/child relationship might lead to undesired consequences, e. g. to large traces in scenarios where batches are published. Some further attention needs to be paid to those scenarios when the changes proposed in this OTEP are merged into the semantic conventions.

So maybe this can be left as a configuration option? Off by default and opt-in if you want the relationship added, or on by default and opt-out if the traces in your particular case end up being too big.

Glad to write some code (possibly taking the Sidekiq instrumentation as model), do some tests and open a MR if you guys don't have the bandwidth.

no-reply commented 1 year ago

i looked at this a bit today, attempting to get an implementation that supports the SHOULD and MAY mentioned in https://github.com/open-telemetry/opentelemetry-ruby-contrib/issues/526#issuecomment-1610784816

i've written up the issues i ran into in a commit message along with my best-effort implementation:

https://github.com/open-telemetry/opentelemetry-ruby-contrib/compare/open-telemetry:opentelemetry-ruby-contrib:main...no-reply:opentelemetry-ruby-contrib:bunny-prop


importantly, it's not true that the current implementation meets the SHOULD. instead, it adds the recommended links to the "Process" span. as described in the OTEP, it's probably not possible to add them to the "Receive" span without specification changes; the needed headers are fetched during that span.

i don't see a clear way forward at this point. we could always add a configuration flag, following the sidekiq instrumentation, and add the context as parent for the "Process" span. this will orphan "Process" from "Receive" though, and that seems like a pretty big downside.

if we can decide on a direction, i'm motivated to get an implementation. as it stands, i'm feeling pretty stuck.


Edit: it was easy enough to work up a version of this that does the thing i'm convinced isn't a good idea, so here: https://github.com/open-telemetry/opentelemetry-ruby-contrib/compare/main...no-reply:bunny-prop-2

lastly, the sidekiq instrumentation avoids this whole issue since it doesn't have a "Receive" span. according the linked OTEP, the sidekiq instrumentation SHOULD have a "Receive" span; it's "Process" span currently omits all the execution related to pulling a job out of Redis and handing it off to the middleware stack.

bolandoboy commented 1 year ago

Actually, AMQP/bunny following a push model, I think the new implementation should have just a deliver span encompassing the handling of messages to the application code according to the new OTEP, right? (Plus a settle span for acknowledgements, but that's another story.)

Coincidentally, I was going to open a new issue to discuss the Bunny::ReaderLoop#run_once and receive spans, for a number of reasons, and propose to remove them and revamp the instrumentation to meet the new semantic conventions, simplifying it considerably along the way.

What do you guys think it's the best place to discuss the whole situation? This issue, a new one, some other place/mechanism better suited for discussing "big changes"...?

Edit: @no-reply If we manage to make a deliver span the first one in the consumer, it should be easier to always link it to the last producer span, as the SHOULD says, and optionally add a parent/child relationship, right?

no-reply commented 1 year ago

Edit: @no-reply If we manage to make a deliver span the first one in the consumer, it should be easier to always link it to the last producer span, as the SHOULD says, and optionally add a parent/child relationship, right?

i think it depends on details about the new implementation, but my guess is it won't be any easier. the structure of the problem is that we need to open the "Deliver"/"Receive" span before delivery/reception actually takes place, but won't have the data necessary to link and/or propagate context until that process is substantially completed.

the larger reorganization you're suggesting makes sense to me (having only recently started looking at this library seriously), but i think it probably doesn't change the ground for this discussion either way.

arielvalentin commented 1 year ago

What do you guys think it's the best place to discuss the whole situation? This issue, a new one, some other place/mechanism better suited for discussing "big changes"...?

I will leave this decision to you all. Please let the @open-telemetry/ruby-contrib-approvers know how you intend on collaborating and we will support you all with feedback on design and in code reviews.

Also, if you all are able to join the community SIG meeting to discuss any details via zoom you are most welcome! Please check the OTel Community Calendar for meeting times.

Thank you very much to @no-reply @bolandoboy @mugli for your willingness to collaborate and help us improve this instrumentation!

bolandoboy commented 1 year ago

i think it depends on details about the new implementation, but my guess is it won't be any easier. the structure of the problem is that we need to open the "Deliver"/"Receive" span before delivery/reception actually takes place, but won't have the data necessary to link and/or propagate context until that process is substantially completed.

the larger reorganization you're suggesting makes sense to me (having only recently started looking at this library seriously), but i think it probably doesn't change the ground for this discussion either way.

Let me play this weekend a bit (yes, I have strange hobbies :-D), because I'm almost sure that at the point where messages are delivered to application code, we can access the message headers where the context propagated from the producer comes (IIRC).

Hopefully I can come back with an initial implementation, and then we can follow the steps to move it through the channels @arielvalentin mentioned above. (And if not, I'll leave it to more knowledgeable contributors.)

ericmustin commented 1 year ago

@bolandoboy "But if it be a sin to covet ~honour~ time on the wknds to work on OSS, I am the most offending soul alive"

no-reply commented 1 year ago

Let me play this weekend a bit (yes, I have strange hobbies :-D)

sounds good @bolandoboy.

Also, if you all are able to join the community SIG meeting to discuss any details via zoom you are most welcome! Please check the OTel Community Calendar for meeting times.

thanks @arielvalentin! i've added this to my calendar and will try to make it next week, if only to say hi.

bolandoboy commented 1 year ago

Finally managed to move some code around and get preliminary results.

(Slight tangent: I couldn't get the example in the repo to export traces to the collector + Jaeger. Started getting protobuf errors and eventually segmentation faults on a Mac M1. I had to end up using a non-Alpine Ruby image.)

I changed the example to use q.subscribe instead of q.pop, just because I'm more familiar with the push mode of AMQP and I knew better what methods to patch, but the change should be trivial for pop.

Removing the loop_once and receive spans, and sticking to spans the OTEP says (publish and deliver), a single message triggers the creation of 2 different traces, with a link from deliver to publish.

1 1 overview

        

The publish trace:

1 2 publish

        

The deliver trace:

1 3 deliver

        

The other option, which currently I toggle very dirtily commenting/uncommenting a particular line, but could be easily configurable, is to make the deliver span a child of publish. In this case only a single trace is created.

2 1 overview

        

The single publish + deliver trace:

2 2 publish-deliver

        

Code is dirty and sitting in my machine. I haven't even ran the tests and I must have broken most of them. If somebody is interested I can push the changes, but I think the point was seeing that this can be done, and then decide if we want to go down this route.

I'll try to attend the next SIG meeting. Thanks @arielvalentin!!!

bolandoboy commented 1 year ago

@no-reply In this branch there's some code that tentatively meets the future semantic conventions from OTEP 220: https://github.com/bolandoboy/opentelemetry-ruby-contrib/tree/otep-0220-messaging-semantic-conventions-span-structure

Changes with respect to the current version:

This is a draft. Things left for even considering releasing this (apart from the OTEP being merge in the spec):

github-actions[bot] commented 1 year ago

👋 This issue has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep label to hold stale off permanently, or do nothing. If you do nothing this issue will be closed eventually by the stale bot.

ericmustin commented 1 year ago

added the keep label here. Seems like open question still, unless this should've been closed per some context from sig mtgs or is better tracked via another issue.

sokratisg commented 1 month ago

Did this manage to get through allowing for a consumer trace to be part of the parent deliver context? Reached this point after reading the equivalent option of sidekiq, which seems pretty handy if you want to have all spans in a single view/graph