Consider allowing to set remote span as parent instead of link when propagating context via RabbitMQ

josh-romanowski commented 2 months ago

Is your feature request related to a problem? Please describe.

In the past we had our own implementation where we implemented distributed tracing with OpenTelemetry over RabbitMQ. So we were really excited to see native support of distributed tracing in the library.

We modeled our spans in a way where the span in the publisher is the parent of the spans created in consumers of a message. And we would like to continue modeling our messages in that way.

While the instrumentation library code here suggests that a parent child relationship is intended in the library, and the implementation here also suggests that at least the possibility to set the extracted remote context as parent, the actual implementation sets the extracted context as link. Also, the current public interface for the ContextExtractor also does not allow to overwrite or configure the OpenTelemetryContextExtractor in a way that would allow setting the extracted context as parent.

Describe the solution you'd like

Instead of adding the remote span as a link to the span in a consumer, add the span as a parent.

Describe alternatives you've considered

If it is not wanted to always set the remote span as a parent by default (see semantic conventions), at least give the possibility to overwrite the ContentExtractor in a way that allows setting either/or both parent and link based on the implementation. This would at least allow us to write our own extractor that sets the span as a parent.

Additional context

Semantic conventions for messaging systems

lukebakken commented 2 months ago

@josh-romanowski a PR would be welcome. Please note that I am releasing version 7 this week.

cc @stebet @lmolkova @martinjt @@lailabougria @cijothomas - thoughts? You all understand OTEL far better than I do.

lmolkova commented 2 months ago

Instead of adding the remote span as a link to the span in a consumer, add the span as a parent.

OTel semantic conventions RECOMMEND link, but it's totally valid to use the same context as a parent and a link

lukebakken commented 2 months ago

I will assume that any changes needed will not affect the public API for RabbitMQ.Client. I've put this in the 7.1.0 milestone for that reason.

Tornhoof commented 2 months ago

Yeah making it configurable would be nice, our own implementation also used a parent-child relation and at least for Grafana it's a break in the flow If you follow the traces. I'd then switch back to parent-child too.

josh-romanowski commented 2 months ago

As it appears the idea of allowing also to use the context as parent resonates, I can have look at making a PR for that.

@lukebakken I would assume making that configurable would at least require a small addition to the API of RabbitMQActivitySource. Would that be okay with you?

And the other question would be how you envision the configuration for that? Thinking about it probably adding a flag for what to use the extracted context for is more appropriate than actually extending ContentExtractor. I would assume that you would always use the same extracted context for both relations. And then rather decide what to use it for. What is your opinion?

Edit: clarifying the third section

lailabougria commented 2 months ago

I tend to agree with @Tornhoof, I would link by default, adhering to the conventions as @lmolkova pointed out, but making it configurable could help users who prefer having the parent-child relationship.

lmolkova commented 2 months ago

setting parent by default (in addition to link) is fine too - that's what we're going to do (or do already) in Azure ServiceBus EventHubs SDKs.

I believe @martinjt has the case for turning parent-child off, but (sorry Martin) he's the only one I heard from who wants links only. At the same time we've got overwhelming feedback to set parent and have one trace.

lukebakken commented 2 months ago

@josh-romanowski a PR would be welcome and yes, now is the time for API changes.

martinjt commented 2 months ago

The case for parent-child revolves around the idea that messages are sync, or close to sync. Which isn't always the case, and in my talking to a lot of Event Driven Architecture people, they confirmed that unless you're doing synchronous/transactional messaging, you shouldn't assume that.

With that in mind, having a parent/child relationship where the gap could be days break most trace waterfall views, and isn't really in keeping with the ideas behind how you model traces.

So if you're doing sync or close to synchronous messaging, then it does make sense to make them parent/child, but the default shouldn't be that, since it's then a different context you're working in, and you're not moving back up the chain with a response.

To be clear though, I've advocated for it to be configurable by the implementer. It shouldn't be both, you should be able to define, at a message type level, whether this invocation is part of a transaction/request/context, or whether this should start a new context and link, and ultimately, even whether it should be linked at all.

My personal view is that all telemetry is you, as the developer who knows your system, modelling your debug experience to make sense, therefore it should be optional, configuration, and with sensible defaults that will scale. Using links here is what should be happening by default, but the fact that it's not configurable is a super valid enhancement.

(I'm slightly worried that it's not going to be like that in the ServiceBus SDKs now @lmolkova, which would be a mistake)

lailabougria commented 2 months ago

@lmolkova @martinjt For context, we've recently made changes to NServiceBus.

The approach we have now is as follows:

If you send a message, you get parent-child - the receive operation is a child of the send operation
If you publish a message, you get links - a publish is a fan-out and has an unknown amount of subscribers, which is why we link.
If you send a delayed message (delayed delivery), you get links - this is guaranteed to happen with a delay, so a link makes sense to limit the duration of the trace

Apart from the last option, where we enforce links, we've provided APIs that allow you to deviate per send/publish to alter the behavior. So if you prefer a parent-child when publishing, you can configure that. And when you prefer linking for a send, you can do that too.

We've also seen most users ask for parent-child relationships. However, we've also gotten feedback that when everything is parent-child everywhere, you get a never-ending trace that's way too hard to understand. In a message-based system, almost every action can lead to a follow-up send/publish operation, making traces very long. This is why we provided APIs that allow users to deviate at the operation level, as they have the best context of what makes sense to include/exclude from the trace.

martinjt commented 2 months ago

@lailabougria that's exactly as I would expect the defaults in a world where you can differentiate between the modes, and having to ability override for each operation is perfect to encourage that flexibility around designing the debug experience properly for the developer's usecase.

I may have changed the send to a link, personally, but I know you'll have done the work to validate that with the users, and the fact that they can switch is totally fine.

SzymonPobiega commented 2 months ago

Let me add my two cents also

My understanding is that RabbitMQ is a bit of a special type of a messaging broker because it has some unique features that make it well suited for synchronous request-reply communication. The tricky bit is, the knowledge if a given message is supposed to be considered async or sync is available only at the site when it is created, likely at the level of a higher-level framework, such as MassTransit or NServiceBus.

Expanding on what @lailabougria wrote, the parent/link decision process in NServiceBus happens entirely on the sending side, following the logic I outlined above: it is the sender who has the context to decide, for any given message. In addition to that, NServiceBus will prioritize the decision made by the messaging SDK over its own link/parent logic if the SDK does support OpenTelemetry and if NServiceBus transport adapter has been updated to take advantage of that.

Based on that, if messaging SDK does not support OpenTelemetry, the decision to link/parent is made on the sending side with defaults being link for events and parent for commands.

If, however, the SDK does support OpenTelemetry, NServiceBus will ignore the hints of the sending framework side and use whatever strategy SDK receiver employed.

Now, coming back to my first point about the sync/async context being available only on the sending side, I would like to propose a following design:

The sending code (either the code that calls RabbitMQ directly or a code of a higher-level framework such, as MassTransit or NServiceBus) decides if a given message is to be treated as sync or async with respect to tracing. Defaults. such as events are async while commands are sync. apply here. The decision is somehow encoded in the message that is passed to the RabbitMQ SDK
RabbitMQ SDK sending side encodes this decision in the message that is dispatched. General overrides such as "always use parent" or "never use parent" might apply here.
RabbitMQ SDK receiving side decodes the link/parent decision and creates activity accordingly
The receiving code (either the code that calls RabbitMQ directly or a code of a higher-level framework, such as MassTransit or NServiceBus) uses that activity as a parent

This would allow the higher-level frameworks to make sensible decisions for the users on the receiving side regardless if the broker SDK has the OpenTelemetry support enabled or disabled.

josh-romanowski commented 2 months ago

I really appreciate the detailed discussion.

However, I have to admit that as a first time contributor some of the points are simply beyond my knowledge of how to implement them in the project here. A general overwrite would already meet the requirements I would have for a solution.

If that's ok with everyone, I'll provide a simple PR that allows a general override. For a more detailed configuration, I'd ask you guys to expand on the suggestion then. I'll be on FTO starting tomorrow and unfortunately won't have the capacity to deal with the details of a more detailed configuration until then.

ngbrown commented 2 weeks ago

I just ran into this issue. I have a system that does both fan-out and RPC, so with the RPC calls, I want the Activty.Parent to be populated from the traceparent header value. Currently, with v7.0, the result of ContextExtractor() is being passed into StartLinkedRabbitMQActivity(linkContext: ) and parentContext is never passed from any of the call sites.

My traces currently look like this and there are three separate top level trace id's when I just want one (currently the three are: send request, handle request, handle response).

Screenshot 2024-11-08 205126

Without taking into account the fan-out case where activity links might be preferred, I have a branch where I move the extracted context to the parentContext parameter and get the traces that I want.

branch: https://github.com/ngbrown/rabbitmq-dotnet-client/blob/fix-activity-parent commit: https://github.com/ngbrown/rabbitmq-dotnet-client/commit/5b713870054dfb939a75902b7202b27de1d2152c

Screenshot 2024-11-08 204741

So this should be configurable in some way. Probably on a per-message basis, and if there is an automated way to tell if the message came from a topic exchange that would be ideal (since that's the only way I pass fan-out messages). For real-time fan-out messages, even to lots of receivers, I would still rather have the normal parent/child relationship.

As an aside, this is using Serilog/SerilogTracing with the Seq web service. The https://github.com/serilog-tracing/serilog-tracing/pull/153 is still pending and my RabbitMQ adaptor for SerilogTracing is here: https://github.com/ngbrown/grpc-testing/tree/with-serilog-tracing-pr/SerilogTracing.Instrumentation.RabbitMQ.

Even if SerilogTracing understood Activity linking, the linked trace id keeps changing, so indexing on that isn't really that useful.

eerhardt commented 1 day ago

Just adding a +1 here to being able to have a parent-child relationship for the "Receive/Process" span. With the .NET Aspire Dashboard, having the parent-child relationship makes the Tracing much easier to consume/understand.

The "Links" show up in the bottom of the details of a span, but they are hard to find and see in one shot:

Compare this with the experience with Azure ServiceBus, where the order processing gets nested in the whole trace that kicked it off:

ServiceBus also has Links, so you can see them as well:

My 2 cents (for what it is worth) is I think for a normal BasicPublishAsync => AsyncEventingBasicConsumer flow, having the parent-child relationship plus Links by default makes sense, and then having an option to disable the parent-child relationship, if necessary.

rabbitmq / rabbitmq-dotnet-client