rabbitmq / rabbitmq-dotnet-client

RabbitMQ .NET client for .NET Standard 2.0+ and .NET 4.6.2+
https://www.rabbitmq.com/dotnet.html
Other
2.09k stars 586 forks source link

OpenTelemetry: Update messaging.operation span attribute to latest OTel Semantic Conventions #1715

Open iinuwa opened 3 weeks ago

iinuwa commented 3 weeks ago

Is your feature request related to a problem? Please describe.

The Activities in RabbitMQ client currently emit the messaging.operation attribute. However, the Semantic Conventions for messaging currently specify use messaging.operation.type for the values that are currently being set there, and to add messaging.operation.name to values like ack, nack, etc.

Describe the solution you'd like

At minimum, changing the messaging.operation attribute to messaging.operation.type would be helpful to comply with the conventions.

Adding messaging.operation.name would also be helpful for tracking the number of/searching for acks vs. nacks in the system. I'm not sure if there's a good place to add this in this library though; perhaps it would need to be done in higher level libraries.

Additional context

References https://github.com/rabbitmq/rabbitmq-dotnet-client/pull/1528.

michaelklishin commented 3 weeks ago

@lukebakken sounds like a reasonable last minute 7.0 change? 😢

michaelklishin commented 3 weeks ago

@iinuwa you are welcome to submit a PR that would rename the attribute. Whether it can ship in 7.0.0, I don't know. It sounds like the right moment to do it, since this feature is very new (in general and new in 7.x).

alrz commented 3 weeks ago

I'm not sure if there's a good place to add this in this library though

That would be actually quite useful if supported built-in as some generic .net metrics.

michaelklishin commented 3 weeks ago

It's surprising that messaging.operation.name reflects an operation type and messaging.operation.type is… for I'm not sure what, a different type.

In any case, renaming a metric for 7.0 sounds acceptable. Introduction of an another metric will have to go into 7.1, you are welcome to submit PRs.

iinuwa commented 3 weeks ago

Thanks Michael.

It's surprising that messaging.operation.name reflects an operation type and messaging.operation.type is… for I'm not sure what, a different type.

Yeah, I'm kind of confused too. I think type is supposed to be generic/constrained for creating dashboards across different messaging systems and name can be whatever you want. 🤷🏿‍♂️

In any case, renaming a metric for 7.0 sounds acceptable.

Added #1716 for changing the attribute name to messaging.operation.type. I also updated the corresponding attribute and operation name values to non-deprecated values since it seemed related.

michaelklishin commented 3 weeks ago

OpenTelemetry support was originally contributed in https://github.com/rabbitmq/rabbitmq-dotnet-client/pull/1528 by @stebet, @lmolkova and others. Perhaps we should also ask for their opinion.

lukebakken commented 3 weeks ago

I am pinging these people because they all reviewed #1528 -

@danielmarbach

@lmolkova

@martinjt

@Kielek

@cijothomas

@WeihanLi

lmolkova commented 3 weeks ago

It's surprising that messaging.operation.name reflects an operation type and messaging.operation.type is… for I'm not sure what, a different type.

Yeah, I'm kind of confused too. I think type is supposed to be generic/constrained for creating dashboards across different messaging systems and name can be whatever you want.

This is exactly the point. We want systems to have operation names that use their own terminology, but we also need unified type that's common across different systems. The latter one is useful for common dashboards, alerts and helps to move between messaging systems.

There could also be multiple different operation names with the same type. E.g. Azure ServiceBus has receive, peek and receive_deferred operation names which all map to receive type. Or there could be some operations like admin ones that have a name, but don't have a type since it does not make much sense to unify them across systems.

I hear you both that it's confusing. Let me know if you have any thoughts on how to make it less confusing :)

PS: will take a look at https://github.com/rabbitmq/rabbitmq-dotnet-client/pull/1716

iinuwa commented 3 weeks ago

I'll take a look this morning, thanks!

michaelklishin commented 3 weeks ago

@lmolkova this is probably not the right place to discuss this but messaging.operation.type seems to cover both messaging and topology operations, which is not wrong per se but then why does it have create but not delete? If you declare a queue or a stream, they won't necessarily live forever.

As for .name, it is a parallel track of operations. If I was designing this today, I'd use

  1. messaging.operation.type: create | delete | publish (or send) | deliver (or receive) | ack | nack | message_return (a message was returned) | consumer_registration | consumer_cancelation
  2. messaging.operation.subtype then would replace messaging.operation.name as a field that can contain arbitrary information

In any case, our team understands that it's not a decision our team will probably influence a great deal, so we will follow the current naming scheme even if it leaves a bit to be desired.

lmolkova commented 3 weeks ago

Thank for the feedback! You can actually influence our design quite a lot - this is a perfect time for it (we plan to stabilize sometime soon and after that breaking changes will have very high bar).

why does it have create but not delete. If you declare a queue or a stream, they won't necessarily live forever.

The messaging.operation.type only covers operations for which we see a strong need for unification. E.g. you create a dashboard that shows messaging throughput. You can get number of incoming messages by filtering by operation type == send (plus any other dimensions) - you can do this whether you use RabbitMQ or Active MQ.

When tracing backends visualize stuff, they need to know some things about these special operations to visualize them nicely.

But also, if I use e.g. Azure ServiceBus, having a common name send is not covering all cases. e.g. I can schedule a message, so I need something else to show the real operation name.

When it comes to queue/consumer deletion - these operations seems to be less of an interest when it comes to visualizations/common dashboards at least at the moment. I.e. you can have an arbitrary, broker-friendly name and not having a type defined won't hurt anyone. We can definitely grow a list of recognized operation types over time and include things like consumer_registration.

As for .name, it is a parallel track of operations. If I was designing this today,

Do you feel that the messaging.operation.name is confusing in some way? Why would you prefer the messaging.operation.subtype ?

we picked this name because it aligns well with similar notions we have for other technologies (e.g. we have db.operation.name or gen_ai.operation.name). It's more than a subtype - that's actually the primary thing people who use your library/system know about their operation.

messaging.operation.type: create | delete | publish (or send) | deliver (or receive) | ack | nack | message_return (a message was returned) | consumer_registration | consumer_cancelation

That's where things get tricky with operation types. They need to apply to most of the messaging systems. E.g. ack/nack/checkpoint/commit/completion/abandonment/dead lettering are all forms of settlement. For this one we want to unify since it's such an important part (you want to know how message processing has ended).

As I mentioned above, consumer_registration|cancellation|etc could be added incrementally.

In any case, happy to hear your thoughts

michaelklishin commented 3 weeks ago

@lmolkova for RabbitMQ specifically, high topology churn (due to questionable client library usage or design choices in applications) has been a serious enough headache over more than a decade. So declaration and deletion events are an important aspect to monitor, not just message publishing or consumption rates.

These days RabbitMQ has three related metrics specifically so that operators can spot topology churning applications. Perhaps OpenTelemetry for messaging should provide a couple of well-known operation types.

michaelklishin commented 3 weeks ago

Do you feel that the messaging.operation.name is confusing in some way? Why would you prefer > the messaging.operation.subtype ?

Only because to me an operation does not have a name, it has a type (or method/frame, in messaging protocol nerd speak).

It's more than a subtype - that's actually the primary thing people who use your library/system know about their operation.

The docs do not state that strongly enough.

They need to apply to most of the messaging systems. E.g. ack/nack/checkpoint/commit/completion/abandonment/dead lettering are all forms of settlement. > For this one we want to unify since it's such an important part (you want to know how message processing has ended).

I understand that and I've tried to not use terms that would be too specific to RabbitMQ or its original protocol. "settlement" from AMQP 1.0 covers a lot of ground but must be clarified, hence the .name.

The current naming, with the above explanation, should be enough for us to continue. But at the very least the purpose of .name must be clarified in the docs, with a few examples for AMQP 1.0, MQTT, AMQP 0-9-1, RabbitMQ Stream and Kafka protocols.

I'm willing to provide a few examples for the first four in case you accept external contributions :)

michaelklishin commented 3 weeks ago

So my current understand is the following:

This client can work with this. Thank you for the clarification @lmolkova.

lmolkova commented 3 weeks ago

Thanks a lot for the details @michaelklishin!

I'll go through the points you raised and convert them into work items in semantic conventions later today (or if you want to do it - go ahead)

I'm willing to provide a few examples for the first four in case you accept external contributions :)

we do accept contributions and really appreciate them at https://github.com/open-telemetry/semantic-conventions - lmk if you need any help with it.

michaelklishin commented 3 weeks ago

@lmolkova I can only use examples from the five protocols supported by RabbitMQ, including AMQP 0-9-1 which one OTel "committee big shot" at would vehemently object to. If you think that's still a good idea, should I submit a PR with updates of this file specifically? https://github.com/open-telemetry/semantic-conventions/blob/14c1d79ade64a5fddccf337fc2b00e750e110c59/model/messaging/common.yaml#L18

iinuwa commented 3 weeks ago

Thanks for the discussion @michaelklishin and @lmolkova! I have some separate considerations

Release considerations

Now that I'm looking at this a bit more, I wonder if we should just leave this as is for 7.0. I don't want to delay the release of this client, since the async and performance changes are much more important, and @lukebakken has already indicated that adding new attributes should be pushed to 7.1. I'm thinking we should defer these operation name and attribute name and value changes to 7.1 and make sure to coordinate operation names with other major libraries; at least the ones that the RabbitMQ team officially supports.

A caveat: changing the operation and attribute names are technically breaking changes, since it will mess with people's dashboards. But the RabbitMQ and Messaging Semantic Conventions are marked as experimental, so perhaps it would be OK to break this in a minor RabbitMQ client version. (Maybe it'd be good to denote that in the release notes that OpenTelemetry schema in the RabbitMQ .NET client is also experimental and subject to change.)

Naming things

A quick sample of operation names for a few clients:

Language Library Built-In/Wrapped RabbitMQ-Supported Values
Java rabbitmq-java-client Wrapped Yes send, receive
Node amqplib Wrapped No publish, consume, ack, nack, reject
Python pika Wrapped Yes send, receive
Go amqpgo None* Yes (none yet, in progress)
Erlang rabbitmq-erlang-client None Yes N/A
PHP php-amqplib None Yes N/A

All of these use the old messaging.operation attribute (except the Go one, which is still in progress).

My recommendations for the names would be to use the AMQP operation names, stripping basic. : AMQP/RMQ operation OTEL Operation Type OTEL Operation Name Description
basic.publish send publish sending a message
basic.deliver process deliver receiving a message from a subscription initiated by basic.consume
basic.get receive get pulling an message with basic.get
basic.consume (none) consume start a subscription with basic.consume
basic.ack settle ack Mark a message as processed
basic.nack settle nack return a message back to broker
basic.reject settle reject return a message back to broker

(Alternatively, we can replace periods . with underscores _, e.g. basic_get. But this is lengthy, and I don't think AMQP will add another prefix in the future, so I prefer the short one. It's also closer to what the RabbitMQ and messaging ecosystem at large is using)

michaelklishin commented 3 weeks ago

@iinuwa we will accept the changes you have submitted, they are small and bring us up-to-date with the current state of OpenTelemetry naming conventions. All other changes will have to wait until 7.1.

iinuwa commented 3 weeks ago

@michaelklishin, OK. I'll split off the other work into a separate PR then. Thanks!

michaelklishin commented 3 weeks ago

@iinuwa by the way, the idea of stripping basic. for operation names sounds good to me except for basic.get which is explicitly called out in the docs and management UI, for several reasons. Alternatively we can use "polling fetch" for operation name.