Messaging: record queue vs topic behavior correctly

lmolkova commented 1 year ago

In open-telemetry/opentelemetry-specification#3214 we're removing messaging.destination|source.kind attribute as it's not clear what messaging system behavior it captures and how we can use this information on the backend or as an end-user.

Here's the summary of @dpauls findings https://cloud-native.slack.com/archives/C02Q4AAHDSA/p1678284552744549 on what would be useful to capture through a different set of attributes.

Oberon00 commented 1 year ago

Here's the summary of @dpauls findings cloud-native.slack.com/archives/C02Q4AAHDSA/p1678284552744549 on what would be useful to capture through a different set of attributes.

Please copy the summary into the issue description. It should not be necessary to create a CNCF slack account and follow a link to understand important information about the issue.

carlosalberto commented 1 year ago

+1 to @Oberon00's suggestion.

lmolkova commented 1 year ago

Sure, here's the original comment from @dpauls with my attempts to preserve formatting

In our Messaging SIG meeting on Mar. 2, I agreed to look into the difference between queue and topic in the context of JMS, and whether this is useful in a trace. I'll summarize my findings below, but the summary is that I believe there's value, at least in JMS messaging, to indicate in some way the following attributes of destinations:

- topic vs. queue (indicates point to point or PTP vs. Pub/Sub model): I think this is useful because it helps someone viewing the trace to understand whether the message might be expected to go to 0..n consumers or to a single consumer. Back ends could use this information to highlight interesting scenarios. For example:
  - Multiple receive spans for the same message in the same trace might be highlighted if a PTP model is being used, as this indicates the possibility of a duplicate message.
  - When, in the future, intermediary tracing is formalized, published messages that go to 0 destinations might be of particular interest for the Pub/Sub model (in the case of PTP, this is usually easily identified as an error, but it may or may not be one in the case of Pub/Sub). Without intermediary tracing, 0 receive spans might be interesting, but this would always be the case with consumers who aren't currently connected.
- durability (i.e. temporary or durable).

I have some thoughts on why durability is useful in a trace, but that hasn't come up yet in our discussions, so I'll save that for later if it becomes contentious. If what topic vs. queue truly provides is insight into the messaging model, might it be preferable and more generic to name it: messaging.model = PubSub or PTP (exact details of names here may be up for debate; we could.) I suggest messaging.model rather than destination.model or destination.messaging_model because it is really a property of the message rather than where it is going. Alternatives to model could be style (per the JMS spec below) or pattern.

On the subject of uniqueness, I see evidence that other brokers allow queues and topics to have overlapping names. For example, this ActiveMQ page says:

Note: While it is possible to configure a JMS topic and queue with the same name, it is not a recommended configuration for use with cross protocol.

I would agree with this best practice. Using topic vs. queue to qualify source/destination uniqueness feels awkward. Since the messaging semantic conventions use SHOULD in relation to name uniqueness, I don't think it's necessary to use topic vs. queue in uniqueness, as long as users follow the best practice where topics and queues get their own unique names.

A reference on the PTP vs. Pub/Sub model: JMS 2.0 (also described in the 1.1 spec, and probably applies to Jakarta JMS as well) section "1.1.3 JMS Domains" says:

JMS supports the two major styles of messaging provided by enterprise messaging products:

Point-to-point (PTP) messaging allows a client to send a message to another client via an intermediate abstraction called a queue. The client that sends the message sends it to a specific queue. The client that receives the message extracts it from that queue.

Publish and subscribe (pub/sub) messaging allows a client to send a message to multiple clients via an intermediate abstraction called a topic. The client that sends the message publishes it to a specific topic. The message is then delivered to all the clients that are subscribed to that topic.

dpauls commented 1 year ago

As a way of capturing the useful concept that I think destination.kind of topic vs. queue was aiming at, I would propose something roughly like the following. Add an attribute named messaging.pattern. This value would be optional, and SHOULD (MAY?) be provided if the messaging pattern fits one of the patterns described below.

- pubsub: The message was published using a publish/subscribe messaging pattern, which may result in many consumers of the message. If there are zero consumers, this may be of interest to observers; perhaps this source could be shut down, or perhaps there is a problem with consuming applications that should be investigated.
- ptp: The message was published using a point-to-point messaging pattern, which is expected to result in a single consumer of the message. If the message is received by multiple consumers, this may be of interest to observers; it indicates a possibility of duplicate message processing, which can be a problem for some messaging applications.

The terminology of pubsub and ptp is adopted from the JMS 2.0 specification section 1.1.3.

We could extend this over time if there are other messaging patterns identified where we might expect back ends to be able to possibly identify interesting traces.

If there are no significant objections to this, I could progress on preparing a PR and we could finalize details relating to wording, naming, and MAY vs. SHOULD as we go along.
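To make the proposal concrete, here is a minimal sketch of how a back end might act on such an attribute. The key `messaging.pattern` and the values `pubsub` and `ptp` come from the proposal above and are not part of any released convention; the function name, the plain-dict span representation, and the hint strings are assumptions made for illustration only.

```python
# Hypothetical attribute key and values, taken from the proposal in this thread.
MESSAGING_PATTERN = "messaging.pattern"
PUBSUB = "pubsub"
PTP = "ptp"


def flag_interesting(publish_attrs, receive_span_count):
    """Return a short hint for a published message, or None if nothing stands out.

    publish_attrs: dict of attributes on the publish span (illustrative shape).
    receive_span_count: number of receive spans seen for the same message.
    """
    pattern = publish_attrs.get(MESSAGING_PATTERN)
    if pattern == PTP and receive_span_count > 1:
        # Point-to-point expects a single consumer.
        return "possible duplicate processing"
    if pattern == PUBSUB and receive_span_count == 0:
        # Zero consumers may or may not be a problem for pub/sub.
        return "no consumers observed"
    return None
```

Note that the hints are only hints: zero receive spans can also mean the consumer is not instrumented or was sampled out.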

tylerbenson commented 1 year ago

Including the text from the linked document:

1.1.3. JMS domains

JMS supports the two major styles of messaging provided by enterprise messaging products:

Point-to-point (PTP) messaging allows a client to send a message to another client via an intermediate abstraction called a queue. The client that sends the message sends it to a specific queue. The client that receives the message extracts it from that queue.

Publish and subscribe (pub/sub) messaging allows a client to send a message to multiple clients via an intermediate abstraction called a topic. The client that sends the message publishes it to a specific topic. The message is then delivered to all the clients that are subscribed to that topic.

I'd like to point out that each section still uses the terms queue and topic. These are very widely understood terms and I suggest we keep using them when relevant. In this case, maybe you can argue that we're trying to describe the messaging pattern where a queue or topic is a particular tool used to implement that pattern, similar to what is being done in that JMS spec. Either way, I think it is an unfortunate naming decision.

tylerbenson commented 1 year ago

Another potential concern... From an instrumentation perspective, it is much easier to automatically determine whether a particular destination is a topic or a queue. For anything that happens outside the bounds of the instrumented application (inside the broker, for example) it is difficult to accurately depict how it should be modeled. As a result, I think we should focus more on what can be determined within an application than on complete accuracy.

lmolkova commented 1 year ago

I'd like to bring us back to the discussion of how it's going to be used and what it would tell us. Here's some research:

Many systems support only queues or topics.
- queue only: AWS SQS, Google cloud tasks, Azure Storage Queues
- topic only: Kafka, Pulsar, RocketMQ, AWS SNS, Azure EventGrid, EventHubs, Google Pub/Sub
- mixed: RabbitMQ, JMS, Azure ServiceBus

I.e. the attribute, if introduced, would not add new information for most of the systems. It can already be assumed from the messaging system name. There could still be value in adding it (so that backends don't need to maintain the mapping).
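The mapping a backend would otherwise have to maintain could look roughly like the sketch below. The grouping is copied from the list in this comment; the strings are illustrative labels rather than official `messaging.system` values, and the function name is hypothetical.

```python
# Grouping copied from the comment above; labels are illustrative only.
QUEUE_ONLY = {"AWS SQS", "Google Cloud Tasks", "Azure Storage Queues"}
TOPIC_ONLY = {
    "Kafka", "Pulsar", "RocketMQ", "AWS SNS",
    "Azure EventGrid", "EventHubs", "Google Pub/Sub",
}
MIXED = {"RabbitMQ", "JMS", "Azure ServiceBus"}


def infer_destination_kind(system):
    """Infer "queue" or "topic" from the system name alone.

    Returns None for mixed or unknown systems, where the name is not enough.
    """
    if system in QUEUE_ONLY:
        return "queue"
    if system in TOPIC_ONLY:
        return "topic"
    return None
```

Only the systems in `MIXED` would gain new information from an explicit attribute.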

Topic/queue distinction is not necessarily known on the producer side:
- ServiceBus topic/queue behavior on the producer is the same. On the API surface, you can publish to a topic, but it'll be sent to a corresponding entity which can be a queue.
- RabbitMQ: you only know the name of the exchange you publish to, but not its type
- JMS is an abstraction, and the actual messaging systems underneath (e.g. AMQP-based) do not necessarily care about queue/topic terminology

Queue/Topic does not imply specific behavior:
- Apache Pulsar allows using topics as queues with exclusive subscriptions
- It's common to fork messages to multiple queues for reliability and leave it up to consumers to deduplicate
- Delivering messages to the same consumer service multiple times as a retry mechanism is also common.
- It's common to route messages from queue to topic using integrations. E.g. ServiceBus (queue) to EventGrid (topic)
- Or vice versa: AWS SNS to SQS

Based on all the above, let's think about telemetry backend behavior or end-user who sees queue/topic attribute:

Queues:

- message never delivered
  - sampled out?
  - consumer not instrumented?
  - still in the queue? <- might need to be alerted depending on the application
  - actual bug <- this needs to be alerted
- message delivered once: all good
- message delivered to 3 different services:
  - maybe it was forked and forwarded? i.e. intentional
  - wrong configuration? <- this is worth the attention
- message delivered to 3 different service instances
  - retries?
  - forked message?
  - competing consumers and intentional?

Topics:

- message never delivered
  - exactly the same options as for queues
- message delivered 1+ times: without knowing how many subscriptions are there, can't say if it's right

I.e. we can only say (with some unknown level of confidence) that:

a message that was sent to a queue and consumed by multiple different services looks suspicious, unless the user configured queue -> topic forwarding.
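The queue cases above can be written down as a small classifier, which also shows how few of them lead to a firm conclusion. This is only a sketch: the function name, the `(service, instance)` tuple shape, the `forwarding_configured` flag, and the result labels are all assumptions made for illustration.

```python
def assess_queue_delivery(receives, forwarding_configured=False):
    """Classify deliveries of one message that was sent to a queue.

    receives: list of (service_name, instance_id) tuples, one per receive span.
    forwarding_configured: prior knowledge that the user set up forwarding.
    """
    if not receives:
        # Sampled out, consumer not instrumented, still queued, or a real bug.
        return "unknown"
    if len(receives) == 1:
        return "ok"
    services = {service for service, _ in receives}
    if len(services) > 1:
        # Several different services consumed the same queue message.
        return "ok" if forwarding_configured else "suspicious"
    # Same service, several deliveries: retries, forks, or competing consumers.
    return "ambiguous"
```

Only one branch yields "suspicious", and even that one depends on knowledge of the system that the trace itself does not carry.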

From my perspective, there is no general-purpose tracing analysis that could be done using queue/topic terminology without prior knowledge of the system.

I hope to see some deterministic analysis examples that apply to at least several messaging systems.

pyohannes commented 1 year ago

From my perspective, there is no general-purpose tracing analysis that could be done using queue/topic terminology without prior knowledge of the system.

In my understanding, the main point here is that with knowledge of the system the proposed attributes are redundant, except for the few cases where systems support both queues and topics (or ptp and pubsub).

message never delivered

[...]

message delivered once: all good

message delivered to 3 different services:

[...]

message delivered to 3 different service instances

I wonder if the keyword here should be "settled" rather than "delivered"? From a simplified point of view, one could say that one expects a message in a queue to be settled successfully exactly once, whereas a message from a topic can be settled successfully several times. However, things like forked messages, fire-and-forget, checkpoint-based (batch) settlement blur this picture a lot.

Let's discuss again next Thursday. If we don't reach a consensus, I propose to postpone this discussion until we work on settlement attributes (which is a blocker for stable semantics), as I think there are some generic differences between settlements in queue- and topic-based scenarios, at least in a simplified view. If we don't come to a consensus in that context, I'd recommend making the issue a non-blocker for stable semantics.

Oberon00 commented 1 year ago

Well, having basic information about the entities interacted with is always nice. Even if you maybe can't do conclusive problem analysis, you can at least show a nice icon or group by queue vs topic

lmolkova commented 1 year ago

Well, having basic information about the entities interacted with is always nice. Even if you maybe can't do conclusive problem analysis, you can at least show a nice icon or group by queue vs topic

Great point!

So, if we have an attribute, we should:

- recommend it for messaging systems that support both modes (queues and topics). We can start there and see if it should have a wider scope.
- recommend it only when it's known. In some cases it's only known on the consumer side, which is fine for the icon, since the producer will also point to the same node.

The last part is how to name such an attribute and whether it should be a general-purpose one.

In the case of Pulsar, for example, wouldn't it make more sense to record the specific subscription type instead of saying it's a topic or queue (p2p or pubsub)? For Rabbit, wouldn't exchange_type be more precise and descriptive than topic or queue? It feels like we're trying to fit a spectrum of different behaviors into a JMS abstraction and they don't fit.

So if we're going to work on it, we should have examples of how to capture it, on which spans and how to map it to specific messaging systems. In any case, it's an additive change that can be added at any moment into the spec once there is clarity.
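As a starting point for such examples, the two recommendations above (mixed systems only, and only when the kind is known) might be applied by instrumentation roughly as follows. The attribute key `messaging.destination.kind` is used purely as a placeholder, since naming is still open in this thread; the system labels and the function name are likewise hypothetical.

```python
# Systems listed as "mixed" earlier in this thread; labels are illustrative.
MIXED_SYSTEMS = {"RabbitMQ", "JMS", "Azure ServiceBus"}

# Placeholder key: the final attribute name is undecided.
KIND_KEY = "messaging.destination.kind"


def destination_kind_attributes(system, known_kind=None):
    """Return the attributes to record, or an empty dict if none apply.

    known_kind: "queue", "topic", or None when the instrumentation cannot tell
    (for example, on the producer side of some systems).
    """
    if system not in MIXED_SYSTEMS or known_kind is None:
        return {}
    return {KIND_KEY: known_kind}
```

A consumer that knows the kind records it, while a producer that does not simply omits the attribute.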

lmolkova commented 1 year ago

I wonder if the keyword here should be "settled" rather than "delivered"? From a simplified point of view, one could say that one expects a message in a queue to be settled successfully exactly once, whereas a message from a topic can be settled successfully several times. However, things like forked messages, fire-and-forget, checkpoint-based (batch) settlement blur this picture a lot.

I think there are more behaviors to capture:

- settlement kind (offset, message, ack/nack/dead-letter)
- retention (are messages removed or kept for other subscriptions)
- subscription information (name, type, etc.)

p2p/queue and pubsub/topic would not be enough to capture the above behaviors or properties reliably.

open-telemetry / semantic-conventions

Messaging: record queue vs topic behavior correctly #1220
