open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0

Messaging: how to trace settle operation #1162

Open lmolkova opened 1 year ago

lmolkova commented 1 year ago

Settlement in messaging is an important operation that indicates that a message was consumed.

  1. Settlement can happen in different ways:
    • on the broker (usually after the consumer acks that the message was received)
    • on the consumer, automatically by the client SDK, when the message is delivered to the user application and the callback completes successfully
    • explicitly by user code
    • some messaging systems support peeking messages without actually settling them

Is it worth recording the configuration somehow? E.g. if settlement is configured to happen on the consumer and some messages are not settled at all, it's a clear issue.
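As a rough illustration of why recording the mode could help, the sketch below flags messages that were received but never settled when the consumer is expected to settle them. The attribute name `messaging.settlement.mode`, its values, and the `SpanRecord` type are assumptions made for this example and are not part of any convention.

```python
from dataclasses import dataclass, field

# Hypothetical attribute name, used only for this illustration.
SETTLEMENT_MODE = "messaging.settlement.mode"
MESSAGE_ID = "messaging.message.id"


@dataclass
class SpanRecord:
    """Simplified stand-in for an exported span (not an OpenTelemetry type)."""
    name: str  # "receive" or "settle" in this sketch
    attributes: dict = field(default_factory=dict)


def find_unsettled(spans):
    """Return IDs of messages received in consumer-settled mode that have no settle span."""
    settled = {s.attributes[MESSAGE_ID] for s in spans if s.name == "settle"}
    return sorted(
        s.attributes[MESSAGE_ID]
        for s in spans
        if s.name == "receive"
        and s.attributes.get(SETTLEMENT_MODE) == "consumer"
        and s.attributes[MESSAGE_ID] not in settled
    )


spans = [
    SpanRecord("receive", {MESSAGE_ID: "m1", SETTLEMENT_MODE: "consumer"}),
    SpanRecord("settle", {MESSAGE_ID: "m1"}),
    SpanRecord("receive", {MESSAGE_ID: "m2", SETTLEMENT_MODE: "consumer"}),
    # Broker-settled messages are not expected to produce a consumer settle span.
    SpanRecord("receive", {MESSAGE_ID: "m3", SETTLEMENT_MODE: "broker"}),
]
unsettled = find_unsettled(spans)
```

Here only `m2` would be reported: `m1` has a matching settle span, and `m3` is settled on the broker, so a missing consumer-side settle span is not a problem for it.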

  2. The settlement also means different things depending on how message processing went
    • the message is completed successfully
    • the message processing failed transiently and the message should be redelivered
    • the message processing failed terminally (and might be moved to dead-letter queue, etc)

This distinction usually makes sense when messages are settled individually, which is supported by systems like RabbitMQ or Azure Service Bus. Related question: some systems expose information like a redelivery count or a boolean flag (JMS). Is it worth recording?

  3. Messaging systems vary in terms of what they settle:
    • per-message
    • per offset (and someone can come back later and process it again)
    • maybe there are other behaviors?

When an offset is settled, recording the offset as an attribute would be very useful and provide observability into consumer behavior.

Possible solutions:

  1. the bare minimum: individual messaging systems specify custom attributes to record applicable behaviors above
  2. define generic attributes for settlement status (p2) and settled offset (p3)
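To make option 2 more concrete, here is a minimal sketch of what generic settlement attributes on a settle span might look like. The names `messaging.settlement.status` and `messaging.settlement.offset`, and the status values, are hypothetical placeholders chosen for this example; only `messaging.system` is an existing attribute.

```python
from enum import Enum


class SettlementStatus(Enum):
    """Hypothetical status values mirroring the outcomes listed in p2."""
    COMPLETED = "completed"          # processed successfully
    ABANDONED = "abandoned"          # transient failure, redelivery expected
    DEAD_LETTERED = "dead_lettered"  # terminal failure


def settle_span_attributes(system, status, offset=None):
    """Build the attribute dict for a settle span.

    `offset` is only set for systems that settle per offset (p3).
    """
    attrs = {
        "messaging.system": system,
        "messaging.settlement.status": status.value,  # hypothetical name
    }
    if offset is not None:
        attrs["messaging.settlement.offset"] = offset  # hypothetical name
    return attrs


# Per-message settlement: no offset is recorded.
per_message = settle_span_attributes("rabbitmq", SettlementStatus.ABANDONED)

# Offset-based settlement: the settled offset is recorded.
checkpoint = settle_span_attributes("kafka", SettlementStatus.COMPLETED, offset=42)
```

Keeping the offset optional lets the same attribute set cover both per-message and offset-based systems without forcing a value where none applies.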

Additional context:

p1 (expressing different settlement modes) seems like an additive change with a lower priority. p2 and p3 provide essential information about consumer behavior. Without them, instrumenting settlement calls does not seem useful.

pyohannes commented 1 year ago

From https://github.com/open-telemetry/oteps/pull/220#discussion_r1137764482:

wonder if it would be useful to mention that settle span represents different kinds of settlement: completion (ack), abandoning (nack), dead-lettering, etc

pyohannes commented 1 year ago

Also, the span kind of the settlement span needs to be discussed and specified. See discussion here: https://github.com/open-telemetry/oteps/pull/220/files#r1190323071

pyohannes commented 1 year ago

Triaged in the messaging workgroup.

We agreed that we want to have a means to convey the settlement intent (p2 from the description above) with the first stable version of messaging semantic conventions. A separate issue was submitted for this: https://github.com/open-telemetry/semantic-conventions/issues/431

Generic conventions for settlement offsets and the settlement type (checkpoint-based or per-message) can be tackled post-stability. We'll leave this issue in place for that and triage it as post-stability.