open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0
241 stars 151 forks source link

Integrate OTEP-0220 Span structure for messaging scenarios into the messaging conventions #198

Closed joaopgrassi closed 9 months ago

joaopgrassi commented 1 year ago

The OTEP 220 which the messaging working group was focused on for the past months/year has been merged. We need to now integrate the text from there into the messaging specification.

Question still to answer: Do we want to integrate it with a "big bang" (bring all text in a single PR) or split it in portions so we send multiple, smaller PRs?

cc @lmolkova @pyohannes

lmolkova commented 1 year ago

Discussed at Messaging SIG on 7/20 - the suggestion is to break it down into multiple PRs.

Here's a proposal on how to split:

  1. Update receive operation:
    • update the definition
    • clarify links usage (SHOULD instead of MAY), describe temporary hack to postpone span creation
    • introduce link attributes
    • incorporate existing examples in the spec with ones in the OTEP, preserving link attributes
  2. Introduce deliver operation:
    • define it
    • change process to deliver when appropriate
    • callout difference between processing and delivery and clarify that processing is an application concern
    • remove formal process operation
    • incorporate existing examples in the spec with ones in the OTEP, preserving link attributes
  3. Introduce create operation
    • define it
    • describe batch-publishing
  4. Introduce settle operation - https://github.com/open-telemetry/semantic-conventions/issues/1162
  5. End-to-end examples and cleanup:
    • bring other examples from the OTEP
    • check that specific system, FaaS, and AWS Lambda examples are up to date

Each PR should add corresponding examples from the OTEP when possible.

pyohannes commented 1 year ago

Thanks @lmolkova, that looks good to me.

Some questions we can maybe discuss in today's workgroup call:

  • clarify links usage (SHOULD instead of MAY), describe temporary hack to postpone span creation

I would hold off with describing the hack, although that would be user-friendly. Firstly, it doesn't look that good and might not put a good light on the reputation of the project. Secondly, there's a danger that people take it as an argument against adding support for adding links after span creation time (as there would be a documented alternative solution).

Each PR should add corresponding examples from the OTEP when possible.

I'd also like to discuss how we proceed with examples. I'd rather minimize the use of examples, maybe we can defer to the OTEP? Initially they were very important, as they the conventions itself didn't contain much information about span structure. However, with the changes we're about to make here, the span structure should be apparent from the conventions itself, and examples should be less important.

joaopgrassi commented 1 year ago

I agree about the hack part but have some opposite view about the examples. I think they are helpful and complimentary with the text and I'd love to keep them also in the spec.

pyohannes commented 1 year ago

I'd also like to discuss how we proceed with examples.

From the WG discussion:

andrejpk commented 1 year ago

Looks great! I'm going to chew on this a bit more but a few quick thoughts:

  1. I like and understand the deliver vs create operation concept but on my first scan of the diagrams, deliver was confusing because it was in context of the receiving system. receive may be more obvious in context of the consumer (but that's probably an overloaded term so another choice might be better)
  2. When tuning traces on a recent project, it was extremely helpful to have parent map with causality and links for added context. There's a light suggestion to do so here but the examples should default to that.
  3. In diagram "A producer creates and publishes a single message, it is delivered as part of a batch of messages to a consumer", it looks like the traces all fan back through that deliver operation. I guess that's the best we can do if that whole batch is sent w one context but for an ops person trying to measure performance or track a problem, this fan-in is a major stumbling block. At the app level, it may be necessary to embed per-message tracing data to restore that 1:1 link over the batching operation. But if an instrumented library is handling that batching, then it would be a nice recommendation to parent Process m1 to Publish m1` so that doesn't fall on the app developer.
  4. I agree with @joaopgrassi -- the examples are very helpful. I suggest they are numbered if kept in the docs for easy linking and reference.