open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0

Messaging: should receive spans be CLIENT? #1366

Open lmolkova opened 3 weeks ago

lmolkova commented 3 weeks ago

Receive spans describe pulling messages from a topic/queue.

E.g., an AWS SQS example looks like this:

List<Message> messages = sqs.receiveMessage(queueUrl).getMessages();

Kafka example

ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));

This operation fits the vague CLIENT span definition: it's a logical client call to a remote service. It's initiated by the application itself, ends once the corresponding method returns the received messages, and does not account for any message handling or processing time.
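To make the span boundaries concrete, here is a minimal sketch of a receive span that starts right before the pull call and ends as soon as it returns. Everything in it is hypothetical: `ReceiveSpanSketch`, `SpanRecord`, `Traced`, and `tracedReceive` are stand-ins written for this illustration, not OpenTelemetry APIs, and the lambda stands in for `sqs.receiveMessage(...)` or `consumer.poll(...)`. Error handling is left out for brevity.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a tracing API; none of these types come from OpenTelemetry.
final class ReceiveSpanSketch {
    enum Kind { CLIENT, CONSUMER }

    // Minimal span record: kind plus monotonic start/end timestamps in nanoseconds.
    static final class SpanRecord {
        final Kind kind;
        final long startNanos;
        final long endNanos;

        SpanRecord(Kind kind, long startNanos, long endNanos) {
            this.kind = kind;
            this.startNanos = startNanos;
            this.endNanos = endNanos;
        }

        long durationNanos() {
            return endNanos - startNanos;
        }
    }

    // Pairs the received value with the span that covered the pull call.
    static final class Traced<T> {
        final T value;
        final SpanRecord span;

        Traced(T value, SpanRecord span) {
            this.value = value;
            this.span = span;
        }
    }

    // Starts the span right before the pull and ends it as soon as the call returns.
    static <T> Traced<T> tracedReceive(Kind kind, Supplier<T> receiveCall) {
        long start = System.nanoTime();
        T value = receiveCall.get();
        long end = System.nanoTime();
        return new Traced<>(value, new SpanRecord(kind, start, end));
    }

    public static void main(String[] args) {
        // The lambda stands in for sqs.receiveMessage(...) or consumer.poll(...).
        Traced<List<String>> traced = tracedReceive(Kind.CLIENT, () -> List.of("m1", "m2"));

        // Handling happens here, after the receive span has already ended.
        for (String message : traced.value) {
            System.out.println("processing " + message);
        }
        System.out.println("receive span kind=" + traced.span.kind
                + " durationNanos=" + traced.span.durationNanos());
    }
}
```

The processing loop runs outside the measured interval, so the span's duration and any error it records describe only the call to the broker.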

But we currently specify that receive spans should be CONSUMER - https://github.com/open-telemetry/semantic-conventions/blob/3c16c802e8ae8849ae0cf31eac02c3cabf64e4dd/docs/messaging/messaging-spans.md?plain=1#L213

Why is it CONSUMER?

The receive operation is the only messaging span that instrumentation libraries can guarantee to be created on the consumer side when messages are pulled.

If a higher-level framework (such as Spring or Apache Camel) is used to process messages, it may create processing SERVER spans; otherwise they may be created by user applications.

The CONSUMER kind on the receive spans

See https://github.com/open-telemetry/oteps/blob/main/text/trace/0220-messaging-semantic-conventions-span-structure.md#span-kind for the context

lmolkova commented 3 weeks ago

I think we have two options:

Option 1: Keep CONSUMER span kind

We'd need to add more wiggle room to the already vague span kind definition to make this more legitimate.

By using CONSUMER we create ambiguity: the receive span does not describe an external request, its latency does not represent processing duration, and its errors don't represent processing errors. But any tool that makes generic assumptions based on the span kind alone will think that it describes message consumption.

Option 2: Use CLIENT kind

Possible drawbacks:

We can try to address any possible drawbacks with additional semantics:


My proposal is to do Option 2.

Applications that only report receive spans have poor observability: they need to instrument message consumption anyway. We're trying to cover that up by reporting a CONSUMER span, but it does not solve the bigger problem.

joaopgrassi commented 2 weeks ago

We discussed this in the meeting on 30-08-2024 and reached the consensus to use CLIENT for the receive span and keep CONSUMER for when process spans are created.
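The consensus above can be sketched as a small lookup from messaging operation to span kind. This is only an illustration: `MessagingSpanKinds`, `Operation`, and `kindFor` are hypothetical names, the enums are local stand-ins and not the OpenTelemetry `SpanKind` type, and only the two operations mentioned in this thread are covered.

```java
// Hypothetical helper mirroring the consensus in this thread; not part of any OpenTelemetry library.
final class MessagingSpanKinds {
    // Local stand-in enum, limited to the two kinds under discussion.
    enum SpanKind { CLIENT, CONSUMER }

    // Only the operations discussed in this issue.
    enum Operation { RECEIVE, PROCESS }

    // receive -> CLIENT (a pull call to the broker), process -> CONSUMER (handling of a message).
    static SpanKind kindFor(Operation operation) {
        switch (operation) {
            case RECEIVE:
                return SpanKind.CLIENT;
            case PROCESS:
                return SpanKind.CONSUMER;
            default:
                throw new IllegalArgumentException("unmapped operation: " + operation);
        }
    }

    public static void main(String[] args) {
        for (Operation operation : Operation.values()) {
            System.out.println(operation + " -> " + kindFor(operation));
        }
    }
}
```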

pyohannes commented 1 week ago

Given changes in https://github.com/open-telemetry/opentelemetry-specification/pull/4178, this makes sense.

With those changes, we don't see the consumer span as the end point of an asynchronous communication channel (from the point of view of application code), but as "processing of an operation initiated by a producer".

This brings some limitations, but reduces ambiguity.