Messaging: should receive spans be CLIENT?

lmolkova commented 3 weeks ago

Receive spans describe pulling messages from a topic/queue.

E.g. AWS SQS example looks like

List<Message> messages = sqs.receiveMessage(queueUrl).getMessages();

Kafka example

ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));

This operation fits into a vague CLIENT span definition - it's a logical client call to the remote service. It's initiated by the application itself, ends once the corresponding method returns the received messages, and does not account for any message handling or processing time.

But we currently specify that receive spans should be CONSUMER - https://github.com/open-telemetry/semantic-conventions/blob/3c16c802e8ae8849ae0cf31eac02c3cabf64e4dd/docs/messaging/messaging-spans.md?plain=1#L213

Why is it CONSUMER?

The receive operation is the only messaging span that instrumentation libraries can guarantee to be created on the consumer side when messages are pulled.

If a higher-level framework is used to process messages (such as Spring or Apache Camel), it may create processing SERVER spans; otherwise they may be created by user applications.

The CONSUMER kind on the receive spans:

- describes message flow direction - from broker to service (rather than call direction from service to broker)
- provides an indication to tracing tools that links on this span represent incoming messages

See https://github.com/open-telemetry/oteps/blob/main/text/trace/0220-messaging-semantic-conventions-span-structure.md#span-kind for the context

lmolkova commented 3 weeks ago

I think we have two options:

Option 1: Keep `CONSUMER` span kind

We'd need to add more wiggle room to the already vague span kind definition to make this more legitimate.

By using CONSUMER we create ambiguity: the receive span does not describe an external request, its latency does not represent processing duration, and its errors don't represent processing errors. But any tool that makes generic assumptions based on the span kind alone will think that it describes message consumption.

Option 2: Use `CLIENT` kind

Possible drawbacks:

- Some consumer applications will not have any CONSUMER or SERVER spans, i.e. service maps will not detect any incoming calls to the service. This could also happen in other cases (when there is no server instrumentation), so tracing systems should be prepared for it.
- There will be no CONSUMER span matching PRODUCER spans. That also does not seem like a trace visualization/analysis problem.

We can try to address any possible drawbacks with additional semantics:

- We already capture the messaging.operation.type = receive attribute, so messaging-aware visualizations/queries should be able to special-case it.
- Assuming we need a generic solution, we can look into alternatives such as adding a span link direction.
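
To make the first point concrete, here is a rough sketch of how a messaging-aware tool could special-case receive spans once they are CLIENT. It does not use the OpenTelemetry API: `SpanKind`, `SpanData` and `linksAreIncomingMessages` are hypothetical stand-ins invented for illustration, and only the attribute name `messaging.operation.type` and its value `receive` come from the conventions discussed here.

```java
import java.util.Collections;
import java.util.Map;

final class IncomingMessageSpans {
    // Local stand-in for span kinds; not the OpenTelemetry SpanKind type.
    enum SpanKind { CLIENT, SERVER, PRODUCER, CONSUMER, INTERNAL }

    // Hypothetical, minimal view of a finished span: kind plus string attributes.
    static final class SpanData {
        final SpanKind kind;
        final Map<String, String> attributes;

        SpanData(SpanKind kind, Map<String, String> attributes) {
            this.kind = kind;
            this.attributes = attributes;
        }
    }

    // Assumption: a tool treats links on a span as incoming messages if the span
    // is CONSUMER, or if it is a CLIENT span marked as a messaging receive.
    static boolean linksAreIncomingMessages(SpanData span) {
        if (span.kind == SpanKind.CONSUMER) {
            return true;
        }
        return span.kind == SpanKind.CLIENT
                && "receive".equals(span.attributes.get("messaging.operation.type"));
    }

    public static void main(String[] args) {
        SpanData receive = new SpanData(SpanKind.CLIENT,
                Collections.singletonMap("messaging.operation.type", "receive"));
        SpanData plainClient = new SpanData(SpanKind.CLIENT,
                Collections.<String, String>emptyMap());

        System.out.println("receive span:      " + linksAreIncomingMessages(receive));
        System.out.println("plain CLIENT span: " + linksAreIncomingMessages(plainClient));
    }
}
```

A tool that looks only at the span kind would miss the first span; checking the attribute is what recovers it.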

My proposal is to do Option 2.

Applications that only report receive spans have poor observability - they need to instrument message consumption anyway. We're trying to cover it up by reporting a CONSUMER span, but it does not solve the bigger problem.

joaopgrassi commented 2 weeks ago

We discussed this in the meeting on 30-08-2024 and reached the consensus to use CLIENT for the receive span and keep CONSUMER for when process spans are created.
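
As a rough illustration of that outcome (not spec text), the mapping could be sketched as below. `kindFor` and the local `SpanKind` enum are hypothetical helpers, and using `process` as the operation type for process spans is an assumption; other operation types are deliberately left unmapped because this thread does not decide them.

```java
import java.util.Optional;

final class ReceiveSpanKindSketch {
    // Local stand-in for span kinds; not the OpenTelemetry SpanKind type.
    enum SpanKind { CLIENT, SERVER, PRODUCER, CONSUMER, INTERNAL }

    // Hypothetical helper: span kind per messaging operation type, covering only
    // the two cases agreed on in this thread.
    static Optional<SpanKind> kindFor(String operationType) {
        switch (operationType) {
            case "receive":
                // Pulling messages is a client call from the application to the broker.
                return Optional.of(SpanKind.CLIENT);
            case "process":
                // Assumed operation type for process spans, which stay CONSUMER.
                return Optional.of(SpanKind.CONSUMER);
            default:
                // Not covered by this discussion.
                return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println("receive -> " + kindFor("receive").get());
        System.out.println("process -> " + kindFor("process").get());
    }
}
```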

pyohannes commented 1 week ago

Given changes in https://github.com/open-telemetry/opentelemetry-specification/pull/4178, this makes sense.

With those changes, we don't see the consumer span as the end point of an asynchronous communication channel (from the point of view of application code), but as "processing of an operation initiated by a producer".

This brings some limitations, but reduces ambiguity.
