Dr-Emann commented 4 months ago

This is a strong statement, but I believe it to be true: the boto3sqs instrumentation is fundamentally flawed, and cannot be fixed.

The Boto3-sqs instrumentation attempts to automatically instrument code which attempts to use SQS messages. The way it does this is:

Messages are recieved in a batch via the recieve_message boto3 call, which retrieves up to 10 messages from the queue
It starts a span for every message received in the batch
When the message is accessed (via indexing the returned list of messages, or iterating over the returned list of messages), the span created for that message is activated
When the delete_message boto3 function is called, it ends the corresponding span if it finds one for that message

Issues

This leads to some issues (partly copied from https://github.com/open-telemetry/opentelemetry-python-contrib/pull/1599#issuecomment-1429657859):

All spans for a batch of messages are started at the same time, rather than starting when processing for that individual message occcurs
To determine the end of a span the instrumentation relies that user code calls sqs.delete_message API. In case of an exception in the user code delete_message might never be called, leaving behind an unended span. Not calling delete_message should be fine. It may be that this application sends the message elsewhere, and that other component is in charge of delete the message, for instance. In this case, memory use will grow without bound!
There are many ways for user code to not match the expected use of boto3 which will result in various levels of misbehavior. For example:
1. Using multiprocessing to handle each message in parallel: a totally different process will end up calling delete_message
2. As discussed above, not sending the delete_message call will lead to unbounded memory growth
3. Iterating the list of messages multiple times, or creating a new dict/list from the list passed, slicing the returned list
A single global current_context_token is used, if multiple threads are performing SQS operations, they will overwrite the current token, threads will end each others spans, etc.

The otel context stuff is built heavily on the assumption that any context operations will be fully nested, not overlapping. Any context operations (like wrapping in a new span) around the delete_message call will severely break this assumption

response = client.receive_message(
   QueueUrl="http://sqs.us-east-1.host.docker.internal:4566/000000000000/fpc-events-local.fifo",
   MaxNumberOfMessages=10,
)
messages = response.get("Messages", [])
# Indexing into the messages attaches a new context (1) with a new span as current, and remembers the previous context (0)
message = messages[0]
# start_as_current_span attaches a new context (2) with a new span as current and remembers the previous context (1)
with tracer.start_as_current_span("inner_span"):
   # On message delete, the current context is reset to before the indexing (0)
   client.delete_message(
       QueueUrl="http://sqs.us-east-1.host.docker.internal:4566/000000000000/fpc-events-local.fifo",
       ReceiptHandle=message["ReceiptHandle"],
   )
# On exit of the with block, the current context is reset to before the start_as_current_span (1)
# The current context now has the span for the `messages[0]` as the current span, all future spans
# will use that as their parent

Note: This is a fundamental flaw of the fact that the boto3sqs instrumentation attempts to use spans which are not lexically scoped, around arbitrary calling code.

Some of these issues are just fixable bugs, but some of them (especially the last) I believe are fundamental, insurmountable issues with the very concept of attempting to automatically instrument this process.

Dr-Emann commented 4 months ago

1702

1703

1704

xrmx commented 3 months ago

@Dr-Emann Do you think is the instrumentation fixable or the service is untraceable?

Dr-Emann commented 3 months ago

Kinda neither: I believe the idea of automatic instrumentation of "the application-level processing of a message received from SQS" is not fixable, but I think the service is traceable, just not automatically at that level.

I think automatic instrumentation attempting to capture time spent handling messages from SQS impossible to get right in the general case, and downsides of doing it wrong are unavoidable, and disastrous (not "oops we just can't auto instrument that span", but "oops we corrupt all future spans emitted").

open-telemetry / opentelemetry-python-contrib

The boto3sqs instrumentation is broken and can't be fixed #2509

Issues

1702

1703

1704