Open Dr-Emann opened 4 months ago
See also:
@Dr-Emann Do you think the instrumentation is fixable, or is the service untraceable?
Kinda neither: I believe the idea of automatic instrumentation of "the application-level processing of a message received from SQS" is not fixable, but I think the service is traceable, just not automatically at that level.
I think automatic instrumentation that attempts to capture time spent handling messages from SQS is impossible to get right in the general case, and the downsides of getting it wrong are unavoidable and disastrous (not "oops we just can't auto instrument that span", but "oops we corrupt all future spans emitted").
This is a strong statement, but I believe it to be true: the boto3sqs instrumentation is fundamentally flawed, and cannot be fixed.
The boto3sqs instrumentation attempts to automatically instrument code which consumes SQS messages. The way it does this is:
- It wraps the `receive_message` boto3 call, which retrieves up to 10 messages from the queue.
- When the `delete_message` boto3 function is called, it ends the corresponding span if it finds one for that message.

Issues
This leads to some issues (partly copied from https://github.com/open-telemetry/opentelemetry-python-contrib/pull/1599#issuecomment-1429657859):
- `delete_message`
- A shared `current_context_token` is used, so if multiple threads are performing SQS operations, they will overwrite the current token, threads will end each other's spans, etc.
- The otel context stuff is built heavily on the assumption that any context operations will be fully nested, not overlapping. Any context operations (like wrapping in a new span) around the `delete_message` call will severely break this assumption.

Note: This flaw is a fundamental consequence of the boto3sqs instrumentation attempting to use spans which are not lexically scoped around arbitrary calling code.
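
To make the nesting problem concrete, here is a minimal sketch using only the standard-library `contextvars` module, which is what the OpenTelemetry Python context is backed by by default, as far as I know. It is not the instrumentation's actual code: the variable name and the string "spans" are stand-ins I made up for illustration. It shows what happens when a token taken out before the user's span is reset while the user's span is still active.

```python
from contextvars import ContextVar

# Stand-in for the "current span" slot; all names here are illustrative only.
current_span = ContextVar("current_span", default=None)

# Instrumentation: a message is received, so its span becomes current.
message_token = current_span.set("message-span")

# User code: wraps the delete call in a span of its own.
user_token = current_span.set("user-span")

# Instrumentation: delete_message is called, so the message span is ended and
# its token is reset, even though it is no longer the innermost one.
current_span.reset(message_token)
inside_user_span = current_span.get()  # None: the user's span is no longer current

# User code: leaves its span, which restores whatever was current on entry.
current_span.reset(user_token)
after_everything = current_span.get()  # "message-span": a span that already ended

print(inside_user_span, after_everything)
```

Neither `reset` call raises, so nothing signals the problem at the time. The context is simply left pointing at a span that has already ended, and anything started afterwards would be parented to it.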
Some of these issues are just fixable bugs, but some of them (especially the last) I believe are fundamental, insurmountable issues with the very concept of attempting to automatically instrument this process.
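
For the shared-token issue from the list above, a deterministic sketch of the interleaving looks roughly like this. The class and its `on_receive` / `on_delete` methods are hypothetical and only mimic the shape of the problem. The only name taken from the real instrumentation is `current_context_token`, and the two "threads" are simulated by ordering the calls by hand so the outcome is reproducible.

```python
class SharedTokenInstrumentation:
    """Hypothetical stand-in that keeps one token for every caller."""

    def __init__(self):
        self.current_context_token = None
        self.ended = []  # (message deleted, token that was actually ended)

    def on_receive(self, message_id):
        # Starting a span for a message overwrites whatever token was stored.
        self.current_context_token = f"span-for-{message_id}"

    def on_delete(self, message_id):
        # Ending "the" span uses the stored token, whoever wrote it last.
        self.ended.append((message_id, self.current_context_token))
        self.current_context_token = None


inst = SharedTokenInstrumentation()
inst.on_receive("a")  # thread 1 receives message a
inst.on_receive("b")  # thread 2 receives message b, overwriting the token
inst.on_delete("a")   # thread 1 deletes a, but ends the span for b
inst.on_delete("b")   # thread 2 finds no token left for b

print(inst.ended)
```

Keeping a token per message or per thread would presumably avoid this particular overwrite, but it would not help with the overlapping-context problem.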