open-telemetry / opentelemetry-dotnet-contrib

This repository contains set of components extending functionality of the OpenTelemetry .NET SDK. Instrumentation libraries, exporters, and other components can find their home here.
https://opentelemetry.io
Apache License 2.0
476 stars 283 forks source link

[bug] AWS instrumentation not adding request specific info on requests *not* sampled #2345

Open cfbao opened 3 days ago

cfbao commented 3 days ago

Component

OpenTelemetry.Instrumentation.AWS

Package Version

Package Name Version
OpenTelemetry.Extensions.Hosting 1.10.0
OpenTelemetry.Instrumentation.AWS 1.10.0-beta.1

Runtime Version

net9.0

Description

Context propagation should happen on all traces, even ones that are not sampled, so downstream services know to not sample their portion of the workflow too.

This isn't happening with OpenTelemetry.Instrumentation.AWS. E.g. SQS messages from an operation won't get the traceparent message attribute if that operation isn't sampled.

Steps to Reproduce

using Amazon.SQS;
using Amazon.SQS.Model;
using OpenTelemetry.Trace;

var builder = Host.CreateApplicationBuilder(args);
builder.Services.AddOpenTelemetry().WithTracing(builder => builder
    .AddAWSInstrumentation(o => o.SuppressDownstreamInstrumentation = true)
    .SetSampler(new TraceIdRatioBasedSampler(0.5))
    .AddSource("SqsOtel"));
builder.Services
    .AddSingleton<IAmazonSQS, AmazonSQSClient>()
    .AddHostedService<Worker>()
var host = builder.Build();
host.Run();

class Worker(IAmazonSQS sqs) : BackgroundService
{
    private const string QueueUrl = "<my-queue-url>";

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            var sendResponse = await sqs.SendMessageAsync(new()
            {
                QueueUrl = QueueUrl,
                MessageBody = "Hello from SQS!",
            }, stoppingToken);

            Console.WriteLine($"""
                sent:
                    {sendResponse.MessageId}
                """);

            var receiveResponse = await sqs.ReceiveMessageAsync(new ReceiveMessageRequest
            {
                QueueUrl = QueueUrl,
                MaxNumberOfMessages = 10,
                WaitTimeSeconds = 20,
                MessageAttributeNames = ["All"],
            });

            foreach (var msg in receiveResponse.Messages)
            {
                Console.WriteLine($"""
                    received:
                        {msg.MessageId}
                        {msg.MessageAttributes.Count}
                        {msg.MessageAttributes.GetValueOrDefault("traceparent")?.StringValue}
                    """);

                await sqs.DeleteMessageAsync(new()
                {
                    QueueUrl = QueueUrl,
                    ReceiptHandle = msg.ReceiptHandle,
                }, stoppingToken);
            }

            await Task.Delay(1000, stoppingToken);
        }
    }
}

Expected Result

We see a "traceparent" attribute in the message for all messages regardless of sampling decision, with console output like this:

sent:
        07252ddd-82c8-494a-8c62-3f139759e53f
received:
        07252ddd-82c8-494a-8c62-3f139759e53f
        1
        00-e4d6f996fd72ce140f1477be875ff351-a9a387e9cf8d2ef2-00
sent:
        5b0cf5bb-9224-46c3-b637-8586703838b9
received:
        5b0cf5bb-9224-46c3-b637-8586703838b9
        1
        00-cb7fefed64c41d386b7ea4d0062ef93a-0eed7d24a469dab6-01

Notice the "00" suffix in the first traceparent value, indicating that it's not sampled

Actual Result

We don't see message attributes on requests that are not sampled. Console output is like this:

sent:
        cfd3b57a-06d4-4bd8-8039-2ec4e8acab33
received:
        cfd3b57a-06d4-4bd8-8039-2ec4e8acab33
        1
        00-12a0dfe5dbe6bc8f7a24432c614dec1b-d8a549c0140fa810-01
sent:
        9af046d7-93a7-42aa-a6d2-aecb823240b0
received:
        9af046d7-93a7-42aa-a6d2-aecb823240b0
        0

sent:
        9cbf8764-3e57-466e-aa49-8afd78d33410
received:
        9cbf8764-3e57-466e-aa49-8afd78d33410
        0

sent:
        cc638f17-d8b8-4140-b506-e8d04f46bd9f
received:
        cc638f17-d8b8-4140-b506-e8d04f46bd9f
        1
        00-355f6b12c9cba9177a1dcccb7939a5bc-9746e7ca070cffb4-01

Additional Context

.NET doesn't really make it easy to propagate context on traces that are not sampled. (HttpClient has its own complex logic to deal with this)

There is discussion on this general problem here: https://github.com/open-telemetry/opentelemetry-dotnet/issues/2887

I've added my thoughts on the best practice pattern here, with reference to a sample app in opentelemetry-dotnet: https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/examples/MicroserviceExample

github-actions[bot] commented 3 days ago

Tagging component owner(s).

@srprash @ppittle @muhammad-othman @rypdal @Oberon00