open-telemetry / opentelemetry-dotnet

The OpenTelemetry .NET Client
https://opentelemetry.io
Apache License 2.0
3.17k stars 752 forks source link

First `Activity` is always recorded with `default(ActivityContext)` #2806

Closed Aaronontheweb closed 2 years ago

Aaronontheweb commented 2 years ago

Bug Report

Using OpenTelemetry.Api 1.2.0-rc1 and the following packages:

<PackageReference Include="OpenTelemetry.Extensions.Hosting" Version="1.0.0-rc8" />
<PackageReference Include="OpenTelemetry.Instrumentation.AspNetCore" Version="1.0.0-rc8" />
<PackageReference Include="OpenTelemetry.Instrumentation.Http" Version="1.0.0-rc8" />
<PackageReference Include="OpenTelemetry.Instrumentation.SqlClient" Version="1.0.0-rc8" />
<PackageReference Include="OpenTelemetry.Contrib.Instrumentation.EntityFrameworkCore" Version="1.0.0-beta2" />

Runtime version: ASP.NET Core 3.1 / netcore3.1

Tracer configuration looks like this:

services.AddOpenTelemetryTracing(builder =>
{
    builder.Configure((provider, providerBuilder) =>
    {
        var traces = provider.GetRequiredService<List<Activity>>();
        var exporter = new InMemoryExporter<Activity>(traces);
        var processor = new SimpleActivityExportProcessor(exporter);
        builder
            .AddAspNetCoreInstrumentation()
            .AddHttpClientInstrumentation()
            .AddSqlClientInstrumentation()
            .AddEntityFrameworkCoreInstrumentation()
            .AddSource("Akka.NET")
            .AddProcessor(processor);
    });
});

Symptom

When executing an integration test against our custom OTel instrumentation for Akka.NET, we have the following reproduction specification:

[Collection("AspNet")]
public class AspDotNetClusteredRouterIntegrationSpec : IClassFixture<CustomWebApplicationFactory<ClusterStartup>>
{
    private readonly ITestOutputHelper _output;
    private readonly CustomWebApplicationFactory<ClusterStartup> _factory;

    public AspDotNetClusteredRouterIntegrationSpec(CustomWebApplicationFactory<ClusterStartup> factory, ITestOutputHelper output)
    {
        _factory = factory;
        _output = output;
    }

    /// <summary>
    /// Bug reproduction for https://github.com/petabridge/phobos/issues/619
    /// </summary>
    [Fact]
    public async Task ShouldTraceFromControllerToChr()
    {
        // Arrange
        using var client = _factory.CreateClient();

        // Act
        var response = await client.GetAsync("test/cluster-router");

        // Assert
        response.EnsureSuccessStatusCode(); // Status Code 200-299

        var sp = _factory.Services;

        await TestHelpers.AwaitSingleTrace(sp, 3, _output);
    }
}

And we use the following assertion method to assert that all traces gathered underneath this scenario and others like it are all correctly correlated together:

public static class TestHelpers
{
    public static async Task AwaitSingleTrace(IServiceProvider sp, int spanCount, ITestOutputHelper output)
    {
        var retryCount = 5;
        while (retryCount > 0)
        {
            try
            {
                var spans = sp.GetRequiredService<List<Activity>>();
                //spans.Count(c => c.Context.TraceId != default).Should().Be(spanCount);
                spans.Count.Should().Be(spanCount);

                // validate that all spans are part of the same trace
                spans.Select(x => x.Context.TraceId).ToImmutableHashSet().Count.Should().Be(1);
                break;
            }
            catch (Exception)
            {
                retryCount--;
                if (retryCount == 0)
                {
                    var spans = sp.GetRequiredService<List<Activity>>();
                    output.WriteLine("Expected to have single traceId, but instead: [{0}]",
                        string.Join(",", spans.Select(x => x.Context.TraceId)));
                    foreach (var s in spans)
                    {
                        output.WriteLine(
                            "Operation: {0}, TraceId: {1}, SpanId: {2}, ParentId: {3}, References: {4}",
                            s.OperationName, s.Context.TraceId,
                            s.Context.SpanId, s.ParentId,
                            string.Join(",", s.Links.Select(x => $"{x.Context}")));
                    }

                    throw;
                }

                else
                    await Task.Delay(500);
            }
        }
    }

This assertion always fails for the first test in this assembly with the following exception:

Expected spans.Select(x=>x.Context.TraceId).ToImmutableHashSet().Count to be 1, but found 2.

And then this appears in the logs:

Expected to have single traceId, but instead: [b6da161aacdf4094bedadabddc28cfab,b6da161aacdf4094bedadabddc28cfab,b6da161aacdf4094bedadabddc28cfab,00000000000000000000000000000000]
Operation: akka.msg.recv String, TraceId: b6da161aacdf4094bedadabddc28cfab, SpanId: 0adf7809a48ea73b, ParentId: 00-b6da161aacdf4094bedadabddc28cfab-db22a012992ddc5f-01, References: 
Operation: akka.actor.ask string, TraceId: b6da161aacdf4094bedadabddc28cfab, SpanId: db22a012992ddc5f, ParentId: 00-b6da161aacdf4094bedadabddc28cfab-7f8f25c1adab086e-01, References: 
Operation: Index, TraceId: b6da161aacdf4094bedadabddc28cfab, SpanId: 7f8f25c1adab086e, ParentId: |5a134166-49c7fa1988cab143., References: 
Operation: Microsoft.AspNetCore.Hosting.HttpRequestIn, TraceId: 00000000000000000000000000000000, SpanId: 0000000000000000, ParentId: , References:

It appears as though the Microsoft.AspNetCore.Hosting.HttpRequestIn trace, the first in this test, is set to default(ActivityContext). If I attempt to do a manual activity using the OpenTelemetry.Api.Trace.Tracer methods, that will also show up with default(ActivityContext) under this method - so it's not the ASP.NET Core instrumentation that is at fault here.

I've been able to work around this issue by generating a fake span at the start of each test and then purging it from my in-memory collection:

// part of my Startup.cs class for arranging these integration tests
public void Configure(IApplicationBuilder app, IWebHostEnvironment env)
{
    app.UseRouting();
    app.UseEndpoints(c => c.MapControllers());

    using (var scope = app.ApplicationServices.CreateScope())
    {
        var services = scope.ServiceProvider;

        // BUGFIX
        var context = services.GetRequiredService<TracerPayload>();
        using (var trace = context.Tracer.StartRootSpan("TEST"))
        {

        }

        context.Provider.ForceFlush();
        context.Reset();
    }
}

What is the expected behavior?

All completed spans should have a valid ActivityContext that includes the correct TraceId and SpanIds.

What is the actual behavior?

The first span is default(ActivityContext)

Reproduce

I believe the above unit test should be reproducible, but I can contribute a clean version of that to the project if it'd be helpful.

cijothomas commented 2 years ago

Could you try the following lines well before the 1st activity is even triggered? In .NET Core 3.1, I think Activity uses the legacy id formats to be back-compact, so want to rule that out first.

Activity.DefaultIdFormat = ActivityIdFormat.W3C;
Activity.ForceDefaultIdFormat = true;

If this doesn't help, yes a simple unit test showing the issue would really help.

Aaronontheweb commented 2 years ago

I'll give that a try real fast.

Aaronontheweb commented 2 years ago

Can confirm that this resolved the issue.

Aaronontheweb commented 2 years ago

Although, not sure if that's something that should still be patched or not

cijothomas commented 2 years ago

SDK already does this inside a static ctor. https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry/Sdk.cs#L38-L39

But there are cases where activity is created even before the OTel SDK is initialized, requiring manually forcing this (as you did to recover from this issue.)

Only an issue for .NET Core 3.1, as .NET 5 onwards, W3CTraceContext format is the default.