open-telemetry / opentelemetry-dotnet

The OpenTelemetry .NET Client
https://opentelemetry.io
Apache License 2.0
3.21k stars 759 forks source link

ForceFlush on BatchActivityExportProcessor does not guarantee Activities exported #4858

Open Paulskit opened 1 year ago

Paulskit commented 1 year ago

Bug Report

List of all OpenTelemetry NuGet packages and version that you are using:

<Project Sdk="Microsoft.NET.Sdk">

    <PropertyGroup>
        <OutputType>Exe</OutputType>
        <TargetFramework>net6.0</TargetFramework>
        <ImplicitUsings>enable</ImplicitUsings>
        <Nullable>enable</Nullable>
    </PropertyGroup>

    <ItemGroup>
        <PackageReference Include="OpenTelemetry" Version="1.6.0" />
        <PackageReference Include="OpenTelemetry.Instrumentation.Http" Version="1.5.1-beta.1" />
    </ItemGroup>

</Project>

Symptom

ForceFlush returns true, despite Activity export is still in progress and not finished.

What is the expected behavior?

ForceFlush should return true only when all activities was exported successfully.

What is the actual behavior?

image

Reproduce

using System.Diagnostics;
using OpenTelemetry;
using OpenTelemetry.Trace;

var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddHttpClientInstrumentation()
    .AddProcessor(new BatchActivityExportProcessor(new Exporter()))
    .Build()!;

using (tracerProvider)
{
    var httpClient = new HttpClient();
    await httpClient.GetStringAsync("https://api.nuget.org/v3/index.json");

    Log.WriteLine("ForceFlush invoked");
    var flushResult = tracerProvider.ForceFlush();
    Log.WriteLine("ForceFlush finished with result {0}", flushResult);
}

Log.WriteLine("TracerProvider Disposed");

class Exporter : BaseExporter<Activity>
{
    public override ExportResult Export(in Batch<Activity> batch)
    {
        Log.WriteLine("Start exporting {0} activities", batch.Count);
        foreach (var activity in batch)
        {   
            Log.WriteLine("Exporting Activity {0}", activity);
        }
        Task.Delay(3000).GetAwaiter().GetResult();
        Log.WriteLine("Finished exporting {0} activities", batch.Count);

        return ExportResult.Success;
    }
}

static class Log
{
    public static void WriteLine(string message, params object[] args)
    {
        Console.WriteLine($"[{DateTime.UtcNow.TimeOfDay}] {message}", args);
    }
}
pellared commented 1 year ago

You need to override the implementation of OnForceFlush:

https://github.com/open-telemetry/opentelemetry-dotnet/blob/fe78453c03feb8dbe506b2a0284312bdfa1367c5/src/OpenTelemetry/BaseExporter.cs#L135-L154

pellared commented 1 year ago

Also this is just a "base implementation" and it is up to the "concrete implementation" to do its best to adhere to the specification. Doing any logic would be a breaking change.

pellared commented 1 year ago

Also the docs says that one can implement OnForceFlush.

@Paulskit Do you see any area of improvement or can we close the issue?

Paulskit commented 1 year ago

Hey @pellared, thanks for taking a look. Exporter in the repro is just to showcase the problem. OTLP Exporter behaves the same way. My expectation, ForceFlush should return only after all activities in queue has been processed. I see other implementations in otel, where we rely on the same behaviour. Like this one in AWS Lambda: https://github.com/open-telemetry/opentelemetry-dotnet-contrib/blob/2960a9defbe35ae14acc3b0b00983d09b186d61a/src/OpenTelemetry.Instrumentation.AWSLambda/AWSLambdaWrapper.cs#L201

We use force flush to make sure all recorded activities has been exported before lambda completes.

Quick investigation shows that combination of 2 behaviours lead to this issue: 1) CIrcularBuffer implementation where items end up in removed right after we read them https://github.com/open-telemetry/opentelemetry-dotnet/blob/fe78453c03feb8dbe506b2a0284312bdfa1367c5/src/OpenTelemetry/Internal/CircularBuffer.cs#L159

2) BatchExportProcessor implementation where exit condition is based on removed items https://github.com/open-telemetry/opentelemetry-dotnet/blob/fe78453c03feb8dbe506b2a0284312bdfa1367c5/src/OpenTelemetry/BatchExportProcessor.cs#L194

We have to read items in order to process them. Once we read items, ForceFlush exits (exit condition is successful), despite export process might still be in progress.

Thus, I think we should fix either the OTLP Exporter or BatchExportProcessor to properly support ForceFlush.

karlssonsimon commented 4 months ago

We are also encountering this issue with short-lived cron jobs running in Kubernetes.

@Paulskit did you find any workaround?