Selective Http Trace and Baggage Propagation

martinjt commented 3 weeks ago

Looking for some feedback on a change to the HttpClient instrumentation package I've been playing with.

Ultimately, I'm trying to get to a point where you can choose which HttpClient's will have trace/baggage propagation and which won't. With the ultimate goal of stopping trace/baggage leakage to domains you don't control.

I originally thought about domain allow/deny lists, but I'm not sure that's the right approach. So then I pivoted to adding a handler instead.

Right now, TracePropagation happens in 2 places. The DiagnosticHandler added by the runtime The HttpClient instrumentation package listening to a Diagnostic source that is written to by 1

You can disable 1 by using the following 2 switches:

AppContext.SetSwitch("DOTNET_SYSTEM_NET_HTTP_ENABLEACTIVITYPROPAGATION", false);
AppContext.SetSwitch("System.Net.Http.EnableActivityPropagation", false);

This has the effect of disabling propagation for the entire application. Further, it also stops the generation of the outbound HTTP span.

In an ideal world, you'd be able to do this at a HttpClient level, and selective choose:

No span and no propagation
No span but propagate
Span, but no propagation
Span and Propagate

The scenario where this is useful is where you're using a third party API which you want to track the performance of, but you don't want them to receive baggage and trace details.

What I've been toying with is adding something to the Http Instrumentation package that allows you to "turn off" instrumentation, then add it back it in incrementally.

builder.Services.AddOpenTelemetry()
    .WithTracing(tracingOptions => tracingOptions
        .AddHttpClientInstrumentation(httpOptions => {
            httpOptions.DisableAutoPropagation = true;
        })

Then have 2 handlers such that you can do..

builder.Services.AddHttpClient<WeatherApiClient>(client => client.BaseAddress = new("http://backend"))
    .AddHttpMessageHandler<TracingHandlerWithPropagation>();

builder.Services.AddHttpClient<WeatherApiClient>(client => client.BaseAddress = new("http://backend"))
    .AddHttpMessageHandler<TracingHandlerWithoutPropagation>();

Then add that an extension like

builder.Services.AddHttpClient<WeatherApiClient>(client => client.BaseAddress = new("http://backend"))
    .WithOpenTelemetry(options => {
        options.PropagationEnabled = true;
        options.Tracked = true;
    });

(Names TBC)

The issue I see right now is that the DiagnosticListener is the way we listen to the http calls, and that also does propagation.

So I could make it so that Handler generates the Activity directly, and ignore the diagnostic listener. In effect the diagnostic listener would only be relevant for "Auto instrumented" stuff coming out of the runtime. The alternative is to have the handler only generate diagnostics when it should be traced, and not in other scenarios. Or I use a different diagnostic name.

So is this something worthwhile for people? is the approach viable?

CodeBlanch commented 3 weeks ago

Couldn't you do this?

    public async Task<string> CallThirdPartyServiceAndReturnResponse(Uri requestUri)
    {
        // Turn off flow of Activity & Baggage for this ExecutionContext
        using var flow = ExecutionContext.SuppressFlow();

        // Make the call with no trace or baggage to inject
        return await this.httpClient.GetStringAsync(requestUri);
    }

The nice thing about that is it works (in theory) for all things not just HttpClient. Same thing but calling some message queue:

    public async Task PostMessageToThirdPartyService(Uri requestUri)
    {
        // Turn off flow of Activity & Baggage for this ExecutionContext
        using var flow = ExecutionContext.SuppressFlow();

        // Post message with no trace or baggage to inject
        await this.myAwesomeMessageService.PostMessageAsync(requestUri);
    }

SuppressFlow can be kind of wonky and doesn't work if everything ends up executing synchronously. That might need like Task.Yield before the await to be 100% solid. But conceptually, this is how you turn off flow of AsyncLocal which I think might be helpful for this case.

martinjt commented 3 weeks ago

From what I can tell from testing this, it does stop activity from being created, however, the runtime code does not honor that and still propagates the trace data.

martinjt commented 3 weeks ago

So ultimately, the issue is that I need to disable the runtime propagation, but if I do that (with the switches), then the Http instrumentation doesn't kick in, because it's based on the diagnostic listener that's generated by the DiagnosticHandler which the switches disable.

CodeBlanch commented 3 weeks ago

Ya it will still create an Activity \ span but it shouldn't be parented to the current trace it will be a new root. So you won't leak the internal trace/baggage to the third party. Wasn't that the goal?

If you wanted to stop the Activity \ span from being created completely, really the OTel way to do that is inside the sampler. The problem is though HttpClient isn't passing any tags into the sampler so there isn't really anyway to drop just some specific destination 😢 IIRC the spec says there are some standard tags which should be supplied at sample time it just isn't part of runtime logic yet.

You could probably do something like this:

// At call site
using var scope = MySampler.SuppressActivityCreation();

await httpClient.DoSomethingAsync();

// Set as OTel sampler
internal sealed class MySampler : Sampler
{
    private static readonly AsyncLocal<bool> SuppressActivity = new AsyncLocal<bool>();
    private readonly Sampler innerSampler;

    public MySampler(Sampler innerSampler)
    {
        this.innerSampler = innerSampler ?? throw new ArgumentNullException(nameof(innerSampler));
    }

    public static IDisposable SuppressActivityCreation()
    {
        var previousValue = SuppressActivity.Value;

        SuppressActivity.Value = true;

        return new MySamplerScope(previousValue);
    }

    public override SamplingResult ShouldSample(in SamplingParameters samplingParameters)
    {
        if (SuppressActivity.Value)
        {
            return new SamplingResult(SamplingDecision.Drop);
        }

        return this.innerSampler.ShouldSample(in samplingParameters);
    }

    private sealed class MySamplerScope : IDisposable
    {
        private readonly bool previousValue;

        public MySamplerScope(bool previousValue)
        {
            this.previousValue = previousValue;
        }

        public void Dispose()
        {
            SuppressActivity.Value = this.previousValue;
        }
    }
}

I didn't test any of that just an idea!

lmolkova commented 3 weeks ago

I think there are two fundamental problems:

baggage flows everywhere and you can't control it unless it's enableable (per endpoint/request/etc).
different endpoints have different context propagation formats

Having a propagator configurable per endpoint might solve both. Also, it's not a unique .NET problem - all languages have to some extent.

We can try to extend Propagator spec so that propagator knows the endpoint/some other context it's propagating to It probably could be done today with some duct tape: custom baggage propagator and AsyncLocal

class ConfigurablePropagator: Propagator {
   public static readonly AsyncLocal<string> Endpoint = new AsyncLocal<string>(); 

   public override void Inject(...) {
       propagators[Endpoint.Value].Inject(...)
   }
}

DistributedContextPropagator Current = DistributedContextPropagator.CreateNoOutputPropagator(); // I hope it works
Sdk.SetDefaultTextMapPropagator(new ConfigurablePropagator(....));

ConfigurablePropagator.Endpoint = "https://external-no-baggage-prop.com";
await httpClient.SendAsync()...
ConditionalPropagator.Endpoint = null;

UPDATE: I assume it should be possible to implement a customizeable DistributedContextPropagator and give .NET a custom propagator.

martinjt commented 3 weeks ago

Ya it will still create an Activity \ span but it shouldn't be parented to the current trace it will be a new root.

@CodeBlanch I'm saying that isn't happening. The propagation happens at the DiagnosticHandler level in the runtime. It's not even touching otel. So it's not creating the Activity, but is still propagating the context.

it's not a unique .NET problem - all languages have to some extent.

JS and go, you have to add the propagation handler from otel to the specific HTTP client. Since dotnet does this in the runtime, there's no way to turn it off. It happens even if you don't have otel installed at all.

martinjt commented 3 weeks ago

This is what we get with the execution flow applied. You'll notice that the second row shows that the backend has used the propagation context from the first.

What we need is the ability to Disable the propagation part, based on a handler of some sort.

What the execution flow IS doing is making the span not inherit the parent, and making propagation be based on the root. Which doesn't give the desired outcome

open-telemetry / opentelemetry-dotnet-contrib

Selective Http Trace and Baggage Propagation #1853