graphql-dotnet / graphql-dotnet

GraphQL for .NET
https://graphql-dotnet.github.io
MIT License
5.86k stars 923 forks source link

GraphQLTelemetryProvider improvements #3847

Closed gao-artur closed 10 months ago

gao-artur commented 10 months ago

Filter: In the current implementation, GraphQLTelemetryOptions.Filter acts more like Skip or Suppress. It only skips the GraphQL span creation, but all the following spans will be created as usual. The Filter option is usually used to "cut" the span creation from this point, so, for example, if the graph resolver makes a DB roundtrip, this won't be recorded. See my implementation in the Hangfire instrumentation, for example (which is based on how this feature is implemented in other instrumentations). When is this useful? For example, we have a few mutations that, in some cases, can create a huge trace (20k+ spans). Parts of these traces can be dropped because of memory limits, breaking the trace and making it useless. We want to be able to filter these traces and not create them at all based on mutation names.

EnrichWith*: Currently, EnrichWith* delegates are executed regardless of the Activity.IsAllDataRequested state which is a waste of CPU. When the IsAllDataRequested = false, the span (activity) won't be recorded anyway, so adding tags is useless. Note that the recommended way to add tags is by calling SetTag method that performs tags dedup, which is fast but not free.

Set*TagsAsync: Similar to EnrichWith*, the Set*TagsAsync should be called only when Activity.IsAllDataRequested = true.

gao-artur commented 10 months ago

Since the operation type is not available before validation, it can be useful to move the Filter execution to TelemetryListener.AfterValidationAsync. It also means changing the delegate type.

gao-artur commented 10 months ago

Also, if I want to provide my own implementation of the GraphQLTelemetryProvider, it's not an easy task.

Shane32 commented 10 months ago

Filter acts like skip

As long as there is documentation or a MS reference or something that we can use as a basis, I'm fine with changing it like you suggested. It should be well documented (e.g. within the XML comments) how the filter works, as it will then be filtering other events and not just GraphQL.NET events. Also it's very likely that ASP.NET Core will already have recorded a trace event before hitting that code. It will be important to consider the effect that changing the property will have. Should it change the values back at the end of the method? Similarly, what if a filter were implemented at a deeper layer, such as in a service within some business logic? How would it operate, and how would it interact with this code?

Please note that I would not use OpenTelemetry as a basis for documentation. Activity and related classes are part of .NET and the operation of GraphQL.NET code should be based on official MS / .NET documentation, not a third-party library. All GraphQL.NET telemetry code is designed to work with any telemetry source providers.

Currently, EnrichWith* delegates are executed regardless of the Activity.IsAllDataRequested state

Similar to EnrichWith, the SetTagsAsync should be called only when Activity.IsAllDataRequested = true.

Similar to the above, if there is supporting documentation that is not OpenTelemetry-specific, indicating that this is the preferred behavior, than I would support the change. I have not yet seen it. Rather, the Microsoft documentation says this:

If you're instrumenting functions with high-performance requirements, Activity.IsAllDataRequested is a hint that indicates whether any of the code listening to Activities intends to read auxiliary information such as Tags. If no listener will read it, then there is no need for the instrumented code to spend CPU cycles populating it. For simplicity, this sample doesn't apply that optimization.

The above snippet indicates that even if IsAllDataRequested is false, the activity may be logged for metrics, even if all the details are not logged. In addition, the current implementation should have an extremely minimal performance impact when IsAllDataRequested is false. It will add the document listener, creating an allocation, but all other calls should be allocation-free, essentially a noop. Besides, these calls occur only during the start and end of the GraphQL request, and is not dependent on the complexity of the GraphQL request.

It seems to me that OpenTelemetry's provider simply uses the IsAllDataRequested flag in the manner you described, which may be different than other telemetry provider's implementations.

Since the operation type is not available before validation, it can be useful to move the Filter execution to TelemetryListener.AfterValidationAsync. It also means changing the delegate type.

I disagree. The activity should start at the beginning of the operation. It's very troublesome investigating timeline logs and having 'missing time'. Imagine that the validation included authentication, which required some DB accesses to verify auth tokens. Now those DB calls' activity events occur outside the GraphQL operation. Moreover, if the filter indicated to skip logging, now there would be DB activity events logged but no GraphQL events logged, making it even more confusing to read the logs.

In your scenario, users could execute this prior to calling .UseTelemetry():

        .ConfigureExecution((options, next) =>
        {
            if (options.Document == null && options.Query != null)
            {
                // add try/catch block here
                options.Document = GraphQLParser.Parser.Parse(options.Query, new GraphQLParser.ParserOptions { Ignore = GraphQLParser.IgnoreOptions.Comments });
            }
            return next(options);
        });

This would then not time the portion of the code where the parser executes, and if document caching or APQ support were enabled, it would also be required to be installed prior to this code. But you could program the filter to ignore certain operations.

If I want to override ExecuteAsync

Then there's no point to inheriting from the class. The protected methods serve to be the extension points here. Just duplicate the class and change as desired. The only reason ExecuteAsync is virtual is so that you can wrap functionality, and since activity is both created and disposed within the base method, there's not even much use there. There's already extension points for code to run before and after the call to next(options), via protected methods SetInitialTagsAsync and SetResultTagsAsync.

It can be simplified by adding UseTelemetry<TProvider>(...) where TProvider : GraphQLTelemetryProvider overload.

Makes sense to me.

Shane32 commented 10 months ago

I also reviewed your code that sets activity.IsAllDataRequested = false; based on the filter. Why would this have any effect on other 'child' activites, such as DB calls? A DB call is going to call ActivitySource.StartActivity itself; wouldn't that new object have IsAllDataRequested as true?

If not, it seems like it would be implementation-specific behavior.

gao-artur commented 10 months ago

I'll try to explore other tracing solutions, currently, I'll answer in the OpenTelemetry context.

Also it's very likely that ASP.NET Core will already have recorded a trace event before hitting that code.

You are right, and I asked a relevant question in the past, unfortunately, with no answer. Today, I understand there are workarounds when GraphQL is the only child of ASP.NET, but it is impossible in general cases.

Should it change the values back at the end of the method?

No, because we change the value of the activity we just created. It doesn't affect anything except this activity and its children. The parent activity is not affected either.

Similarly, what if a filter were implemented at a deeper layer

These filters won't be executed because their activity will be null. See the unit test below.

the current implementation should have an extremely minimal performance impact when IsAllDataRequested is false

You are right about tags added by the library. There are no allocations, but the dedup logic is still executed when calling SetTag. And the performance of SetTag depends on the number of tags added to the activity because it enumerates the whole list on every call. I saw a discussion about recommending using AddTag (no dedup) instead of SetTag when the list of tags is controlled and it's known there are no duplicates to avoid this overhead. But EnrichWith is a user code that can be expensive. These methods shouldn't be executed if the activity will be discarded.

The activity should start at the beginning of the operation.

I didn't suggest moving the activity start. Only the filter decision. Note that activity must be created to record the filter decision. But your following thoughts prove this is a bad suggestion:

Moreover, if the filter indicated to skip logging, now there would be DB activity events logged but no GraphQL events logged

Note it happens because validation activities were created before the filter decision was made and not because they are created in different scope. They will still be children of GraphQL activity.

The only reason ExecuteAsync is virtual is so that you can wrap functionality

Currently, I'm overriding it to change the filter implementation as suggested above, but I want the other virtual methods to have their original implementation. But I agree that in general case, this method shouldn't be overridden.

I also reviewed your code that sets activity.IsAllDataRequested = false;

You are right, setting IsAllDataRequested = false won't change child activity behavior. What matters is resetting the ActivityTraceFlags.Recorded flag which will cause all the children's activities to be null. The combination of both is equivalent to the sampler decision Drop.

Check this unit test to see the effect of IsAllDataRequested and ActivityTraceFlags on child activities:

private static readonly ActivitySource _activitySource = new("Test");

[Theory]
[InlineData(false)]
[InlineData(true)]
public async Task Parent(bool resetRecordedFlag)
{
    var exportedItems = new List<Activity>();
    using (Sdk.CreateTracerProviderBuilder()
               .AddSource("Test")
               .AddInMemoryExporter(exportedItems)
               .Build())
    {
        // the parent is created with IsAllDataRequested = true;
        using var parent = _activitySource.StartActivity();
        parent.ShouldNotBeNull();
        parent.IsAllDataRequested.ShouldBeTrue();

        // no configuration change on parent - the child created with IsAllDataRequested = true
        using var firstChild = await CreateChild("FirstChild");
        firstChild.ShouldNotBeNull();
        firstChild.IsAllDataRequested.ShouldBeTrue();

        // parent.IsAllDataRequested = false, but child created with IsAllDataRequested = true
        parent.IsAllDataRequested = false;
        using var secondChild = await CreateChild("SecondChild");
        secondChild.ShouldNotBeNull();
        secondChild.IsAllDataRequested.ShouldBeTrue();

        if (resetRecordedFlag)
        {
            // after resetting ActivityTraceFlags.Recorded flag, all the children will be null
            parent.ActivityTraceFlags &= ~ActivityTraceFlags.Recorded;
            using var thirdChild = await CreateChild("ThirdChild");
            thirdChild.ShouldBeNull();
        }
    }

    // note the Parent is not exported regardless of the ActivityTraceFlags
    // what's matter is IsAllDataRequested
    exportedItems.Select(activity => activity.DisplayName).OrderBy(x => x).ToArray()
        .ShouldBeEquivalentTo(new[] { "FirstChild", "SecondChild" });
}

private async Task<Activity> CreateChild(string name)
{
    await Task.CompletedTask;
    return _activitySource.StartActivity(name);
}

Also note that ActivityTraceFlags is used to propagate the sampling or filter decision to the remote process, so they won't make the opposite decision.

gao-artur commented 10 months ago

The ActivityListener with a specific configuration can be used instead of OTel SDK to get the same result.

private static readonly ActivitySource _activitySource = new("Test");

[Theory]
[InlineData(true, false)]
[InlineData(true, true)]
[InlineData(false, false)]
[InlineData(false, true)]
public async Task Parent(bool useOTel, bool resetRecordedFlag)
{
    var exportedItems = new List<Activity>();
    TracerProvider tracer = null;

    if (useOTel)
    {
        tracer = Sdk.CreateTracerProviderBuilder()
            .AddSource("Test")
            .AddInMemoryExporter(exportedItems)
            .Build();
    }
    else
    {
        Activity.DefaultIdFormat = ActivityIdFormat.W3C;
        Activity.ForceDefaultIdFormat = true;
        ActivitySource.AddActivityListener(new ActivityListener
        {
            ShouldListenTo = source => source.Name == "Test",
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                options.Parent == default ||
                options.Parent.TraceFlags.HasFlag(ActivityTraceFlags.Recorded)
                    ? ActivitySamplingResult.AllDataAndRecorded
                    : ActivitySamplingResult.None,
            ActivityStopped = activity =>
            {
                if (activity.IsAllDataRequested)
                {
                    exportedItems.Add(activity);
                }
            }
        });
    }

    using (tracer)
    {
        // the parent is created with IsAllDataRequested = true;
        using var parent = _activitySource.StartActivity();
        parent.ShouldNotBeNull();
        parent.IsAllDataRequested.ShouldBeTrue();

        // no configuration change on parent - the child created with IsAllDataRequested = true
        using var firstChild = await CreateChild("FirstChild");
        firstChild.ShouldNotBeNull();
        firstChild.IsAllDataRequested.ShouldBeTrue();

        // parent.IsAllDataRequested = false, but child created with IsAllDataRequested = true
        parent.IsAllDataRequested = false;
        using var secondChild = await CreateChild("SecondChild");
        secondChild.ShouldNotBeNull();
        secondChild.IsAllDataRequested.ShouldBeTrue();

        if (resetRecordedFlag)
        {
            // after resetting ActivityTraceFlags.Recorded flag, all the children will be null
            parent.ActivityTraceFlags &= ~ActivityTraceFlags.Recorded;
            using var thirdChild = await CreateChild("ThirdChild");
            thirdChild.ShouldBeNull();
        }
    }

    // note the Parent is not exported regardless of the ActivityTraceFlags
    // what's matter is IsAllDataRequested
    exportedItems.Select(activity => activity.DisplayName).OrderBy(x => x).ToArray()
        .ShouldBeEquivalentTo(new[] { "FirstChild", "SecondChild" });
}

private async Task<Activity> CreateChild(string name)
{
    await Task.CompletedTask;
    return _activitySource.StartActivity(name);
}
Shane32 commented 10 months ago

Just for the record, I've not used OpenTelemetry much, and when I attempted to do so, I was unable to figure out how to view the logs in Azure. As far as I can tell it did not record any custom events. I'm sure I'm doing something wrong, but in any case it's obvious you have much more experience here than me. So I'm just providing another perspective here.

which will cause all the children's activities to be null

Hmm, I'm curious: in the context of OpenTelemetry, if a sample will not be recorded, doesn't StartActivity return null just as if ActivityTraceFlags.Recorded was not set?

gao-artur commented 10 months ago

Do you mean SamplingDecision.Drop? image

gao-artur commented 10 months ago

Sometimes .NET implementation deviates from the spec. For example, the activity must be created even when the trace is dropped to allow correlating logs. Also, setting the tag with null value removes the tag from the activity, which is not spec compliment (I can't find the issue right now).

gao-artur commented 10 months ago

I found another problem in the current implementation. The GraphQLTelemetryProvider.StartActivityAsync can't be async. When calling ActivitySource.StartActivity, it sets the newly created activity in Activity.Current, which is AsyncLocal<>. If the user overrides this method and makes it async, the ExecuteAsync will see its parent in Activity.Current and not the newly created activity. This method should be changed to sync.

Shane32 commented 10 months ago

can’t be async

You’re right.

Shane32 commented 10 months ago

So in the case of SamplingDecision.Drop: ASP.NET is going to create the 'root' Activity, with IsAllDataRequested false and Activity.Recorded not set. Then control passes to GraphQL.NET and since Recorded is not set on the parent Activity, StartActivity returns null?

gao-artur commented 10 months ago

Correct. And the Activity.Current keeps the reference to Asp.Net created activity. It means all the downstream calls to ActivitySource.StartActivity will return.

Shane32 commented 10 months ago

So from a practical viewpoint, IsAllDataRequested will probably always be true within GraphQL.NET, unless the filter were to change it (which is not currently possible as the filter runs before the activity is created).

gao-artur commented 10 months ago

Sounds correct. And I think setting IsAllDataRequested = false without resetting Recorded flag is a bad idea because the child activity will have its ParentId set to an activity that won't be exported. Then, when building the trace visualization, this span will be shown as a root span in the trace, because its parent doesn't exist (that's what Jaeger and Tempo do). But I need to verify yet if there are some legal usages for IsAllDataRequested = false alone.

Shane32 commented 10 months ago

So considering all of our discussion here, what are our conclusions?

Suggestion @Shane32 suggests @gao-artur suggests
Filter occur after validation No
Filter after creating Activity Yes
Filter should clear Recorded, etc Yes
StartActivity not async Yes
AddTag vs SetTag Leave as-is
Set tags when ... as-is, or when IsAllDataRequested or Recorded
EnrichWith when ... matching when setting tags
Add extension methods Yes
Make TelemetryListener public No
Shane32 commented 10 months ago

Does filtering after creating an activity have a benefit? Seems like we would need to create the activity in order to set Recorded=false. But then we might as well create the activity first and pass it to the filter method as well. And if the activity is null we need not even call the filter method.

I still not convinced that we should not record any tags when IsAllDataRequested=false. It seems like quite a low-cost operation, and it only occurs a couple times per request. At the same time, the MS docs would indicate that tags don't need to be added when IsAllDataRequested is false. There's also Recorded=false - do we have any if conditions based on that flag? And the extension methods - should they execute regardless (letting user code check for IsAllDataRequested)? Can we imagine a use case where users want Enrich to run even if they are not adding tags? Maybe if we don't set any tags then we don't need to call enrich.

Shane32 commented 10 months ago

Does our decision apply equally to all EnrichWith? Seems like if the telemetry provider was recording metrics (in theory) with IsAllDataRequested=false and Recorded=True, seems like it would be important to call activity.SetStatus, for example.

Shane32 commented 10 months ago

Maybe the only time we should skip calling EnrichWith or setting tags is when both IsAllDataRequested=false and Recorded=false

gao-artur commented 10 months ago

Updated my column.

Suggestion @Shane32 suggests @gao-artur suggests
Filter occur after validation No No
Filter after creating Activity Yes
Filter should clear Recorded, etc Yes Yes
StartActivity not async Yes Yes
AddTag vs SetTag Leave as-is Leave as-is
SetTag when IsAllDataRequested=false No
EnrichWith when IsAllDataRequested=false No
SetTag when Recorded=false Don't check this at all
EnrichWith when Recorded=false Don't check this at all
Add extension methods Yes Yes
Make TelemetryListener public No No

create the activity first and pass it to the filter method as well

The filter shouldn't be based on activity, and the user shouldn't alter the activity from the filter. Also, it will be inconsistent with any other instrumentation providing a Filter delegate.

And if the activity is null we need not even call the filter method.

Correct. Just skip all the telemetry logic.

I still not convinced that we should not record any tags when IsAllDataRequested=false

Yes, this part is the most doubtful. I base my opinion on other instrumentation implementations in the core and contrib repos, and all of them skip adding tags when IsAllDataRequested=false. And if we skip adding tags, there is no point in giving the app writers to add their tags.

recording metrics

Unfortunately, I haven't worked with metrics yet and don't know how they work and what info they need, so I can't give any advice here.

when both IsAllDataRequested=false and Recorded=false

I think Recorded flag is for SDK and shouldn't be examined by instrumentations but I'm not 100% sure.

gao-artur commented 10 months ago

Actually, traces and metrics are independent API's and we shouldn't consider them here. During the initial OTel PR review, there was a suggestion to add the word Trace to the extension methods and options class name to extend the telemetry with metrics in the future.

Shane32 commented 10 months ago

Do you want to make a PR for review? Probably need to target develop if we are changing the StartActivity method signature.

Also if we allow derived classes from GraphQLTelemetryProvider then we need to consider the effect when someone calls UseTelemetry<T> on a derived class after OpenTelemetry has already run EnableAutoInstrumentation and the base class was already added to the pipeline. As of now, TryRegister prevents it from being registered twice.

Shane32 commented 10 months ago

You could also separate into separate PRs the different feature changes we have discussed, for easier review and approval (??) - your discretion.

Shane32 commented 10 months ago

Also if we allow derived classes from GraphQLTelemetryProvider then we need to consider the effect when someone calls UseTelemetry<T> on a derived class after OpenTelemetry has already run EnableAutoInstrumentation and the base class was already added to the pipeline. As of now, TryRegister prevents it from being registered twice.

I'll take a whack at this

gao-artur commented 10 months ago

Sure, will do. Is it ok to add a sync StartActivity and mark the StartActivityAsync as obsolete to allow this this change in v7? The StartActivityAsync will call the sync version internally. And what about the change in the Filter behavior? Do you consider it a breaking change?