Closed gao-artur closed 10 months ago
Since the operation type is not available before validation, it can be useful to move the Filter
execution to TelemetryListener.AfterValidationAsync
. It also means changing the delegate type.
Also, if I want to provide my own implementation of the GraphQLTelemetryProvider
, it's not an easy task.
ExecuteAsync
, I need to copy the private TelemetryListener
and change its constructor to receive my custom provider implementation type instead of GraphQLTelemetryProvider
to be able to access the protected SetOperationTagsAsync
method.UseTelemetry
to know how to do that, which is still not trivial. It can be simplified by adding UseTelemetry<TProvider>(...) where TProvider : GraphQLTelemetryProvider
overload.Filter acts like skip
As long as there is documentation or a MS reference or something that we can use as a basis, I'm fine with changing it like you suggested. It should be well documented (e.g. within the XML comments) how the filter works, as it will then be filtering other events and not just GraphQL.NET events. Also it's very likely that ASP.NET Core will already have recorded a trace event before hitting that code. It will be important to consider the effect that changing the property will have. Should it change the values back at the end of the method? Similarly, what if a filter were implemented at a deeper layer, such as in a service within some business logic? How would it operate, and how would it interact with this code?
Please note that I would not use OpenTelemetry as a basis for documentation. Activity
and related classes are part of .NET and the operation of GraphQL.NET code should be based on official MS / .NET documentation, not a third-party library. All GraphQL.NET telemetry code is designed to work with any telemetry source providers.
Currently, EnrichWith* delegates are executed regardless of the Activity.IsAllDataRequested state
Similar to EnrichWith, the SetTagsAsync should be called only when Activity.IsAllDataRequested = true.
Similar to the above, if there is supporting documentation that is not OpenTelemetry
-specific, indicating that this is the preferred behavior, than I would support the change. I have not yet seen it. Rather, the Microsoft documentation says this:
If you're instrumenting functions with high-performance requirements, Activity.IsAllDataRequested is a hint that indicates whether any of the code listening to Activities intends to read auxiliary information such as Tags. If no listener will read it, then there is no need for the instrumented code to spend CPU cycles populating it. For simplicity, this sample doesn't apply that optimization.
The above snippet indicates that even if IsAllDataRequested
is false
, the activity may be logged for metrics, even if all the details are not logged. In addition, the current implementation should have an extremely minimal performance impact when IsAllDataRequested
is false
. It will add the document listener, creating an allocation, but all other calls should be allocation-free, essentially a noop. Besides, these calls occur only during the start and end of the GraphQL request, and is not dependent on the complexity of the GraphQL request.
It seems to me that OpenTelemetry
's provider simply uses the IsAllDataRequested
flag in the manner you described, which may be different than other telemetry provider's implementations.
Since the operation type is not available before validation, it can be useful to move the Filter execution to TelemetryListener.AfterValidationAsync. It also means changing the delegate type.
I disagree. The activity should start at the beginning of the operation. It's very troublesome investigating timeline logs and having 'missing time'. Imagine that the validation included authentication, which required some DB accesses to verify auth tokens. Now those DB calls' activity events occur outside the GraphQL operation. Moreover, if the filter indicated to skip logging, now there would be DB activity events logged but no GraphQL events logged, making it even more confusing to read the logs.
In your scenario, users could execute this prior to calling .UseTelemetry()
:
.ConfigureExecution((options, next) =>
{
if (options.Document == null && options.Query != null)
{
// add try/catch block here
options.Document = GraphQLParser.Parser.Parse(options.Query, new GraphQLParser.ParserOptions { Ignore = GraphQLParser.IgnoreOptions.Comments });
}
return next(options);
});
This would then not time the portion of the code where the parser executes, and if document caching or APQ support were enabled, it would also be required to be installed prior to this code. But you could program the filter to ignore certain operations.
If I want to override ExecuteAsync
Then there's no point to inheriting from the class. The protected methods serve to be the extension points here. Just duplicate the class and change as desired. The only reason ExecuteAsync
is virtual
is so that you can wrap functionality, and since activity
is both created and disposed within the base method, there's not even much use there. There's already extension points for code to run before and after the call to next(options)
, via protected methods SetInitialTagsAsync
and SetResultTagsAsync
.
It can be simplified by adding
UseTelemetry<TProvider>(...) where TProvider : GraphQLTelemetryProvider
overload.
Makes sense to me.
I also reviewed your code that sets activity.IsAllDataRequested = false;
based on the filter. Why would this have any effect on other 'child' activites, such as DB calls? A DB call is going to call ActivitySource.StartActivity
itself; wouldn't that new object have IsAllDataRequested
as true
?
If not, it seems like it would be implementation-specific behavior.
I'll try to explore other tracing solutions, currently, I'll answer in the OpenTelemetry context.
Also it's very likely that ASP.NET Core will already have recorded a trace event before hitting that code.
You are right, and I asked a relevant question in the past, unfortunately, with no answer. Today, I understand there are workarounds when GraphQL is the only child of ASP.NET, but it is impossible in general cases.
Should it change the values back at the end of the method?
No, because we change the value of the activity we just created. It doesn't affect anything except this activity and its children. The parent activity is not affected either.
Similarly, what if a filter were implemented at a deeper layer
These filters won't be executed because their activity will be null
. See the unit test below.
the current implementation should have an extremely minimal performance impact when IsAllDataRequested is false
You are right about tags added by the library. There are no allocations, but the dedup logic is still executed when calling SetTag
. And the performance of SetTag
depends on the number of tags added to the activity because it enumerates the whole list on every call. I saw a discussion about recommending using AddTag
(no dedup) instead of SetTag
when the list of tags is controlled and it's known there are no duplicates to avoid this overhead.
But EnrichWith
is a user code that can be expensive. These methods shouldn't be executed if the activity will be discarded.
The activity should start at the beginning of the operation.
I didn't suggest moving the activity start. Only the filter decision. Note that activity must be created to record the filter decision. But your following thoughts prove this is a bad suggestion:
Moreover, if the filter indicated to skip logging, now there would be DB activity events logged but no GraphQL events logged
Note it happens because validation activities were created before the filter decision was made and not because they are created in different scope. They will still be children of GraphQL activity.
The only reason ExecuteAsync is virtual is so that you can wrap functionality
Currently, I'm overriding it to change the filter implementation as suggested above, but I want the other virtual methods to have their original implementation. But I agree that in general case, this method shouldn't be overridden.
I also reviewed your code that sets activity.IsAllDataRequested = false;
You are right, setting IsAllDataRequested = false
won't change child activity behavior. What matters is resetting the ActivityTraceFlags.Recorded
flag which will cause all the children's activities to be null
. The combination of both is equivalent to the sampler decision Drop
.
Check this unit test to see the effect of IsAllDataRequested
and ActivityTraceFlags
on child activities:
private static readonly ActivitySource _activitySource = new("Test");
[Theory]
[InlineData(false)]
[InlineData(true)]
public async Task Parent(bool resetRecordedFlag)
{
var exportedItems = new List<Activity>();
using (Sdk.CreateTracerProviderBuilder()
.AddSource("Test")
.AddInMemoryExporter(exportedItems)
.Build())
{
// the parent is created with IsAllDataRequested = true;
using var parent = _activitySource.StartActivity();
parent.ShouldNotBeNull();
parent.IsAllDataRequested.ShouldBeTrue();
// no configuration change on parent - the child created with IsAllDataRequested = true
using var firstChild = await CreateChild("FirstChild");
firstChild.ShouldNotBeNull();
firstChild.IsAllDataRequested.ShouldBeTrue();
// parent.IsAllDataRequested = false, but child created with IsAllDataRequested = true
parent.IsAllDataRequested = false;
using var secondChild = await CreateChild("SecondChild");
secondChild.ShouldNotBeNull();
secondChild.IsAllDataRequested.ShouldBeTrue();
if (resetRecordedFlag)
{
// after resetting ActivityTraceFlags.Recorded flag, all the children will be null
parent.ActivityTraceFlags &= ~ActivityTraceFlags.Recorded;
using var thirdChild = await CreateChild("ThirdChild");
thirdChild.ShouldBeNull();
}
}
// note the Parent is not exported regardless of the ActivityTraceFlags
// what's matter is IsAllDataRequested
exportedItems.Select(activity => activity.DisplayName).OrderBy(x => x).ToArray()
.ShouldBeEquivalentTo(new[] { "FirstChild", "SecondChild" });
}
private async Task<Activity> CreateChild(string name)
{
await Task.CompletedTask;
return _activitySource.StartActivity(name);
}
Also note that ActivityTraceFlags
is used to propagate the sampling or filter decision to the remote process, so they won't make the opposite decision.
The ActivityListener
with a specific configuration can be used instead of OTel SDK to get the same result.
private static readonly ActivitySource _activitySource = new("Test");
[Theory]
[InlineData(true, false)]
[InlineData(true, true)]
[InlineData(false, false)]
[InlineData(false, true)]
public async Task Parent(bool useOTel, bool resetRecordedFlag)
{
var exportedItems = new List<Activity>();
TracerProvider tracer = null;
if (useOTel)
{
tracer = Sdk.CreateTracerProviderBuilder()
.AddSource("Test")
.AddInMemoryExporter(exportedItems)
.Build();
}
else
{
Activity.DefaultIdFormat = ActivityIdFormat.W3C;
Activity.ForceDefaultIdFormat = true;
ActivitySource.AddActivityListener(new ActivityListener
{
ShouldListenTo = source => source.Name == "Test",
Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
options.Parent == default ||
options.Parent.TraceFlags.HasFlag(ActivityTraceFlags.Recorded)
? ActivitySamplingResult.AllDataAndRecorded
: ActivitySamplingResult.None,
ActivityStopped = activity =>
{
if (activity.IsAllDataRequested)
{
exportedItems.Add(activity);
}
}
});
}
using (tracer)
{
// the parent is created with IsAllDataRequested = true;
using var parent = _activitySource.StartActivity();
parent.ShouldNotBeNull();
parent.IsAllDataRequested.ShouldBeTrue();
// no configuration change on parent - the child created with IsAllDataRequested = true
using var firstChild = await CreateChild("FirstChild");
firstChild.ShouldNotBeNull();
firstChild.IsAllDataRequested.ShouldBeTrue();
// parent.IsAllDataRequested = false, but child created with IsAllDataRequested = true
parent.IsAllDataRequested = false;
using var secondChild = await CreateChild("SecondChild");
secondChild.ShouldNotBeNull();
secondChild.IsAllDataRequested.ShouldBeTrue();
if (resetRecordedFlag)
{
// after resetting ActivityTraceFlags.Recorded flag, all the children will be null
parent.ActivityTraceFlags &= ~ActivityTraceFlags.Recorded;
using var thirdChild = await CreateChild("ThirdChild");
thirdChild.ShouldBeNull();
}
}
// note the Parent is not exported regardless of the ActivityTraceFlags
// what's matter is IsAllDataRequested
exportedItems.Select(activity => activity.DisplayName).OrderBy(x => x).ToArray()
.ShouldBeEquivalentTo(new[] { "FirstChild", "SecondChild" });
}
private async Task<Activity> CreateChild(string name)
{
await Task.CompletedTask;
return _activitySource.StartActivity(name);
}
Just for the record, I've not used OpenTelemetry much, and when I attempted to do so, I was unable to figure out how to view the logs in Azure. As far as I can tell it did not record any custom events. I'm sure I'm doing something wrong, but in any case it's obvious you have much more experience here than me. So I'm just providing another perspective here.
which will cause all the children's activities to be null
Hmm, I'm curious: in the context of OpenTelemetry, if a sample will not be recorded, doesn't StartActivity
return null just as if ActivityTraceFlags.Recorded
was not set?
Do you mean SamplingDecision.Drop
?
Sometimes .NET implementation deviates from the spec. For example, the activity must be created even when the trace is dropped to allow correlating logs. Also, setting the tag with null
value removes the tag from the activity, which is not spec compliment (I can't find the issue right now).
I found another problem in the current implementation. The GraphQLTelemetryProvider.StartActivityAsync
can't be async
. When calling ActivitySource.StartActivity
, it sets the newly created activity in Activity.Current
, which is AsyncLocal<>
. If the user overrides this method and makes it async
, the ExecuteAsync
will see its parent in Activity.Current
and not the newly created activity. This method should be changed to sync.
can’t be async
You’re right.
So in the case of SamplingDecision.Drop: ASP.NET is going to create the 'root' Activity, with IsAllDataRequested
false and Activity.Recorded
not set. Then control passes to GraphQL.NET and since Recorded is not set on the parent Activity, StartActivity
returns null?
Correct. And the Activity.Current
keeps the reference to Asp.Net created activity. It means all the downstream calls to ActivitySource.StartActivity
will return.
So from a practical viewpoint, IsAllDataRequested
will probably always be true
within GraphQL.NET, unless the filter were to change it (which is not currently possible as the filter runs before the activity is created).
Sounds correct. And I think setting IsAllDataRequested = false
without resetting Recorded
flag is a bad idea because the child activity will have its ParentId
set to an activity that won't be exported. Then, when building the trace visualization, this span will be shown as a root span in the trace, because its parent doesn't exist (that's what Jaeger and Tempo do).
But I need to verify yet if there are some legal usages for IsAllDataRequested = false
alone.
So considering all of our discussion here, what are our conclusions?
Suggestion | @Shane32 suggests | @gao-artur suggests |
---|---|---|
Filter occur after validation | No | |
Filter after creating Activity | Yes | |
Filter should clear Recorded, etc | Yes | |
StartActivity not async | Yes | |
AddTag vs SetTag | Leave as-is | |
Set tags when ... | as-is, or when IsAllDataRequested or Recorded | |
EnrichWith when ... | matching when setting tags | |
Add extension methods | Yes | |
Make TelemetryListener public | No |
Does filtering after creating an activity have a benefit? Seems like we would need to create the activity in order to set Recorded=false. But then we might as well create the activity first and pass it to the filter method as well. And if the activity is null we need not even call the filter method.
I still not convinced that we should not record any tags when IsAllDataRequested=false. It seems like quite a low-cost operation, and it only occurs a couple times per request. At the same time, the MS docs would indicate that tags don't need to be added when IsAllDataRequested is false. There's also Recorded=false - do we have any if conditions based on that flag? And the extension methods - should they execute regardless (letting user code check for IsAllDataRequested)? Can we imagine a use case where users want Enrich to run even if they are not adding tags? Maybe if we don't set any tags then we don't need to call enrich.
Does our decision apply equally to all EnrichWith? Seems like if the telemetry provider was recording metrics (in theory) with IsAllDataRequested=false and Recorded=True, seems like it would be important to call activity.SetStatus
, for example.
Maybe the only time we should skip calling EnrichWith or setting tags is when both IsAllDataRequested=false and Recorded=false
Updated my column.
Suggestion | @Shane32 suggests | @gao-artur suggests |
---|---|---|
Filter occur after validation | No | No |
Filter after creating Activity | Yes | |
Filter should clear Recorded, etc | Yes | Yes |
StartActivity not async | Yes | Yes |
AddTag vs SetTag | Leave as-is | Leave as-is |
SetTag when IsAllDataRequested=false | No | |
EnrichWith when IsAllDataRequested=false | No | |
SetTag when Recorded=false | Don't check this at all | |
EnrichWith when Recorded=false | Don't check this at all | |
Add extension methods | Yes | Yes |
Make TelemetryListener public | No | No |
create the activity first and pass it to the filter method as well
The filter shouldn't be based on activity, and the user shouldn't alter the activity from the filter. Also, it will be inconsistent with any other instrumentation providing a Filter
delegate.
And if the activity is null we need not even call the filter method.
Correct. Just skip all the telemetry logic.
I still not convinced that we should not record any tags when IsAllDataRequested=false
Yes, this part is the most doubtful. I base my opinion on other instrumentation implementations in the core and contrib repos, and all of them skip adding tags when IsAllDataRequested=false
. And if we skip adding tags, there is no point in giving the app writers to add their tags.
recording metrics
Unfortunately, I haven't worked with metrics yet and don't know how they work and what info they need, so I can't give any advice here.
when both IsAllDataRequested=false and Recorded=false
I think Recorded
flag is for SDK and shouldn't be examined by instrumentations but I'm not 100% sure.
Actually, traces and metrics are independent API's and we shouldn't consider them here. During the initial OTel PR review, there was a suggestion to add the word Trace
to the extension methods and options class name to extend the telemetry with metrics in the future.
Do you want to make a PR for review? Probably need to target develop if we are changing the StartActivity method signature.
Also if we allow derived classes from GraphQLTelemetryProvider
then we need to consider the effect when someone calls UseTelemetry<T>
on a derived class after OpenTelemetry has already run EnableAutoInstrumentation
and the base class was already added to the pipeline. As of now, TryRegister prevents it from being registered twice.
You could also separate into separate PRs the different feature changes we have discussed, for easier review and approval (??) - your discretion.
Also if we allow derived classes from
GraphQLTelemetryProvider
then we need to consider the effect when someone callsUseTelemetry<T>
on a derived class after OpenTelemetry has already runEnableAutoInstrumentation
and the base class was already added to the pipeline. As of now, TryRegister prevents it from being registered twice.
I'll take a whack at this
Sure, will do. Is it ok to add a sync StartActivity
and mark the StartActivityAsync
as obsolete to allow this this change in v7? The StartActivityAsync
will call the sync version internally.
And what about the change in the Filter
behavior? Do you consider it a breaking change?
Filter: In the current implementation,
GraphQLTelemetryOptions.Filter
acts more likeSkip
orSuppress
. It only skips the GraphQL span creation, but all the following spans will be created as usual. TheFilter
option is usually used to "cut" the span creation from this point, so, for example, if the graph resolver makes a DB roundtrip, this won't be recorded. See my implementation in theHangfire
instrumentation, for example (which is based on how this feature is implemented in other instrumentations). When is this useful? For example, we have a few mutations that, in some cases, can create a huge trace (20k+ spans). Parts of these traces can be dropped because of memory limits, breaking the trace and making it useless. We want to be able to filter these traces and not create them at all based on mutation names.EnrichWith*: Currently,
EnrichWith*
delegates are executed regardless of theActivity.IsAllDataRequested
state which is a waste of CPU. When theIsAllDataRequested = false
, the span (activity) won't be recorded anyway, so adding tags is useless. Note that the recommended way to add tags is by calling SetTag method that performs tags dedup, which is fast but not free.Set*TagsAsync: Similar to
EnrichWith*
, theSet*TagsAsync
should be called only whenActivity.IsAllDataRequested = true
.