Ok, I'll look forward to meeting with you on Friday - my first thought here is that these should possibly be filtered out via OpenTelemetry's trace Processors, which include support for filtering out spans: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/6341
I'll see if I can prototype something in that regard today. I just wrote an implementation last week to use OTel metrics views to filter out unwanted metrics: https://phobos.petabridge.com/articles/trace-filtering.html#filtering-metrics
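For context, dropping a metric via an OTel view looks roughly like this - a minimal sketch, not the code from the article linked above; "MyApp.Metrics" and "unwanted_metric_name" are placeholder names:

// sketch: registers a metrics view that drops an unwanted instrument entirely
services.AddOpenTelemetryMetrics(builder =>
{
    builder
        .AddMeter("MyApp.Metrics") // placeholder meter name
        .AddView(instrumentName: "unwanted_metric_name", MetricStreamConfiguration.Drop);
});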
My worry is that dropping 1-2 spans in the middle of a transaction might break the entire trace, but I'll have to test it
I'll put together a prototype prior to our meeting and report my findings on this thread.
My worry is that dropping 1-2 spans in the middle of a transaction might break the entire trace, but I'll have to test it
This is my fear too because the 4 spans above form a parent-child hierarchy: 1 to 2 to 3 to 4. So simply filtering out two intermediate spans can break the whole chain.
If all spans belonged to the same node, I could simply pass the 1st span's reusable context to the 4th span and the transaction would continue. But spans 2-3 sit at the cluster node boundary.
What I would like is to be able to get hold of the UsableContext that is passed to the auto-generated span for the shard message, without actually auto-generating those spans. That would keep only explicitly created spans in the trace transaction, shaping it perfectly.
If all spans belonged to the same node, I could simply pass the 1st span's reusable context to the 4th span and the transaction would continue. But spans 2-3 sit at the cluster node boundary.
With Akka.Cluster.Sharding there are no guarantees that will be the case
So it looks like using an OpenTelemetry filter processor works:
You'll get some of these errors inside Jaeger, but the trace still renders:
invalid parent span IDs=f99ad359a9f63e3d; skipping clock skew adjustment
My source code for the Processor:
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using OpenTelemetry;

/// <summary>
/// Excludes all spans created by the built-in sharding actors
/// </summary>
public class ExcludeShardingProcessor : CompositeProcessor<Activity>
{
    public override void OnEnd(Activity data)
    {
        // drop any span emitted by one of the Akka.Cluster.Sharding system actors
        if (data.Tags.Any(c => c.Key.Equals("akka.actor.type")
                               && c.Value != null
                               && c.Value.Contains("Akka.Cluster.Sharding")))
            return; // filter out

        base.OnEnd(data);
    }

    public ExcludeShardingProcessor(IEnumerable<BaseProcessor<Activity>> processors) : base(processors)
    {
    }
}
Note that this has to be a CompositeProcessor in order for filtration to work - relevant explanation: https://github.com/open-telemetry/opentelemetry-dotnet/blob/4310e08afdd3a6d2533836c3d71a6f56cf3148ef/docs/trace/extending-the-sdk/MyFilteringProcessor.cs#L26-L33
And then integrating the processor into my APM pipeline:
var resource = ResourceBuilder.CreateDefault()
    .AddService(Assembly.GetEntryAssembly().GetName().Name, serviceInstanceId: $"{Dns.GetHostName()}");

// enables OpenTelemetry for ASP.NET / .NET Core
services.AddOpenTelemetryTracing(builder =>
{
    builder
        .SetResourceBuilder(resource)
        .AddPhobosInstrumentation()
        .AddHttpClientInstrumentation()
        .AddAspNetCoreInstrumentation()
        .AddProcessor(new ExcludeShardingProcessor(new[]
        {
            new SimpleActivityExportProcessor(new JaegerExporter(new JaegerExporterOptions()))
        }));
});
You have to wrap your OTEL exporter inside the filter per the OTEL .NET APIs, which for some reason is causing my ResourceBuilder
information to no longer appear inside my traces. I'm going to file an issue for that.
Reported issue in the OTEL repo: https://github.com/open-telemetry/opentelemetry-dotnet/issues/3603
Also, here's my PR where all of the work was performed: https://github.com/petabridge/akkadotnet-code-samples/pull/104
Thank you so much @Aaronontheweb for such a thorough investigation. I am going to give it a try.
@Aaronontheweb I tried using a custom processor inherited from CompositeProcessor.
Example: I let Phobos manage my traces by setting create-trace-upon-receive to "on". It generated the following trace transaction chain (showing operation names):
akka.msg.recv Nrk.Oddjob.Ps.PsGranittActors+GranittCommand+GetProgramRights
akka.msg.recv Microsoft.FSharp.Core.FSharpResult<Microsoft.FSharp.Collections.FSharpList<Nrk.Oddjob.Ps.PsTypes+PsProgramRights>, System.Exception>
akka.msg.recv Nrk.Oddjob.Ps.PsGranittActors+GranittCommand+GetProgramActiveFiles
akka.msg.recv Microsoft.FSharp.Core.FSharpResult<Microsoft.FSharp.Collections.FSharpList<Nrk.Oddjob.Ps.PsTypes+PsProgramFiles>, System.Exception>
akka.msg.recv Nrk.Oddjob.Ps.PsTypes+PsPrepareArchiveCommand
akka.msg.recv Nrk.Oddjob.Ps.PsGranittActors+GranittCommand+GetArchiveFiles
akka.msg.recv Microsoft.FSharp.Core.FSharpResult<Nrk.Oddjob.Ps.PsTypes+PsProgramArchiveFiles, System.Exception>
akka.msg.recv Microsoft.FSharp.Core.FSharpResult<Microsoft.FSharp.Core.Unit, System.Exception>
akka.msg.recv Nrk.Oddjob.Ps.PsTypes+PsArchiveLookupCommand
akka.msg.recv Nrk.Oddjob.Ps.PsGranittActors+GranittCommand+GetProgramActiveFiles
akka.msg.recv Microsoft.FSharp.Core.FSharpResult<Microsoft.FSharp.Collections.FSharpList<Nrk.Oddjob.Ps.PsTypes+PsProgramFiles>, System.Exception>
akka.msg.recv Microsoft.FSharp.Core.FSharpResult<Nrk.Oddjob.Ps.PsTypes+FileSetInfo, System.Exception>
akka.msg.recv Nrk.Oddjob.Ps.PsGranittActors+GranittCommand+GetPsPublishJobInfo
akka.msg.recv Microsoft.FSharp.Core.FSharpResult<Nrk.Oddjob.Ps.PsGranittActors+PsPublishJobInfo, System.Exception>
akka.msg.recv Nrk.Oddjob.Ps.PsTypes+PlayabilityReminderAck
Note there are many messages of the Microsoft.FSharp.Core.FSharpResult type. So I added a filter to my processor to filter out spans matching the condition data.OperationName.Contains("Microsoft.FSharp.Core.FSharpResult") (see the sketch below). Then the following traces were generated:
akka.msg.recv Nrk.Oddjob.Ps.PsGranittActors+GranittCommand+GetProgramRights
akka.msg.recv Nrk.Oddjob.Ps.PsGranittActors+GranittCommand+GetProgramActiveFiles
akka.msg.recv Nrk.Oddjob.Ps.PsTypes+PsPrepareArchiveCommand
akka.msg.recv Nrk.Oddjob.Ps.PsGranittActors+GranittCommand+GetArchiveFiles
akka.msg.recv Nrk.Oddjob.Ps.PsTypes+PsArchiveLookupCommand
akka.msg.recv Nrk.Oddjob.Ps.PsGranittActors+GranittCommand+GetPsPublishJobInfo
akka.msg.recv Nrk.Oddjob.Ps.PsTypes+PlayabilityReminderAck
So all the Microsoft.FSharp.Core.FSharpResult spans are gone, but the trace transaction is broken in their places.
Unfortunately this doesn't solve the issue.
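For reference, the OperationName-based filter described above amounts to something like the following - a minimal sketch, where the class name is mine and the condition is the one quoted in the comment:

using System.Collections.Generic;
using System.Diagnostics;
using OpenTelemetry;

public class ExcludeFSharpResultProcessor : CompositeProcessor<Activity>
{
    public ExcludeFSharpResultProcessor(IEnumerable<BaseProcessor<Activity>> processors) : base(processors)
    {
    }

    public override void OnEnd(Activity data)
    {
        // drop the auto-generated spans for FSharpResult reply messages;
        // this removes them from the export, but it also breaks the
        // parent-child chain at the points where they were dropped
        if (data.OperationName.Contains("Microsoft.FSharp.Core.FSharpResult"))
            return; // filter out

        base.OnEnd(data);
    }
}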
Hmm, reading the release notes for the latest OpenTelemetry: https://www.opensourceagenda.com/projects/opentelemetry-dotnet/versions
"CompositeProcessor will now ensure ParentProvider is set on its children (#3368)"
I am on an earlier version. Will need to try the latest one.
Nope, upgrading to OpenTelemetry version 1.4.0-alpha.2 didn't help. Using CompositeProcessor to filter out some messages still breaks the trace transaction in those places.
You can probably use a Processor to rename those FSharpResult<T> traces.
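A minimal sketch of what such a renaming processor could look like - the class name and the replacement span name below are illustrative, not from this thread. It relies on the OTel .NET SDK exporting Activity.DisplayName as the span name, and the processor must be registered before the export processor so the rename happens before the span is exported:

using System.Diagnostics;
using OpenTelemetry;

public class RenameFSharpResultProcessor : BaseProcessor<Activity>
{
    public override void OnEnd(Activity data)
    {
        // shorten the noisy generic type name in the span name
        if (data.DisplayName.Contains("Microsoft.FSharp.Core.FSharpResult"))
            data.DisplayName = "akka.msg.recv FSharpResult"; // illustrative replacement
    }
}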
@Aaronontheweb names are not a problem. I chose FSharpResult as an example of message types to try to filter out, and it broke the trace transaction. If I instead filter out spans tagged with "Akka.Cluster.Sharding", the transaction is also broken. As long as it's impossible to retain the span hierarchy when excluding certain spans, this is of little use for composing distributed traces.
The more I look into it, the more I doubt anything can be altered using CompositeProcessor. It works with activities that are already set up, and there's not much that can be altered on them. For example, although Activity has a SetParentId method, the documentation states that "This method should only be used before starting the Activity object. This method has no effect if you call it after the Activity object has started".
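For example, even attempting something like this inside a processor's OnEnd is a no-op - a sketch only; the parent id is the W3C example traceparent value, not a real span:

public override void OnEnd(Activity data)
{
    // by the time a processor sees the Activity it has already started,
    // so re-parenting it to bridge a filtered-out span has no effect
    data.SetParentId("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
    base.OnEnd(data);
}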
Living with some of the noise from Phobos might be the best option in this case then, unfortunately.
I'm looking into two approaches here, revisiting this issue after some months away from it:
The problem we have with the "hiding" approach in Sharding is the buffering inside the ShardRegion actors during remote deliveries while shards are being allocated - that's going to require the span data to be stored somewhere (which we do via automatically created spans), since the actual delivery isn't a single-shot operation. I have some ideas for working around that, but this is the fundamental challenge.
The "compression" approach would work after the fact - I'd have to buffer groups of unprocessed spans that all are all gathered locally on a single node and flatten them into a single operation. I haven't looked at how feasible that is exactly but I think it might actually be simpler to implement and will probably reduce the performance impact on the sharding system.
I believe our solution to https://github.com/petabridge/phobos-issues/issues/69 addresses this too - but if that solution doesn't quite work we have some more ideas for hardening it.
I believe this is a rather specific question related to manual span creation, so I booked an hour with @Aaronontheweb to demonstrate and discuss this issue. I will briefly describe it here.
Setup: We have an Akka cluster where incoming requests (via RabbitMQ) spawn transactions consisting of a series of messages sent between several cluster nodes using cluster sharding. We took over trace span creation by setting create-trace-upon-receive to "off". We are using the latest Phobos release (2.1.0 beta1).
Issue: Every shard message adds 2 additional short spans (duration less than 1ms) to the trace transaction.
Example:
Span 1. A queue message is fetched from a RabbitMQ queue. We start a new activity. The message is sent to a handler via cluster sharding.
Span 2. A new span is added with actorType "Akka.Cluster.Sharding.ShardRegion" and actor path "/system/sharding/mediasetProxy" (this is a shard proxy).
Span 3. A new span is added with actorType "Akka.Cluster.Sharding.ShardRegion" and actor path "/system/sharding/mediaset" (this is the actual shard, not the proxy).
Span 4. The message is received by the shard message handler, which creates its own custom trace span.
A full trace transaction typically contains 40-50% auto-generated spans from shard messages (spans 2 and 3 above). I checked one such transaction: it had 89 trace spans, and 40 of them were auto-generated shard-message spans with the name "akka.msg.recv XXX", where XXX is the name of a shard message type.
Prior to the release of Phobos 2.1, no such messages were present in a trace, but there was a bug in the implementation and sending shard messages broke the trace transaction. The new version fixed this problem, and we now manage to compose long trace transactions from the spans we create ourselves, but the ability to suppress the auto-generated "akka.msg.recv XXX" spans from the trace would make it more compact and easier to understand.