Adding support for end-to-end activity tracing

kzu commented 4 years ago

When StreamJsonRpc is used to power microservices-based architectures, the need for rich (end-to-end) tracing and activity propagation, to enable better troubleshooting and monitoring becomes quite important. This has been an explicit focus in .NET/ASP.NET Core 3.0 and with the brand-new W3C Trace Context recommendation there's an opportunity to introduce this in StreamJsonRpc in a standards-compliant way.

The goal of a built-in activity tracing support would be to enable rich tracing that can correlate calls across multiple remote invocations. This could then be used, for example, in System.Diagnostics.TraceSource for TraceTransfer/Start/Stop which model activities, or EventSource.SetCurrentThreadActivityId to do activity-based logging in ETW traces, or the newer Activity in .NET Core that has built-in support for the W3C standard.

Currently, there doesn't seem to be any specific support for propagating activity identifiers across RPC calls (just a request identifier). One option would be to leverage the fact that the request id does not need to have a specific format and therefore could accomodate a proper W3C trace context.

Attempting to add this support from outside of StreamJsonRpc (in a spike) turned out to be very tricky/hacky and not very satisfactory, which is why I think it's best if this was supported built-in.

This support would allow tools that can properly correlate and visualize traces to provide improved diagnostics like the Service Trace Viewer in this sample screenshot from a spike:

AArnott commented 4 years ago

StreamJsonRpc already raises TraceSource and EventSource events including with Start/Stop semantics so that PerfView can show a request->response as an activity. Does what you're proposing go beyond that?

kzu commented 4 years ago

How can that be used to correlate across invocations across processes that can be happening simultaneously? I mean precise correlation, not just heuristic based on timing...

AArnott commented 4 years ago

I don't think that's possible today. Are you proposing adding fields to the JSON-RPC messages to carry the correlation id?

kzu commented 4 years ago

I know it's not possible, which is why I created this issue 😉. The System.Diagnostics.Activity.Id basically implements the scenarios in the W3C Trace Context recommendation so that activities (or operations) can properly be tracked across microservices invocations. They have an ActivityUserGuide that explains (in an HTTP-based scenario) how it works and how it's used, but I think the same concept (if not the implementation? or part of it?) applies to non HTTP-based distributed services, such as StreamJsonRpc.

So, we could borrow the entire implementation as-is from the Activity class itself (it's NS2 compatible), but (say) scope it to just an augmentation mechanism for the request ID and nothing more. Or, we may decide that both the Id and (correlation) Context are useful (see the correlation protocol) or maybe even the tags and baggage features. I don't think those add as much low-hanging-fruit as the consistent, standard and auto-propagating request ID, though.

kzu commented 4 years ago

So, bottom-line: my proposal is to just incorporate the hierarchical request ID propagation and reuse the existing request ID property to do so. Then TraceSource/EventSource/DiagnosticSource(Activity) events could be automatically emitted to light up activity tracing for all three APIs.

BertanAygun commented 4 years ago

With the assumption that we want this to be transparent to users of proxies, is there a list of activity ids we need to set/unset so that we cover TraceSource, EventSource and other loggers. The ones I found were:

https://docs.microsoft.com/en-us/dotnet/api/system.diagnostics.correlationmanager.activityid?view=netcore-3.1 for TraceSource
https://docs.microsoft.com/en-us/dotnet/api/system.diagnostics.tracing.eventsource.setcurrentthreadactivityid?view=netcore-3.1 for EventSource. Couldn't find any documentation implying these 2 work together.
For ILogger, I also couldn't find anything to set the correlation id or activity id.

kzu commented 4 years ago

I think we should also cover the new-ish DiagnosticSource and its Activity, since it seems it's the one being most up-to-date with regards to standards (W3C) and community efforts (such as OpenTelemetry). This could be useful as a guideline for the implementation: aspnetcore diagnostics+activity.

I think it would be more important to support DiagnosticSource/Activity than ILogger...

AArnott commented 4 years ago

This work is starting in my trace-context branch.

kzu commented 4 years ago

It's great to see this happening. .NET5 is a big jump forward in terms of standards-based tracing and it's great that a key component in the VS ecosystem will also be onboard with that 👍

AArnott commented 4 years ago

I have TraceSource support built up. The user has to come up with their own value for parentid. There's no public API in .NET Framework itself or StreamJsonRpc yet, to track what would be the value for tracestate. I could create my own AsyncLocal in StreamJsonRpc but as it should be shared across many libraries, that's not the best place to define it. The best place may be where it already is: Activity.TraceStateString, but that would involve loading another assembly. It may be worth it. The other downside of using the Activity API is it's less efficient: it allocates several strings as part of parsing/rendering the IDs that my own implementation avoids. That too may not be significant enough to avoid reuse of the public API.

That leaves one more concern: how to reconcile Trace.CorrelationManager.ActivityId with Activity.Current ? Should we embrace one and altogether forget the other? Neither API offers events that would allow us to 'fix up' the other one to keep it in sync. If we have to pick one, I'd prefer to pick TraceSource since we already use that everywhere and it avoids the assembly load. But then, Activity is clearly newer and may matter more in the future. Ah, dilemmas.

kzu commented 4 years ago

Since there's no currently existing tooling that allows you to make sense of the current ActivityId, I'd say drop it and see who complains ;). I doubt there's much usage (that can be checked via telemetry, btw). If usage is low/non-existent, there's no dilemma: go Activity all the way. There's going to be much more tooling around that from first and third parties, so I'd say it's worth sacrificing a bit of optimizations (which I'm sure will eventually come to Activity too, since it's a key part of ASP.NET core E2E tracing too).

Just my 2cents :)

AArnott commented 4 years ago

Thanks for your thoughts, @kzu. I'm leaning toward Activity support as well, but am working for some consensus within VS. In the meantime, I pushed a big change last night to my branch that puts the decision in the consumer's hands: StreamJsonRpc now has a plugin architecture where the JsonRpc instance owner can decide whether to go with CorrelationManager or Activity, or even a 3rd approach that they code up themselves.

kzu commented 4 years ago

Taking the decision out of the library seems like a good way to avoid having to mandate one over the other 👍

microsoft / vs-streamjsonrpc

Adding support for end-to-end activity tracing #470