Closed kzu closed 4 years ago
StreamJsonRpc already raises TraceSource and EventSource events including with Start/Stop semantics so that PerfView can show a request->response as an activity. Does what you're proposing go beyond that?
How can that be used to correlate across invocations across processes that can be happening simultaneously? I mean precise correlation, not just heuristic based on timing...
I don't think that's possible today. Are you proposing adding fields to the JSON-RPC messages to carry the correlation id?
I know it's not possible, which is why I created this issue 😉. The System.Diagnostics.Activity.Id basically implements the scenarios in the W3C Trace Context recommendation so that activities (or operations) can properly be tracked across microservices invocations. They have an ActivityUserGuide that explains (in an HTTP-based scenario) how it works and how it's used, but I think the same concept (if not the implementation? or part of it?) applies to non HTTP-based distributed services, such as StreamJsonRpc.
So, we could borrow the entire implementation as-is from the Activity
class itself (it's NS2 compatible), but (say) scope it to just an augmentation mechanism for the request ID and nothing more. Or, we may decide that both the Id
and (correlation) Context
are useful (see the correlation protocol) or maybe even the tags and baggage features. I don't think those add as much low-hanging-fruit as the consistent, standard and auto-propagating request ID, though.
So, bottom-line: my proposal is to just incorporate the hierarchical request ID propagation and reuse the existing request ID property to do so. Then TraceSource/EventSource/DiagnosticSource(Activity) events could be automatically emitted to light up activity tracing for all three APIs.
With the assumption that we want this to be transparent to users of proxies, is there a list of activity ids we need to set/unset so that we cover TraceSource, EventSource and other loggers. The ones I found were:
I think we should also cover the new-ish DiagnosticSource and its Activity, since it seems it's the one being most up-to-date with regards to standards (W3C) and community efforts (such as OpenTelemetry). This could be useful as a guideline for the implementation: aspnetcore diagnostics+activity.
I think it would be more important to support DiagnosticSource/Activity than ILogger...
This work is starting in my trace-context
branch.
It's great to see this happening. .NET5 is a big jump forward in terms of standards-based tracing and it's great that a key component in the VS ecosystem will also be onboard with that 👍
I have TraceSource
support built up. The user has to come up with their own value for parentid
.
There's no public API in .NET Framework itself or StreamJsonRpc yet, to track what would be the value for tracestate
. I could create my own AsyncLocal
in StreamJsonRpc but as it should be shared across many libraries, that's not the best place to define it. The best place may be where it already is: Activity.TraceStateString
, but that would involve loading another assembly. It may be worth it. The other downside of using the Activity
API is it's less efficient: it allocates several strings as part of parsing/rendering the IDs that my own implementation avoids. That too may not be significant enough to avoid reuse of the public API.
That leaves one more concern: how to reconcile Trace.CorrelationManager.ActivityId
with Activity.Current
? Should we embrace one and altogether forget the other? Neither API offers events that would allow us to 'fix up' the other one to keep it in sync. If we have to pick one, I'd prefer to pick TraceSource
since we already use that everywhere and it avoids the assembly load. But then, Activity
is clearly newer and may matter more in the future. Ah, dilemmas.
Since there's no currently existing tooling that allows you to make sense of the current ActivityId, I'd say drop it and see who complains ;). I doubt there's much usage (that can be checked via telemetry, btw). If usage is low/non-existent, there's no dilemma: go Activity all the way. There's going to be much more tooling around that from first and third parties, so I'd say it's worth sacrificing a bit of optimizations (which I'm sure will eventually come to Activity too, since it's a key part of ASP.NET core E2E tracing too).
Just my 2cents :)
Thanks for your thoughts, @kzu. I'm leaning toward Activity
support as well, but am working for some consensus within VS.
In the meantime, I pushed a big change last night to my branch that puts the decision in the consumer's hands: StreamJsonRpc now has a plugin architecture where the JsonRpc
instance owner can decide whether to go with CorrelationManager
or Activity
, or even a 3rd approach that they code up themselves.
Taking the decision out of the library seems like a good way to avoid having to mandate one over the other 👍
When StreamJsonRpc is used to power microservices-based architectures, the need for rich (end-to-end) tracing and activity propagation, to enable better troubleshooting and monitoring becomes quite important. This has been an explicit focus in .NET/ASP.NET Core 3.0 and with the brand-new W3C Trace Context recommendation there's an opportunity to introduce this in StreamJsonRpc in a standards-compliant way.
The goal of a built-in activity tracing support would be to enable rich tracing that can correlate calls across multiple remote invocations. This could then be used, for example, in System.Diagnostics.TraceSource for TraceTransfer/Start/Stop which model activities, or EventSource.SetCurrentThreadActivityId to do activity-based logging in ETW traces, or the newer Activity in .NET Core that has built-in support for the W3C standard.
Currently, there doesn't seem to be any specific support for propagating activity identifiers across RPC calls (just a request identifier). One option would be to leverage the fact that the request id does not need to have a specific format and therefore could accomodate a proper W3C trace context.
Attempting to add this support from outside of StreamJsonRpc (in a spike) turned out to be very tricky/hacky and not very satisfactory, which is why I think it's best if this was supported built-in.
This support would allow tools that can properly correlate and visualize traces to provide improved diagnostics like the Service Trace Viewer in this sample screenshot from a spike: