microsoft / ApplicationInsights-dotnet

ApplicationInsights-dotnet
MIT License
565 stars 287 forks source link

RichPayloadEventSource needs to be aware if a ITelemetry object has been sampled out #640

Closed juliusl closed 2 years ago

juliusl commented 6 years ago

WRT TelemetryClient.Track(),

https://github.com/Microsoft/ApplicationInsights-dotnet/blob/c06a87bfb6ac4d28129f6e3da7c6c0f7e02d623c/src/Microsoft.ApplicationInsights/Managed/Shared/TelemetryClient.cs#L404-L420

Since the call to RichPayloadEventSource is outside of the Telemetry Processor chain, there is no way for consumers of the event source to know if a piece of telemetry has been sampled out. The only workaround for this is for consumers of the event source to recompute the hash to compare it against the SamplingRate in the ITelemetry object.

This basically doubles the compute across the board if consumers want this information.

As a solution I propose we save state of whether the telemetry object has been sampled out.

Either:

  1. Set SamplingPercentage to 0 or -1 if the object has been sampled out (Sergey's idea).
  2. Add a seperate boolean.

This would go here somewhere:

https://github.com/Microsoft/ApplicationInsights-dotnet/blob/c06a87bfb6ac4d28129f6e3da7c6c0f7e02d623c/src/ServerTelemetryChannel/Managed/Shared/SamplingTelemetryProcessor.cs#L187-L205

Are there any issues with this?

@northtyphoon Do you have any additional comments?

northtyphoon commented 6 years ago

In Sanpshot Debugger test, we see sometimes it took a lot of effort to take and upload a snapshot but it’s useless because the exception event itself is sampled out.

So we also need to define a protocol between SamplingTelemetryProcessor and SnapshotCollectorTelemetryProcessor to coordinate the sampling. One possible solution top of my head is check the exception data and skip sampling process if it has snapshotcontext.

SergeyKanzhelev commented 6 years ago

@juliusl I've been thinking about it more and wonder what's the main use case for you?

You should not assume there is a single sampling processor. There maybe few of them chained for different event types. There are also may be a filtering processor that just throw telemetry item out. And there is no control over it in out current design.

Other thing to watch for is multiple sinks. I believe Rich EventSource will pick up changes made by default sink only. Just something to watch for with this design.

@northtyphoon It’s easy to force exception telemetry to be collected. Just set SamplingPercentage to 100 and it will not be sampled.

Other question is that it may not be that useful to have an exception without related request and other related telemetry. Especially in case of distributed traces. Can you use the same sampling algorithm as used in SDK? Maybe with more aggressive settings.

zakimaksyutov commented 6 years ago

@juliusl, DJB2 hash is not crypto hash and as a result it is extremely fast. I don't think there is a problem of computing it as many times as we need to.

Note, knowing whether event was sampled out or not might be not enough when we're talking about distributed traces. For instance, on FrontEnd there is not much load and sampling is off. But on BackEnd there is sampling @ 10%. Since in UX we will give preference to samples with higher probability to be sampled in on all components/roles it would be great if gave the same preference for Service Profiler/Snapshot Debugger.

juliusl commented 6 years ago

Thanks for the replies,

@SergeyKanzhelev My main use case is to pick the "best" events from the stream of events coming from AI. Part of my criteria of "best" is that it ends up in the portal. Taking into account the back end sampling, If I can have even a 90% confidence that a event makes it to the portal, then that makes that event interesting. After I know something is interesting I can take it from there. Taking into account that there can be more than one filtering/sampling processor then I propose we should have an "excluded" boolean on the ITelemetry interface that any of those processors can set. I assume in the pipeline there is a .Reject() method when a component in the pipeline decides to toss an event?

@zakimaksyutov I'm mostly fine with shipping with the additional hashing but I think we should make improvements to the pipeline overall. Since the SDK can change hashing at any time that makes it makes it challenging for consumers to stay in sync. Adding additional information to an event before it makes it downstream seems trivial enough that we should just go ahead and do it. Also, even though it is fast it's still double the compute on the client's machine so if we can avoid doing any work I think that is better long term.

So to wrap up my new proposal is that we add:

  1. A boolean to indicate if a piece of telemetry has been excluded in submission to AI telemetry ingestion.
  2. A reject method in the pipeline if it doesn't already exist or some similar mechanism.
SergeyKanzhelev commented 6 years ago

There is no Reject in current design. And it's border line impossible to get this flag work 100% reliably. So I'd suggest try some different approaches

juliusl commented 6 years ago

@SergeyKanzhelev That is unfortunate. I saw that there was a Next() in the pipeline so I was hoping for a Reject() as well.

I'm still not following why it would be 100% impossible but I guess I could review the pipeline code a bit more to get on the same page.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 300 days with no activity. Remove stale label or comment or this will be closed in 7 days.