microsoft / ApplicationInsights-ServiceFabric

ApplicationInsights SDK for ServiceFabric projects
MIT License
63 stars 26 forks source link

Operation to stop does not match the current operation. #64

Closed Tadimsky closed 6 years ago

Tadimsky commented 6 years ago

I'm getting an error using the Service Remoting telemetry. It appears that the operation that is being stopped is invalid:

"message": "AI (Internal): Operation to stop does not match the current operation. Details: Telemetry Id '|1cafaa09-478a33c4492e47c3.a2a37cb4_' does not match current Activity '|1cafaa09-478a33c4492e47c3.a2a37cb4_1.'",

It looks like Application Insights is trying to stop a2a37cb4_ but it's currently running activity a2a37cb4_1. Notice the 1 at the end.

Any ideas why this would be happening? I'm using 2.1.1-beta1

SergeyKanzhelev commented 6 years ago

CC: @lmolkova

lmolkova commented 6 years ago

It seems like a known issue: https://github.com/Microsoft/ApplicationInsights-dotnet/issues/693 Which is likely a side effect of https://github.com/Microsoft/ApplicationInsights-dotnet-server/issues/723, that reproduces when debugging the app in Visual Studio.

If it is this issue, it should be fixed in Microsoft.ApplicationInsights.DependencyCollector 2.6.0-beta3 and Microsoft.ApplicationInsights.AspNetCore 2.3.0-beta1.

@Tadimsky could you please check versions of DependencyCollector and AspNetCore (and update them). Also, do you know if the issue reproduces locally under debug, or in production?

If it reproduces with latest versions, could you please check which if dependency telemetry was reported on the same machine immediately after this message? (unless you have very high-loaded service and this does not make sense?)

Could you please also share some of your telemetries for this operation?

Please use following Analytics queries

  1. requests | union dependencies | where id == "|1cafaa09-478a33c4492e47c3.a2a37cb4_"
  2. requests | union dependencies | where id == "'|1cafaa09-478a33c4492e47c3.a2a37cb4_1."
  3. requests | union dependencies | union traces | where operation_Id == "1cafaa09-478a33c4492e47c3.a2a37cb4"

Please remove any sensitive information like iKey (or anything else). You can also email this info to me.

Tadimsky commented 6 years ago

Hey @lmolkova this is happening for me in our Production services:

AI (Internal): Operation to stop does not match the current operation. Details: Telemetry Id '\ fac5592a-4fd80273c9a61f11.80081fb1_' does not match current Activity '\ fac5592a-4fd80273c9a61f11.80081fb1_1.'  

I'll send you an email with the traces.

lmolkova commented 6 years ago

@Tadimsky thanks for the info!

The error happens when telemetry we are going to track in StopOperation does not match current Activity. We have tracked such telemetries before (by mistake) and started to enforce this in the latest ApplicationInsights Base SDK. Otherwise, it leads to broken correlation.

So, what I see from the logs is:

Service Whiteboards API makes remoting call to SharesService (I’m omitting not interesting pieces)

Id Type OperationId ParentId
|fac5592a-4fd80273c9a61f11.3.1. dependency fac5592a-4fd80273c9a61f11 |fac5592a-4fd80273c9a61f11.3.
|fac5592a-4fd80273c9a61f11.80081fb1_1. request fac5592a-4fd80273c9a61f11 |fac5592a-4fd80273c9a61f11.3.

Eventually when something with Id |fac5592a-4fd80273c9a61f11.80081fb1_ is being reported, it could not, because Activity with |fac5592a-4fd80273c9a61f11.80081fb1_1 is not stopped.

Some strange things:

  1. Dependency has ParentId |fac5592a-4fd80273c9a61f11.3. – it has never been reported
  2. Request parent is |fac5592a-4fd80273c9a61f11.3. too! While dependency id is |fac5592a-4fd80273c9a61f11.3.1.
  3. Why request Id is fac5592a-4fd80273c9a61f11.80081fb11? It should be fac5592a-4fd80273c9a61f11.80081fb1

So my assumption is: Activities are created multiple times: e.g. when incoming request starts, we receive several event source events about it. So we attempt to start several nested activities. Same happens for outgoing requests. This could explain all above discrepancies.

According to @Tadimsky, each service is packed into it's own package and each instance runs on it's own machine, i.e. it does not look like SharedProcess SF issue.

/cc @brahmnes @yantang-msft

an4286sc commented 6 years ago

@lmolkova I'm getting this error message too in my current project. But only in production. Not while debugging locally. I'm using this telemetry client methods: TrackAction TrackTrace TrackException

No StartOperation or StopOperation.

The Application Insights message is: "AI (Internal): Operation to stop does not match the current operation." No further informations.

yantang-msft commented 6 years ago

@an4286sc What's the service fabric runtime and AI ServiceFabric version you are on? Are you using Reliable Service or Actors? Remoting V1 or V2? Did you by any chance use Shared Process hosting model in Servicefabric. The StartOperation and StopOperation will be called automatically when the request is coming to your service or when you make outgoing request.

Would you be able to share the project that can reproduce the issue? That would be the easiest way for us to trouble shoot.

Other than that, when you see the "AI (Internal): Operation to stop does not match the current operation.", it should give you the TelemetryId, e.g., '|fac5592a-4fd80273c9a61f11.80081fb1_', could you run analytic query like below, this will give us all telemetries happened during the request.

requests | union dependencies |
union traces | where operation_Id ==
"fac5592a-4fd80273c9a61f11"
an4286sc commented 6 years ago

@yantang-msft Thanks for your reply. The error occurs in the stateless serivce with remoting v2 and shared process hosting. We are using AI Service Fabric 2.1.1-beta1.

See the query results in the attached file. ai_stop_operation_failure.zip

yantang-msft commented 6 years ago

@an4286sc I do see discrepancies of the operation ids. And what's the version of Microsoft.ServiceFabric.Services.Remoting? Can you try exclusive process model and see if that solves the problem? I can try the shared process model later this week and see if I can reproduce the issue. If so, we will keep investigate in how to make it work with shared process model.

yantang-msft commented 6 years ago

@Tadimsky @an4286sc I have reproduced this issue and identified the root cause, it's happening for me after I upgraded Microsoft.ApplicationInsights.AspNetCore 2.3.0-beta2. Hopefully you're seeing this error for the same reason.

We are working on the fix, but for some process reason, it will take some time before we can release the fix. A temporary workaround would be downgrade the Microsoft.ApplicationInsights.AspNetCore to 2.2.1.

Tadimsky commented 6 years ago

@yantang-msft I am on 2.3.0-beta1, do you think this version is affected as well?

yantang-msft commented 6 years ago

@Tadimsky Yes, I confirm 2.3.0-beta1 is also affected.

Tadimsky commented 6 years ago

@yantang-msft is this expected to be fixed in 2.1.1? I'm still seeing the issue:

image

image

yantang-msft commented 6 years ago

@Tadimsky Yes, the bug as AI SF side is fixed. Does the huge decrease after 6/25 happened after you upgrade the AI SF package? It could be somewhere else unrelated to AI SF that is still causing this issue. If you can provide a reproduce, I can definitely take a look and see what's wrong.

Tadimsky commented 6 years ago

The huge decrease was due to some increased load we had to process for a few days.

This is total requests into the service: image

We did deploy the fix on 6/27 though - it does seem to be lower, but it's still occurring: image

The project is the same one I shared with you earlier. Is there something I can try to do to debug this?

yantang-msft commented 6 years ago

@Tadimsky can you run the query as you did last time and share the result? You can try remove the AI SF nuget package and see if this still happens. If so then it's could be some other AI package or if you manually called client.StartOperation but didn't call client.StopOperation properly.