microsoft / service-fabric-issues

This repo is for the reporting of issues found with Azure Service Fabric.

CancellationToken for RunAsync not getting cancelled by SF #1625

Closed by Jaans 4 years ago

Jaans commented 4 years ago

Recently, 2 of our 300+ services have stopped having their cancellation tokens cancelled when Service Fabric shuts down the service (whether because of a move or an instance removal).

They all used to work just fine (i.e. the token passed to RunAsync(token) was always cancelled), so we suspect that we introduced a change somewhere that is causing the issue. Unfortunately, we are struggling to identify what that change might be, even for one of the services that is very simple.

For example, if we remove a service instance, it no longer terminates "gracefully" like all the others; it just sits there. Eventually (after maybe 2 to 5 minutes) SF seems to "hard terminate" it, but the service still does not get a cancellation for its token (at least as far as we can tell).

It almost seems like something in our code is holding on to a WcfCommunicationClient / Listener, and SF doesn't cancel the token because it thinks the service is making an active call. This is conjecture at this point.

Attaching a debugger only shows that there is a single worker thread waiting for an I/O completion (hence my comment above about something holding on).

We have this happening consistently on all clusters (OneBox/DevClusters and Test / Prod clusters).

I was hoping to get some guidance on how we might use ETW traces or other information to track down when the token should have been cancelled and what is preventing SF from cancelling it.

Thoughts?

Thanks.

Jaans commented 4 years ago

Uhm... so apparently the cancellation token is indeed being cancelled.

However, our problem remains: the service process is still running. I can confirm that the operation cancelled exception is indeed being thrown and "bubbles" out of the RunAsync() method.
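For readers who want to check the same thing in their own service, here is a minimal, language-neutral sketch in Python of what "the cancellation bubbles out" looks like. It is not Service Fabric code: `run_async` and `host` are hypothetical stand-ins for a service's run loop and for the runtime that cancels it, and the log list exists only to make the order of events visible.

```python
import asyncio


async def run_async(log):
    """Hypothetical stand-in for a service's long-running loop."""
    try:
        while True:
            await asyncio.sleep(0.01)  # simulated unit of work
    except asyncio.CancelledError:
        log.append("cancelled")  # the loop observed the cancellation
        raise  # re-raise so the cancellation propagates to the host


async def host():
    """Hypothetical stand-in for the runtime that starts and cancels the loop."""
    log = []
    task = asyncio.create_task(run_async(log))
    await asyncio.sleep(0.05)  # let the loop run for a moment
    task.cancel()  # request cancellation
    try:
        await task
    except asyncio.CancelledError:
        log.append("bubbled out")  # the host saw the cancellation propagate
    return log
```

If both entries are logged, the run loop itself is honoring cancellation, and anything still keeping the process alive has to be looked for elsewhere.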

Just can't figure out why the process isn't being terminated. I'm hoping it's something silly.

Again, if there is any guidance on how we might use the SF tracing or other resources to help identify that would be great.

Jaans commented 4 years ago

Apologies... time to eat some humble pie. The issue had nothing to do with the RunAsync() portion. In fact, it was another service in the same package that held on to an open, in-progress call that never completed. Fixing that resolved our issue entirely.
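The thread does not say how the stuck call was fixed. One common way to guard against a call that never completes is to bound it with a timeout, sketched below in Python as a language-neutral illustration. The names `never_completes` and `call_with_timeout` are hypothetical, and the "timed out" sentinel is only there to keep the example short.

```python
import asyncio


async def never_completes():
    """Hypothetical stand-in for a remote call whose reply never arrives."""
    await asyncio.Event().wait()  # this event is never set, so this waits forever


async def call_with_timeout(call, timeout_seconds):
    """Await `call`, but give up (and cancel it) after `timeout_seconds`."""
    try:
        return await asyncio.wait_for(call, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the pending call at this point.
        return "timed out"


def demo():
    # Without the timeout this would hang indefinitely.
    return asyncio.run(call_with_timeout(never_completes(), 0.05))
```

In a .NET service, the analogous idea would be to race the call against `Task.Delay` with `Task.WhenAny`, or to rely on the communication client's own operation timeouts; which of these suits a given service is an assumption to validate against that service's code.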

JustinKaffenberger commented 4 years ago

@Jaans, when you say "same package", are you referring to the Application Package or the Service Package? (I'll assume the same service package for the time being.)

Jaans commented 4 years ago

@JustinKaffenberger, yes the same service package.