microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/
MIT License

SFRPStreamChannel Exceptions in Upgrade Service #166

Open charstar opened 6 years ago

charstar commented 6 years ago

I've been periodically seeing warnings and errors in the upgrade service similar to:

Error event: SourceId='UpgradeService.Primary', Property='SFRPStreamChannel'. Exception encountered: System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.

I previously contacted Azure support about this and was advised to ensure firewall rules are in place for 168.63.129.16 and to allow outbound traffic to the westus.servicefabric.azure.com IP addresses. However, this doesn't seem to have fixed the issue.

Is there anything else I'm missing here? Thanks.

mchiriac-msft commented 6 years ago

Can you provide the following details?

charstar commented 6 years ago

Cluster ID: f4dcb896-4265-4c68-a4ec-a0d1f8096c21
Location: West US
Time: Most recently, 2018-07-02 1250-1310 PDT (1950-2010 UTC)

Thanks for your help!

mchiriac-msft commented 6 years ago

This is a known issue that we are tracking.

The SFRPStreamChannel communication between UpgradeService and the ServiceFabric ARM resource provider (SFRP) is monitored. If continuous failures are detected, an error health event is raised. The detection logic is too sensitive and needs rework. We will release a fix for it with the ServiceFabric 6.3 CU1 runtime. The release date has not been set yet.
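To illustrate what "too sensitive" can mean here, the following is a minimal sketch of a consecutive-failure detector. It is not the UpgradeService implementation: `ChannelHealthMonitor`, `failure_threshold`, and `worst_state` are hypothetical names, and the threshold values are assumptions chosen only to show how a short network blip can trip a low threshold.

```python
from dataclasses import dataclass


@dataclass
class ChannelHealthMonitor:
    """Hypothetical monitor: reports Error after N consecutive failed calls."""
    failure_threshold: int         # assumed tunable, not a real Service Fabric setting
    consecutive_failures: int = 0

    def record(self, succeeded: bool) -> str:
        """Record one call outcome and return the resulting health state."""
        if succeeded:
            self.consecutive_failures = 0  # any success clears the streak
        else:
            self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            return "Error"
        return "Ok"


def worst_state(threshold: int, outcomes: list[bool]) -> str:
    """Replay a sequence of call outcomes and return the worst state seen."""
    monitor = ChannelHealthMonitor(failure_threshold=threshold)
    states = [monitor.record(outcome) for outcome in outcomes]
    return "Error" if "Error" in states else "Ok"


# Two dropped calls in a row, followed by recovery.
blip = [True, False, False, True, True]
print(worst_state(2, blip))  # low threshold: the blip is reported as an error
print(worst_state(5, blip))  # higher threshold: the blip is tolerated
```

With the low threshold, the same transient blip surfaces as an error health event even though the channel recovers on the next call, which matches the symptom described in this thread.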

You can ignore the health error report as it is transient and should not have an impact on the management operations performed for application and cluster. I checked the cluster and applications and they were all healthy.

Your cluster is pinned to ServiceFabric 6.1.

UserSelectedCodeVersion = 6.1.456.9494
UpgradeMode = Manual

To take the fix, you will have to update UserSelectedCodeVersion to the 6.3 CU1 version and perform an ARM cluster deployment. The alternative is to use the automatic upgrade mode, in which case SFRP will upgrade the cluster automatically to take the fix when it becomes available.
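As a rough sketch of the two options, the snippet below builds the relevant fragment of a cluster resource's `properties` for an ARM template. It assumes that `upgradeMode` and `clusterCodeVersion` are the `Microsoft.ServiceFabric/clusters` template properties corresponding to UpgradeMode and UserSelectedCodeVersion; please check the ARM template reference before relying on these names. `cluster_upgrade_properties` is a hypothetical helper, and the version shown is the one this cluster is currently pinned to, not the version that carries the fix.

```python
import json
from typing import Optional


def cluster_upgrade_properties(upgrade_mode: str,
                               code_version: Optional[str] = None) -> dict:
    """Build the upgrade-related part of a cluster's ARM `properties` (assumed names)."""
    if upgrade_mode not in ("Manual", "Automatic"):
        raise ValueError("upgrade_mode must be 'Manual' or 'Automatic'")
    props = {"upgradeMode": upgrade_mode}
    if upgrade_mode == "Manual":
        # Manual mode pins the cluster, so an explicit runtime version is required.
        if not code_version:
            raise ValueError("Manual mode needs an explicit code version")
        props["clusterCodeVersion"] = code_version
    return props


# Option 1: stay in manual mode and pin a version. Replace the value with the
# runtime version that contains the fix once it has been published.
print(json.dumps(cluster_upgrade_properties("Manual", "6.1.456.9494"), indent=2))

# Option 2: switch to automatic upgrades and let SFRP choose the version.
print(json.dumps(cluster_upgrade_properties("Automatic"), indent=2))
```

Either fragment would then be merged into the cluster resource in the template and applied with a normal ARM deployment.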

jagilber commented 6 years ago

@charstar I have an update from the product group about this fix. The fix did not make 6.3 and is now scheduled to be released in Service Fabric 6.4, tentatively at the end of Q3.

The error can be cleared by failing over the primary for 'UpgradeService'. This, however, is a temporary solution, and the error can still reappear after some time because our logic for detecting these intermittent communication failures is too sensitive.

pksorensen commented 6 years ago

I am seeing it too. What's the impact? Does it matter?

m1stegmann commented 5 years ago

We get the following exception every two or three days. How should we deal with this error?

SF Version: 6.3.176.9494
Cluster: five-node production cluster with D13V2 instances

Error event: SourceId='UpgradeService.Primary', Property='SFRPStreamChannel'. Exception encountered: System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Fabric.UpgradeService.WrpStreamChannel.d30.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Fabric.UpgradeService.WrpStreamChannel.d19`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Fabric.UpgradeService.WrpStreamChannel.d18.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Fabric.UpgradeService.ClusterCoordinator.d16.MoveNext()

mchiriac-msft commented 5 years ago

The issue should be transient. It will not impact the deployed cluster or applications. The health error report should clear automatically.

UpgradeService is timing out on its calls to the ServiceFabric resource provider in ARM to collect the stream channel request. As a result, the Azure Portal or the cluster's Service Fabric Explorer (SFX) might take longer to show the runtime view of the cluster.

@pksorensen/@m1stegmann do you still see the issue on your clusters? If so, can you provide the cluster ARM resource ID and the region?

pksorensen commented 5 years ago

@mchiriac-msft Nope, I don't see it any more.

However, I did have the problem that my ARM deployment did not work while the error was ongoing.

As soon as the error was gone, I could do an ARM deployment of the application again.

mchiriac-msft commented 5 years ago

@pksorensen can you provide an ARM correlation ID for the deployment that failed while the error was ongoing? I'll look to see what happened.