microsoft / service-fabric-issues

This repo is for the reporting of issues found with Azure Service Fabric.

ASP.NET Core 2.1 Service Offline issue #1097

Closed francescocristallo closed 6 years ago

francescocristallo commented 6 years ago

I updated my application to ASP.NET Core 2.1, and after a couple of hours all my services were offline.

SDK 3.1.283, current Fabric version 6.2.283.9494

Locally everything works, but I was unable to complete the entire migration, as posted here

I have a stateless service running on 9 F8SV2 machines. Service Fabric Explorer shows no errors or warnings, but in Application Insights I can see that only 7 machines were in use, then after 30 minutes only 3, then 0, and everything went offline.

[image: live]

On the server, in Event Viewer, I see this warning, and the CPU is idle:

[image: event]

I think this is a compatibility issue with ASP.NET Core 2.1. The same code reverted to ASP.NET Core 2.0 works without any problem.

Thoughts?

francescocristallo commented 6 years ago

If I reboot the nodes from Service Fabric Explorer, they come back up for 20 minutes, then suddenly and randomly disappear again.

masnider commented 6 years ago

If they are exiting with exceptions or below target replica set size, then SFX will be complaining. If they are somehow getting deleted then that could definitely result in the behavior you're describing. The fact that they're getting respawned when you restart the machine means they're probably still around.

What are you trying to show with the "Engine" circled on the left image? If that process is there then usually the service itself is still running inside it.

Can you share:

-How you're creating the services

-The current service description and health

-SFX showing the service's status

-How many instances are shown

-Whether you've got any autoscale policies defined

masnider commented 6 years ago

Hi @francescocristallo, is there any way you can provide that additional information? Alternatively, I see that Loek (one of our MVPs) replied on GitHub. Was his hint able to help you as well?

francescocristallo commented 6 years ago

Sorry for the late reply. I ran another test yesterday with a second cluster alongside my production one: I updated to ASP.NET Core 2.1 and switched all the traffic to the new cluster. Same problem: after a while under heavy load the APIs stop responding, with no change other than the migration to ASP.NET Core 2.1.

I tried using Libuv instead of the new 2.1 Sockets transport, with the same result. I also detached the code from Service Fabric by publishing the APIs to an Azure Web App, and I didn't see any slowdown or other problem. In reply to the previous questions:

-Service Fabric Explorer does not show any error during the live test. The only error I saw came after the test, when deleting the application. Whatever is going on, the service is stuck both for receiving live calls and for deletion.

[image: capture]

-In the original post, the circled Engine process means the service is still there, but its CPU usage is 0% when it should be 30% or more. So it is still around but not accepting any requests.

-The cluster was created via the Azure portal.

-SFX shows all the correct services running, with no errors at all.

-Autoscale is in place, but during my test the CPU never went over 30%, and there are 5 F8SV2 machines handling around 2000 requests per second to the ASP.NET Core API.

In the EventLogs on the server I see these Warnings multiple times:


-LookupAccountSidW failed. Result=0x80070534

-Activate: Activate:MyAppType_App3:EnginePkg@61ff2931-03d3-43af-846c-1f18177a76fb@fbeb71ec-fa7c-4e44-8638-8be7d19477d2:1.0:1.0:131761997448675030, ErrorCode=FABRIC_E_OBJECT_CLOSED, RetryCount=0

-MyApp_App3:EnginePkg@61ff2931-03d3-43af-846c-1f18177a76fb@fbeb71ec-fa7c-4e44-8638-8be7d19477d2: End(Setup->EndCleanupServicePackageEnvironment due to error FABRIC_E_OBJECT_CLOSED): error 0x80070002

-End(ActivateServicePackageInstance): AppId=MyAppType_App3, AppVersion=1.0 ServicePackageName=EnginePkg, ServicePackageActivationContext=61ff2931-03d3-43af-846c-1f18177a76fb, ServicePackageVersionInstance=1.0:1.0:131761997448675030, Error=FABRIC_E_OBJECT_CLOSED

-End(OpenVersionedServicePackage): Id=MyAppType_App3:EnginePkg@61ff2931-03d3-43af-846c-1f18177a76fb@fbeb71ec-fa7c-4e44-8638-8be7d19477d2:131762525192507402, Version=1.0:1.0:131761997448675030, ErrorCode=FABRIC_E_OBJECT_CLOSED

-End(SetupPackageEnvironment): Id=MyAppType_App3:EnginePkg@61ff2931-03d3-43af-846c-1f18177a76fb@fbeb71ec-fa7c-4e44-8638-8be7d19477d2, Version=1.0:1.0:131761997448675030, ErrorCode=FABRIC_E_OBJECT_CLOSED

-Failed to remove enpoint resource file=D:\SvcFab_App\MyAppType_App3\EnginePkg.fbeb71ec-fa7c-4e44-8638-8be7d19477d2.Endpoints.txt. Error=0x80070002. NodeVersion=6.2.301.9494:1:131759203060167149.

-client-10.0.0.5:19000 : connect failed, having tried all addresses

-client-10.0.0.5:19000/10.0.0.5:19000: error = 2147943625, failureCount=27. Filter by (type~Transport.St && ~"(?i)10.0.0.5:19000") to get listener lifecycle. Connect failure is expected if listener was never started, or listener/its process was stopped before/during connecting.
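For readers decoding these warnings: the numeric values look like Win32 errors wrapped in HRESULTs. The following is a minimal Python sketch, not part of any Service Fabric tooling. The helper name `decode_hresult` and the small lookup table are hypothetical and cover only the three codes quoted above, with names taken from winerror.h. It shows what each code stands for, not what caused it.

```python
# Decode the numeric error codes quoted in the warnings above.
# An HRESULT packs a failure bit, a facility, and a 16-bit code;
# facility 7 (FACILITY_WIN32) wraps an ordinary Win32 error code.

FACILITY_WIN32 = 7

# Illustrative lookup limited to the codes seen in this thread.
WIN32_NAMES = {
    2: "ERROR_FILE_NOT_FOUND",
    1225: "ERROR_CONNECTION_REFUSED",
    1332: "ERROR_NONE_MAPPED",
}


def decode_hresult(value):
    """Split a 32-bit HRESULT into (failed, facility, code, name)."""
    value &= 0xFFFFFFFF                 # normalise to unsigned 32 bits
    failed = bool(value >> 31)          # top bit set means failure
    facility = (value >> 16) & 0x1FFF   # same mask as HRESULT_FACILITY
    code = value & 0xFFFF               # same mask as HRESULT_CODE
    if facility == FACILITY_WIN32:
        name = WIN32_NAMES.get(code, "unknown Win32 code")
    else:
        name = "not a Win32 facility"
    return failed, facility, code, name


# The three values that appear in the event log entries above.
for raw in (0x80070534, 0x80070002, 2147943625):
    failed, facility, code, name = decode_hresult(raw)
    print(f"0x{raw:08X}: failed={failed} facility={facility} code={code} ({name})")
```

Read this way, 0x80070534 is "no mapping between account names and security IDs", 0x80070002 is "file not found" (matching the failed removal of the Endpoints.txt file), and 2147943625 (0x800704C9) is "connection refused", which fits the connect failures against 10.0.0.5:19000.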


I think there is something wrong in the Main() method in Program.cs where ServiceRuntime and ServiceEventSource are initialized.

The ASP.NET Core 2.1 migration instructions require changes in Program.cs that I wasn't able to make, as per the Stack Overflow question above. I think I'll wait for an official Service Fabric template for ASP.NET Core 2.1 before testing again. I also think the team should investigate this: probably one of the breaking changes in ASP.NET Core 2.1 affects the Service Fabric hosting environment. Unfortunately this only shows up under medium to heavy load; locally, and with a small number of calls, everything works.

francescocristallo commented 6 years ago

Could this be related to the Windows Server 2016 fix? https://blogs.msdn.microsoft.com/azureservicefabric/2018/08/03/os-update-required-on-windows-clusters/

francescocristallo commented 6 years ago

The problem still persists as of today, September 17, using version 6.3.176.9494. Is anyone using ASP.NET Core 2.1 successfully? Nodes get stuck as soon as the APIs receive some traffic. The same code switched back to ASP.NET Core 2.0 works normally.

francescocristallo commented 6 years ago

I was finally able to resolve this. It was a combination of the Windows Server 2016 fix above, plus something internal to ASP.NET Core 2.1. After updating the servers and deploying ASP.NET Core 2.1.4 (previously it was 2.1.2), everything is stable and fast.