microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/
MIT License

'Unable to determine whether the application is installed on the cluster or not' errors occur too often in local cluster. #800

Open wminos opened 6 years ago

wminos commented 6 years ago

SDK Version: Microsoft Azure Service Fabric SDK - 3.0.472

It did not happen at first, but now it occurs more than once a day and I have to reset the cluster every time. I am mainly developing in the Visual Studio 2017 IDE, and I am not sure if this is a bug, but it is inconvenient for development.

wminos commented 6 years ago

I have already seen the issue at 'https://github.com/Azure/service-fabric-issues/issues/199'. I am working on Windows 10 and using the default anti-virus (Windows Defender). The C:\SfDevCluster folder was excluded from the scan.

abatishchev commented 6 years ago

Same here. SDK 3.0.472, both VS 2017 and 2015. Only resetting the local cluster helps, but it's annoying and time-consuming.

sathiathirumal commented 6 years ago

+1. Does resetting even work for you folks? For me resetting also fails. I am 99% sure this is because fabricDNSService.exe doesn't shut down. I am unable to kill it manually as well (access denied - pskill, procexp, nothing works). Only a PC restart solves it. Such a waste of time! MSFT, please fix!

mikkelhegn commented 6 years ago

Without knowing the reason the cluster gets stuck, this is what's going on, and maybe it can help you work around the problem until we know more.

Visual Studio (and the Local Cluster Manager) probes the cluster endpoint locally, if it's not responding you will see this error. So there's a good chance the local cluster is unresponsive, hence resetting the cluster helps.

Those of you who run into this issue, I would appreciate it if you could share trace files from the cluster (SfDevCluster\Log\Traces) - thanks.
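
For anyone who wants to check the "endpoint not responding" case described above before resetting, here is a minimal sketch of a local probe in Python. It is not what Visual Studio or the Local Cluster Manager actually runs; it only tests whether something accepts TCP connections. The port numbers 19000 and 19080 are an assumption (the usual defaults for a local development cluster); the thread itself does not name them, so adjust them to your setup.

```python
import socket

# Assumption: common defaults for a local development cluster.
# The thread does not state which ports are probed.
ASSUMED_ENDPOINTS = {"client connection": 19000, "HTTP gateway": 19080}


def endpoint_is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def probe_local_cluster(host: str = "localhost") -> dict:
    """Probe every assumed endpoint and report which ones accept connections."""
    return {
        name: endpoint_is_listening(host, port)
        for name, port in ASSUMED_ENDPOINTS.items()
    }


if __name__ == "__main__":
    for name, ok in probe_local_cluster().items():
        print(f"{name}: {'listening' if ok else 'not reachable'}")
```

A port that accepts connections does not prove the cluster is healthy, but a port that refuses them is consistent with the unresponsive cluster described above.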

abatishchev commented 6 years ago

Emptied C:\SfDevCluster\Log\Traces\, opened VS2015, pressed F5, got the same error "Unable to determine whether the application is installed on the cluster or not", closed VS, stopped the local cluster, zipped all newly created files, got this. Please let me know if these logs help or if I need to gather them again or something else. Happy to help!

abatishchev commented 6 years ago

Here's another exception that seems related to this issue:

Started executing script 'Publish-NewServiceFabricApplication'.
powershell -NonInteractive -NoProfile -WindowStyle Hidden -ExecutionPolicy Bypass -Command "[void](Connect-ServiceFabricCluster); Import-Module 'C:\Program Files\Microsoft SDKs\Service Fabric\Tools\PSModule\ServiceFabricSDK\ServiceFabricSDK.psm1'; Publish-NewServiceFabricApplication -ApplicationPackagePath '...\PublishProfiles\..\ApplicationParameters\Local.1Node.xml' -ApplicationParameter @{_WFDebugParams_='[{ ... }]'} -Action Create -SkipPackageValidation:$true -ErrorAction Stop"
Creating application...
New-ServiceFabricApplication : Could not ping any of the provided Service Fabric gateway endpoints.
At C:\Program Files\Microsoft SDKs\Service 
Fabric\Tools\PSModule\ServiceFabricSDK\Publish-NewServiceFabricApplication.ps1:279 char:9
+         New-ServiceFabricApplication -ApplicationName $ApplicationNam ...
+         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (Microsoft.Servi...usterConnection:ClusterConnection) [New-ServiceFabr 
   icApplication], FabricTransientException
    + FullyQualifiedErrorId : CreateApplicationInstanceErrorId,Microsoft.ServiceFabric.Powershell.NewApplication

Finished executing script 'Publish-NewServiceFabricApplication'.
Time elapsed: 00:02:06.7432516

BrainSlugs83 commented 6 years ago

I just updated to latest and now I get this every time I hit F5 in Visual Studio 2017. -- I have to restart the local cluster every time I want to run my project, it's super annoying.

dbreshears commented 6 years ago

@BrainSlugs83 , when this occurs, are you able to bring up the Service Fabric Explorer through the tray icon "Manage Local Cluster" and check the health state of the cluster?

Does this also reproduce for you if you use a different Application Debug Mode? I am assuming this property is currently set to the default "Refresh Application"? Does it repro if you are using "Remove Application"?

abatishchev commented 6 years ago

My local cluster gets so unstable that SFE can't manage it. It shows an error suggesting a restart, but that doesn't help either, nor does restarting the service. Only rebooting helps, which is very annoying.

abatishchev commented 6 years ago

Don't know whether this is related. FabricDCA.exe and FabricFAS.exe silently crash every minute and generate 170 MB of crash dumps at C:\SfDevCluster\Log\CrashDumps.

vturecek commented 6 years ago

@abatishchev can you make those crash dumps available somewhere? @rishirsinha might want to take a look at these FabricDCA and FabricFAS crashes.

rishirsinha commented 6 years ago

@anmolah can you add the right set of people to this thread?

abatishchev commented 6 years ago

Please grab them from \\alexbat-id1\CrashDumps

abatishchev commented 6 years ago

I'm troubleshooting an unrelated issue, turned on Fusion Logging and saw the following error:

=== Pre-bind state information ===
LOG: DisplayName = Microsoft.ServiceFabric.Data.Interfaces, Version=5.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35, processorArchitecture=AMD64
 (Fully-specified)
LOG: Appbase = file:///C:/SfDevCluster/Data/_App/_Node_0/__FabricSystem_App4294967295/FAS.Code.Current/
LOG: Initial PrivatePath = NULL
LOG: Dynamic Base = NULL
LOG: Cache Base = NULL
LOG: AppName = FabricFAS.exe
Calling assembly : (Unknown).
===
LOG: This bind starts in default load context.
LOG: No application configuration file found.
LOG: Using host configuration file: 
LOG: Using machine configuration file from C:\Windows\Microsoft.NET\Framework64\v4.0.30319\config\machine.config.
LOG: GAC Lookup was unsuccessful.
LOG: Attempting download of new URL file:///C:/SfDevCluster/Data/_App/_Node_0/__FabricSystem_App4294967295/FAS.Code.Current/Microsoft.ServiceFabric.Data.Interfaces.DLL.
LOG: Attempting download of new URL file:///C:/SfDevCluster/Data/_App/_Node_0/__FabricSystem_App4294967295/FAS.Code.Current/Microsoft.ServiceFabric.Data.Interfaces/Microsoft.ServiceFabric.Data.Interfaces.DLL.
LOG: Attempting download of new URL file:///C:/SfDevCluster/Data/_App/_Node_0/__FabricSystem_App4294967295/FAS.Code.Current/Microsoft.ServiceFabric.Data.Interfaces.EXE.
LOG: Attempting download of new URL file:///C:/SfDevCluster/Data/_App/_Node_0/__FabricSystem_App4294967295/FAS.Code.Current/Microsoft.ServiceFabric.Data.Interfaces/Microsoft.ServiceFabric.Data.Interfaces.EXE.
LOG: All probing URLs attempted and failed.

And indeed Microsoft.ServiceFabric.Data.Interfaces.dll is not present in C:\SfDevCluster\Data\_App\_Node_0\__FabricSystem_App4294967295\FAS.Code.Current\.

abatishchev commented 6 years ago

And another error:

=== Pre-bind state information ===
LOG: DisplayName = System.Fabric.Strings, Version=6.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35
 (Fully-specified)
LOG: Appbase = file:///C:/SfDevCluster/Data/_Node_0/Fabric/DCA.Code/
LOG: Initial PrivatePath = NULL
LOG: Dynamic Base = NULL
LOG: Cache Base = NULL
LOG: AppName = FabricDCA.exe
Calling assembly : (Unknown).
===
LOG: This bind starts in default load context.
LOG: Using application configuration file: C:\SfDevCluster\Data\_Node_0\Fabric\DCA.Code\FabricDCA.exe.Config
LOG: Using host configuration file: 
LOG: Using machine configuration file from C:\Windows\Microsoft.NET\Framework64\v4.0.30319\config\machine.config.
LOG: GAC Lookup was unsuccessful.
LOG: Attempting download of new URL file:///C:/SfDevCluster/Data/_Node_0/Fabric/DCA.Code/System.Fabric.Strings.DLL.
LOG: Attempting download of new URL file:///C:/SfDevCluster/Data/_Node_0/Fabric/DCA.Code/System.Fabric.Strings/System.Fabric.Strings.DLL.
LOG: Attempting download of new URL file:///C:/SfDevCluster/Data/_Node_0/Fabric/DCA.Code/System.Fabric.Strings.EXE.
LOG: Attempting download of new URL file:///C:/SfDevCluster/Data/_Node_0/Fabric/DCA.Code/System.Fabric.Strings/System.Fabric.Strings.EXE.
LOG: All probing URLs attempted and failed.

And again System.Fabric.Strings.dll is not present in C:\SfDevCluster\Data\_Node_0\Fabric\DCA.Code\

abatishchev commented 6 years ago

Does the code rely on these assemblies being present in the GAC? For me they weren't there. I registered them and will see whether that fixes those crashes.

Update: crashes are gone now.

Update 2: the said error seems to be gone too.

shaohaolin commented 6 years ago

+1 still having this problem. SDK 3.1.269, VS 2017 Enterprise.

iamalexmang commented 6 years ago

I too am facing the same issue. Figuring out what the best option is at this point and am seriously considering using a VM on Azure, non-domain joined or anything alike, and force down the installation of an older SDK (if possible).

shaohaolin commented 6 years ago

Rolling back to SDK 3.0.456 fixed my problem. SDK 3.1.269 installs Service Fabric runtime 6.2.269; could that be the cause?

aloneguid commented 6 years ago

Hey guys, any update on this one? Literally none of the developers in my team can make 6.2 work locally. Downgrading fixes the issue.

rishirsinha commented 6 years ago

Can you try uninstalling the previous version of the runtime, then rebooting the machine and reinstalling the latest version?

aloneguid commented 6 years ago

@rishirsinha thanks for a quick response. I've tried uninstalling runtime and sdk and manually cleaning up any leftovers, including SfDevCluster folder and the issue still persists. I can see the following errors in the event log:

CertCreateSelfSignCertificate failed: E_ACCESSDENIED
ipcServer->SecuritySettings.CreateSelfGeneratedCertSslServer error=S_OK
Fabric Node open failed with error code = E_ACCESSDENIED

happening on every retry of cluster creation.

rishirsinha commented 6 years ago

@aloneguid

This seems like a different issue.

@RajeetN

Rajeet any ideas what this might be?

aloneguid commented 6 years ago

lease_traces_6.2.274.9494_131720609089720816_0.zip

I've tried to run from the terminal to get more details, attaching output and traces here as well:

\Program Files\Microsoft SDKs\Service Fabric\ClusterSetup> .\DevClusterSetup.ps1 -CreateOneNodeCluster
WARNING: A local Service Fabric Cluster already exists on this machine and will be removed.
Do you want to continue [Y/N]?: y
Removing cluster configuration...
Cleaning existing certificates...
Certificates removed.
Stopping all logman sessions...
Cleaning log and data folder...

Using Cluster Data Root: C:\SfDevCluster\Data
Using Cluster Log Root: C:\SfDevCluster\Log

The generated json path is C:\Users\ivang\AppData\Local\Temp\tmp978A.tmp.json
Processing and validating cluster config.
Create node configuration succeeded
Starting service FabricHostSvc. This may take a few minutes...

Waiting for Service Fabric Cluster to be ready. This may take a few minutes...
Local Cluster ready status: 4% completed.
Local Cluster ready status: 8% completed.
Local Cluster ready status: 12% completed.
Local Cluster ready status: 17% completed.
Local Cluster ready status: 21% completed.
Local Cluster ready status: 25% completed.
Local Cluster ready status: 29% completed.
Local Cluster ready status: 33% completed.
Local Cluster ready status: 38% completed.
Local Cluster ready status: 42% completed.
Local Cluster ready status: 46% completed.
Local Cluster ready status: 50% completed.
Local Cluster ready status: 54% completed.
Local Cluster ready status: 58% completed.
Local Cluster ready status: 62% completed.
Local Cluster ready status: 67% completed.
Local Cluster ready status: 71% completed.
Local Cluster ready status: 75% completed.
Local Cluster ready status: 79% completed.
Local Cluster ready status: 83% completed.
Local Cluster ready status: 88% completed.
Local Cluster ready status: 92% completed.
Local Cluster ready status: 96% completed.
Local Cluster ready status: 100% completed.
WARNING: Service Fabric Cluster is taking longer than expected to connect.

Waiting for Naming Service to be ready. This may take a few minutes...
No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue.
Connect-ServiceFabricCluster : No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue.
At C:\Program Files\Microsoft SDKs\Service Fabric\Tools\Scripts\ClusterSetupUtilities.psm1:620 char:12
+     [void](Connect-ServiceFabricCluster @connParams)
+            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:) [Connect-ServiceFabricCluster], FabricException
    + FullyQualifiedErrorId : TestClusterConnectionErrorId,Microsoft.ServiceFabric.Powershell.ConnectCluster

manimaranm7 commented 6 years ago

I tried the same as @aloneguid and had some success. The only difference, I think, was that I had stopped the "Internet Connection Sharing (ICS)" Windows service before running the command to create the local cluster.

C:\Program Files\Microsoft SDKs\Service Fabric\ClusterSetup> .\DevClusterSetup.ps1 -CreateOneNodeCluster
WARNING: A local Service Fabric Cluster already exists on this machine and will be removed.
Do you want to continue [Y/N]?: y
Removing cluster configuration...
Cleaning existing certificates...
Certificates removed.
Stopping all logman sessions...
Cleaning log and data folder...

Using Cluster Data Root: C:\SfDevCluster\Data
Using Cluster Log Root: C:\SfDevCluster\Log

The generated json path is C:\Users\zunem\AppData\Local\Temp\tmp96A5.tmp.json
Processing and validating cluster config.
Create node configuration succeeded
Starting service FabricHostSvc. This may take a few minutes...

Waiting for Service Fabric Cluster to be ready. This may take a few minutes...
Local Cluster ready status: 4% completed.
Local Cluster ready status: 100% completed.

Waiting for Naming Service to be ready. This may take a few minutes...
Naming Service is ready now...

Local Service Fabric Cluster created successfully.

=================================================
## To connect using Powershell, open an a new powershell window and connect using 'Connect-ServiceFabricCluster' command (without any arguments)."

## To connect using Service Fabric Explorer, run ServiceFabricExplorer and connect using 'Local/OneBox Cluster'."

## To manage using Service Fabric Local Cluster Manager (system tray app), run ServiceFabricLocalClusterManager.exe"
=================================================
C:\Program Files\Microsoft SDKs\Service Fabric\ClusterSetup>

By the way, ICS has now automatically restarted. So, in effect, both ICS and the SF cluster are running and things look stable so far.

aloneguid commented 6 years ago

@manimaranm7 I've tried stopping ICS, didn't make a difference for me :(

BrainSlugs83 commented 6 years ago

I ran some commands to figure out what was using that port on my machine (can't remember now, but I just googled it), and it turned out it was the Internet Connection Sharing service (which is on by default in the latest windows updates -- even if you turn it off, and fully disable it, sometimes it turns itself back on...) -- anyway, anytime I have this issue, I just pop open services.msc and disable it again, and the problem goes away for a while (until ICS turns itself back on).

Considering that this app is having port conflicts with a well known, widely-deployed service, can you guys just update the port your app is using for local development?
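
The comment above does not say which commands were run or which port was in conflict. As one possible way to do the same check, here is a hedged Python sketch that parses the text produced by the Windows command `netstat -ano` and returns the process IDs bound to a given local port. The sample rows, port numbers and PIDs are made up for illustration only.

```python
def pids_using_local_port(netstat_output: str, port: int) -> set:
    """Return the PIDs whose local address ends in :port in `netstat -ano` style text."""
    pids = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with the protocol; header and blank lines are skipped.
        if len(fields) < 4 or fields[0] not in ("TCP", "UDP"):
            continue
        local_address, pid = fields[1], fields[-1]
        # rsplit keeps IPv6 addresses such as [::]:53 intact.
        if local_address.rsplit(":", 1)[-1] == str(port) and pid.isdigit():
            pids.add(int(pid))
    return pids


# Hypothetical sample; real output comes from running `netstat -ano` on Windows.
SAMPLE_NETSTAT = """
  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING       1044
  TCP    0.0.0.0:19000          0.0.0.0:0              LISTENING       5120
  UDP    0.0.0.0:53             *:*                                    3388
  UDP    [::]:53                *:*                                    3388
"""

if __name__ == "__main__":
    print(pids_using_local_port(SAMPLE_NETSTAT, 53))
```

On a real machine the input would be the captured output of `netstat -ano`, and each PID can then be looked up in Task Manager or with `tasklist` to see which service owns the port.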

BrainSlugs83 commented 6 years ago

@dbreshears -- sorry, I don't remember if I was able to do that or not -- I just know that unless I have the Internet Connection Sharing service disabled, that I can't deploy new code. [There is a 100% (inverse) correlation between that service running, and being able to deploy code to the local Service Fabric instance for me -- over the last 2 months, and on multiple machines, ICS has been the cause every single time.]

And no, I'm not using the Refresh Application setting*, instead, I'm using Remove Application.

(*The Refresh Application setting has always been really flaky for me, and I've got my own IActorStateManager implementation that's backed by Azure Cosmos Document DB, so it's not big deal if all the data in the cluster gets wiped every time I deploy or debug it; all I lose is reminders, and my actors just recreate those when the service comes up.)

dbreshears commented 6 years ago

@BrainSlugs83, I thought the ICS issue was resolved in the 6.2 release. Let us know if you are on that version and still seeing the issue.

ddobric commented 6 years ago

One reason this error happens inside Visual Studio is the following.

These steps lead to the error

Unable to determine whether the application is installed on the cluster or not

To work around it, select "Local.1Node.xml" in the first step.

It would be great if the team could change this behavior. This error message can mean many things and in this specific case didn't help me at all.

StenPetrov commented 5 years ago

@ddobric I tried publishing with Local.1Node.xml selected but I still can't get the app to run locally.

healthycola commented 5 years ago

This is still happening, any updates on this?

mikkelhegn commented 5 years ago

@healthycola - What symptoms are you seeing? There are a few different scenarios in this thread.

anantshankar17 commented 5 years ago

Does anyone still face the 'Unable to determine whether the application is installed on the cluster or not' errors with the latest runtime/SDKs? If yes, can you please provide repro steps?

avichalchum commented 5 years ago

I do still face the issue with the latest sdk. Resetting, restarting, nothing helps.

anantshankar17 commented 5 years ago

@avichalchum Can you please describe how you enter this state? What is the exact state of the cluster at that time? Is the fabrichost svc in the running state? To fix the issue I need to repro it myself to make out what is wrong.

avichalchum commented 5 years ago

@anantshankar17 I don't know how to enter this state, as it is always in this state. I restarted my computer, I force closed the Service Fabric local cluster manager and opened it again, I reset the cluster, I even uninstalled the SDK and reinstalled it. However, when I try to deploy the app in Visual Studio, deployment always fails with the error "Unable to determine whether the application is installed on the cluster or not". The fabrichost svc is in the running state at that time, the state of the cluster is healthy, and the cluster manager opens fine showing everything is good. Let me know what other information I can give.

ravipal commented 5 years ago

@avichalchum Visual Studio runs the "Connect-ServiceFabricCluster" and "Get-ServiceFabricApplication -ApplicationName {appname}" commands to check whether the application is deployed on the local cluster or not. It seems these commands fail. When you hit the error in VS, please open a PowerShell window, run these commands, and see if they work.

NorthHighlandNicole commented 4 years ago

I have exactly the SAME issue as @avichalchum. There is no set of "steps to reproduce" this. It happens even when I start from a fresh reboot.

Ryanman commented 4 years ago

I frequently have this issue. I believe it relates to networking.

On a fresh reboot, I can often get a local debug to work. After connecting to my org's VPN, the cluster's state cannot be determined. Disconnecting from the VPN, resetting the cluster, and trying again does not work.

Our team has played around with creating additional lists of fallback DNS servers in internet settings. For some of us and in some locations this works (I.E. home vs office ISP), for others it doesn't. Our documentation's steps:

Step 1: Stop the cluster in Service Fabric via the icon in the system tray.
Step 2: Open your network adapter settings for the current connection, right-click, and choose Properties.
Step 3: Double-click IPv4 and click the Advanced button.
Step 4: Click the DNS tab and add, in this order: the DNS servers of your internet provider, the VPN DNS (172.0.0.1), and your local IP address, making sure your local IP address is last in the list (this is the important part).
Step 5: Close all the open dialog boxes.
Step 6: Right-click the SF icon in the system tray and start the cluster.

Once the cluster has started, you should see that your IP address is still listed third in the DNS server list. You should be able to debug normally now in Visual Studio.

larlew commented 4 years ago

@Ryanman With regards to the DNS settings above, are you setting the dns settings for the local network connection or the VPN adapter connection?

Ryanman commented 4 years ago

Larlew - this was for the local network connection, not the VPN I believe.

My suggestion comes with large caveats, but if you're at your wit's end it's worth a shot. The VPN implementation in the environment where I work with SF is by far the worst I've ever seen, with extremely unstable split tunneling etc. It's not a recipe for success with SF.

sukhovy commented 3 years ago

I faced the same issue. The actual problem was in the publish profile. The ConnectionEndpoint parameter was empty: <ClusterConnectionParameters ConnectionEndpoint="" />

After removing the ConnectionEndpoint parameter it works well. I don't think this is the only cause of this problem, but my solution may be useful for someone.
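
To spot the empty ConnectionEndpoint case described above without opening each profile by hand, a small check along these lines might help. This is only a sketch in Python: the function name is hypothetical, and the wrapping PublishProfile element in the usage example is an assumption about the file layout. Only the ClusterConnectionParameters element and its ConnectionEndpoint attribute come from this thread.

```python
import xml.etree.ElementTree as ET


def has_empty_connection_endpoint(profile_xml: str) -> bool:
    """Return True if any ClusterConnectionParameters element carries a blank ConnectionEndpoint."""
    root = ET.fromstring(profile_xml)
    for elem in root.iter():
        # Strip a possible XML namespace prefix of the form {uri}Name.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name != "ClusterConnectionParameters":
            continue
        endpoint = elem.get("ConnectionEndpoint")
        # A missing attribute is fine; only a present-but-blank one is flagged.
        if endpoint is not None and endpoint.strip() == "":
            return True
    return False


if __name__ == "__main__":
    # Assumed minimal profile layout, for illustration only.
    sample = '<PublishProfile><ClusterConnectionParameters ConnectionEndpoint="" /></PublishProfile>'
    print(has_empty_connection_endpoint(sample))
```

The check treats a missing attribute as fine and only flags an attribute that is present but blank, which matches the fix described above (removing the attribute).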