microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/
MIT License

[BUG] Cannot successfully setup dev cluster locally: "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue." #1259

Open emmet-m opened 3 years ago

emmet-m commented 3 years ago

Describe the bug: I cannot successfully set up any cluster on my machine, either from the taskbar menu or from PowerShell. Whenever I run the script, for either 1 node or 5 nodes, I get the following output:

image

Note that this takes a LONG time. Setting up a one-node cluster from the taskbar menu on my old laptop takes much less time (and succeeds) than on my dev box, which is much more powerful.

See copy of log file pasted below.

To fix this, I've tried the following solutions, all of which have failed:

Area/Component: SDK

To Reproduce: I was unable to reproduce this on another machine... I have no idea what's causing it.

Expected behavior: The cluster is set up successfully.

Service Fabric Runtime Version:

Environment:

Verbatim log file (C:\SfDevCluster\Log\DevClusterSetup.log)

**********************
Windows PowerShell transcript start
Start time: 20210827001511
Username: REDMOND\emmurra
RunAs User: REDMOND\emmurra
Configuration Name: 
Machine: EM-DESKTOP (Microsoft Windows NT 10.0.19043.0)
Host Application: C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe
Process ID: 14720
PSVersion: 5.1.19041.1151
PSEdition: Desktop
PSCompatibleVersions: 1.0, 2.0, 3.0, 4.0, 5.0, 5.1.19041.1151
BuildVersion: 10.0.19041.1151
CLRVersion: 4.0.30319.42000
WSManStackVersion: 3.0
PSRemotingProtocolVersion: 2.3
SerializationVersion: 1.1.0.1
**********************
Transcript started, output file is C:\SfDevCluster\Log\DevClusterSetup.log
Performing Stop-Service on: FabricHostSvc . This may take a few minutes...
Create node configuration succeeded
Performing Start-Service on: FabricHostSvc . This may take a few minutes...

Waiting for Service Fabric Cluster to be ready. This may take a few minutes...
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 4% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 8% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 12% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 17% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 21% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 25% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 29% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 33% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 38% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 42% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 46% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 50% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 54% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 58% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 62% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 67% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 71% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 75% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 79% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 83% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 88% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 92% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 96% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
Local Cluster ready status: 100% completed.
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
WARNING: Service Fabric Cluster is taking longer than expected to connect.

Waiting for fabric:/System/NamingService to be ready. This may take a few minutes...
PS>TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
>> TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
>> TerminatingError(Connect-ServiceFabricCluster): "No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue."
No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue.
**********************
Windows PowerShell transcript end
End time: 20210827002446
**********************
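
As a rough way to summarize a transcript like the one above, the following Python sketch counts the Connect-ServiceFabricCluster failures and computes the elapsed time from the "Start time" and "End time" stamps. The function name summarize_transcript is hypothetical, and the file encoding in the usage comment is an assumption. For the stamps shown here (20210827001511 to 20210827002446) the elapsed time works out to 575 seconds, so the setup script retried for roughly nine and a half minutes before giving up.

```python
import re
from datetime import datetime

# Timestamp layout used by the transcript header/footer, e.g. 20210827001511.
STAMP_FORMAT = "%Y%m%d%H%M%S"


def summarize_transcript(text):
    """Return (elapsed_seconds, connect_error_count) for a PowerShell transcript.

    elapsed_seconds is None when either timestamp is missing.
    """
    start = re.search(r"^Start time: (\d{14})", text, re.MULTILINE)
    end = re.search(r"^End time: (\d{14})", text, re.MULTILINE)

    elapsed = None
    if start and end:
        started = datetime.strptime(start.group(1), STAMP_FORMAT)
        ended = datetime.strptime(end.group(1), STAMP_FORMAT)
        elapsed = (ended - started).total_seconds()

    # Every failed connection attempt is logged with this marker.
    errors = len(re.findall(r"TerminatingError\(Connect-ServiceFabricCluster\)", text))
    return elapsed, errors


# Example usage (path taken from the issue; the encoding is an assumption):
# with open(r"C:\SfDevCluster\Log\DevClusterSetup.log",
#           encoding="utf-8-sig", errors="replace") as handle:
#     print(summarize_transcript(handle.read()))
```

A steadily climbing "ready status" percentage with an error on every attempt suggests the percentage tracks the retry budget rather than actual cluster progress, although that reading is an inference from this log and not something the SDK documents here.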

Assignees: /cc @microsoft/service-fabric-triage

emmet-m commented 3 years ago

One more thing - I had Service Fabric running successfully until today. I hadn't run our project for a while (maybe a few weeks, a month) and all of a sudden it didn't work.

Also possibly related: #1227

acartcat commented 3 years ago

I have the exact same issue on both Windows 10 Pro and Windows 11. The closest I've got is in Windows events

acartcat commented 3 years ago

Sorted! I was running a 5-node cluster. I ran clean cluster, then configured a 1-node cluster, which came up without a problem. I then switched to a 5-node cluster and all was good. I'm thinking there were some remnants of the old version's cluster definition.

emmet-m commented 3 years ago

Unfortunately this didn't fix it for me. I've switched a few times between 5 and 1 node configs after cleaning up the old cluster and it didn't fix anything.

bmatthewson commented 3 years ago

I'm having this same issue. It had been working recently - possibly 4 or 5 days ago. A few notes...

I had a similar issue a few months ago and it pointed to my VM having control issues with host CPU resources. I had to ensure the host was set up with UEFI and that AMD SVM was enabled in the BIOS. I had to install Win11 in order to change the resource control setting from root to core in Hyper-V. I believe it was only available in VM version 9+. Some of those details are fuzzy right now.

It was all working great until I tried earlier today. I retraced my steps from above and it's still not connecting. I'm hoping I just missed something.

sowenzhang commented 3 years ago

Ugh, I am having exactly the same issue now, and it just suddenly showed up. I was able to run everything about 3 weeks ago, but today Service Fabric just refused to run. I've done everything: reboot, reinstall, cleanup, etc. Nothing helps.

This is frustrating.

rwardms commented 3 years ago

I had what sounds like a very similar issue and tried all sorts of debugging attempts. Eventually I found this comment in another local cluster-related thread: https://github.com/microsoft/service-fabric/issues/382#issuecomment-542356378

There it recommended trying to run the fabric host process directly using 'FabricHost.exe -c', which runs it in console mode, and for me it popped up a couple windows and showed my problem:

System Error : The code execution cannot proceed because MSVCR110.dll was not found. Reinstalling the program may fix this problem.

This file is from the Microsoft Visual C++ Redistributable for Visual Studio 2012: https://www.microsoft.com/en-us/download/details.aspx?id=30679

Once I executed that installer and selected "repair", I was able to create my local cluster again.

Hope this helps.

Robert
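
Before rerunning the installer, one quick check is whether the DLL is present at all. The Python sketch below looks for a file by name in a list of directories. The helper name find_dll is hypothetical, and the System32/SysWOW64 locations in the usage comment are an assumption about where the redistributable places the file. A missing file there does not prove FabricHost.exe cannot load it from somewhere else.

```python
from pathlib import Path


def find_dll(name, search_dirs):
    """Return paths under search_dirs whose file name equals name (case-insensitive)."""
    wanted = name.lower()
    hits = []
    for directory in search_dirs:
        directory = Path(directory)
        if not directory.is_dir():
            continue  # Skip locations that do not exist on this machine.
        for entry in directory.iterdir():
            if entry.name.lower() == wanted and entry.is_file():
                hits.append(entry)
    return hits


# Example usage on Windows (directory choice is an assumption):
# import os
# windir = Path(os.environ.get("WINDIR", r"C:\Windows"))
# found = find_dll("MSVCR110.dll", [windir / "System32", windir / "SysWOW64"])
# print(found if found else "MSVCR110.dll not found in the checked directories")
```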

emmet-m commented 3 years ago

@rwardms While that doesn't seem to have fixed my problem, that did help me make a lot of progress, thank you!

I went to C:\Program Files\Microsoft Service Fabric\bin and ran .\FabricHost.exe -c, which opened a console that filled with logs/error messages before closing a second later. This console kept reappearing and disappearing every second or so with the same logs. Eventually I managed to click the window fast enough, which paused the console, and mashing CTRL+A allowed one log to be printed out line by line. The most log output I could get was this:

FabricSetup.exe invoked with arguments (C:\Program Files\Microsoft Service Fabric\bin\Fabric\Fabric.Code\FabricSetup.exe /operation:addnodestate). Current Exe version 8.1.321.9590
Environment information Data Root C:\SfDevCluster\Data, Log Root C:\SfDevCluster\Log
Starting service eventlog
Starting service pla
Starting FolderACLManager::Install
Obtained exclusive file C:\SfDevCluster\Data\daclupdate.lock
Released exclusive file C:\SfDevCluster\Data\daclupdate.lock
Directory:C:\SfDevCluster\Data has been updated with ACL (Account|Sid) ServiceFabricAdministrators|S-1-5-21-2662639430-4145260883-3466995100-1016 ServiceFabricAllowedUsers|S-1-5-21-2662639430-4145260883-3466995100-1017
Obtained exclusive file C:\SfDevCluster\Log\daclupdate.lock
Released exclusive file C:\SfDevCluster\Log\daclupdate.lock
Directory:C:\SfDevCluster\Log has been updated with ACL (Account|Sid) ServiceFabricAdministrators|S-1-5-21-2662639430-4145260883-3466995100-1016
FolderACLManager::Install successful
Starting EventTraceInstaller::Install
EventTraceInstaller::Install successful
Starting CrashDumps::Install
CrashDumps::Install successful
Starting DriverInstallManager::Install
SFVolumeDiskService is not enabled (OnInstall).
Stopping Driver: LeasLayr.
Stopping Driver: KtlLogger.
DriverInstallManager::Install successful
Starting FabricDeployer::Install
CreateProcess Successful for CommandLine:FabricDeployer.exe. ProcessId:19316 MainThreadId:22768 ProcessHandle:24c
Configuration Deployment failed with error 0xffffffff
FabricDeployer::Install failed with error 0xffffffff
FabricDeployer::Install failed with error 0xffffffff, Rolling back
Starting FabricDeployer::Uninstall
CreateProcess Successful for CommandLine:FabricDeployer.exe /operation:Rollback. ProcessId:19484 MainThreadId:4948 ProcessHandle:248
FabricDeployer::Uninstall successful
Starting DriverInstallManager::Uninstall
SFVolumeDiskService is not enabled (OnUninstall).
Stopping Driver: LeasLayr.
Stopping Driver: KtlLogger.
DriverInstallManager::Uninstall successful
Starting CrashDumps::Uninstall
Reset crash dump location to default
CrashDumps::Uninstall successful
Starting EventTraceInstaller::Uninstall
EventTraceInstaller::Uninstall successful
S

As you can see, there are 3 lines in the middle that indicate some kind of failure:

Configuration Deployment failed with error 0xffffffff
FabricDeployer::Install failed with error 0xffffffff
FabricDeployer::Install failed with error 0xffffffff, Rolling back
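
For longer captures, lines like these can be pulled out mechanically. The Python sketch below extracts every "failed with error 0x..." line and also shows the code as a signed 32-bit integer; the function name failed_steps is hypothetical. 0xffffffff is -1 as a signed 32-bit value, which looks more like a generic failure exit code from the child process than a specific Win32 error, though that interpretation is an assumption.

```python
import re

# Matches e.g. "FabricDeployer::Install failed with error 0xffffffff, Rolling back".
FAILURE = re.compile(r"^(?P<what>.+?) failed with error (?P<code>0x[0-9a-fA-F]+)")


def failed_steps(log_text):
    """Return a list of (step, signed_32bit_code) for each failure line in log_text."""
    results = []
    for line in log_text.splitlines():
        match = FAILURE.match(line.strip())
        if not match:
            continue
        code = int(match.group("code"), 16)
        # Reinterpret the unsigned 32-bit value as signed (0xffffffff becomes -1).
        if code >= 1 << 31:
            code -= 1 << 32
        results.append((match.group("what"), code))
    return results


# Example usage with the three lines quoted above:
# print(failed_steps(captured_console_text))
```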

When I open Event Viewer and look under the "Error" tab, I see 3 Service Fabric sections:

image

The first section just contains the two messages "Install failed with error 0xffffffff" and "Install failed with error 0xffffffff, Rolling back" 30 times each, the second one has the message "Kernel crash upload is configured but failed to get kernel crash dump folder." 12 times, and the last one has this message:

System.IO.FileNotFoundException: Could not find file 'C:\Users\emmurra\AppData\Local\Temp\EM-DESKTOP-Server-ScaleMin.xml'.
File name: 'C:\Users\emmurra\AppData\Local\Temp\EM-DESKTOP-Server-ScaleMin.xml'
   at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
   at System.IO.FileStream.Init(String path, FileMode mode, FileAccess access, Int32 rights, Boolean useRights, FileShare share, Int32 bufferSize, FileOptions options, SECURITY_ATTRIBUTES secAttrs, String msgPath, Boolean bFromProxy, Boolean useLongPath, Boolean checkHost)
   at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share)
   at System.Fabric.FabricDeployer.XmlHelper.ReadXml[T](String fileName, String schemaFile)
   at System.Fabric.FabricDeployer.DeploymentParameters.CreateFromFile()
   at System.Fabric.FabricDeployer.CommandLineInfo.Parse(String[] args)
   at System.Fabric.FabricDeployer.Program.Main(String[] args)

... Bingo? (EM-DESKTOP is my PC name). Navigating to C:\Users\emmurra\AppData\Local\Temp\ shows me that the file above is indeed not there, and there is no *-Server-ScaleMin.xml of any kind either (one of the above fixes involved changing my computer name, which I thought might be a problem).
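
The manual check described above can be sketched in Python as follows. The function name scalemin_status is hypothetical, and the "<computer name>-Server-ScaleMin.xml" naming is inferred only from the single path in the stack trace. Note that the temp directory Python reports is the current user's, which is an assumption about where FabricDeployer looks.

```python
from pathlib import Path


def scalemin_status(temp_dir, computer_name):
    """Check for the ScaleMin manifest that FabricDeployer failed to open.

    Returns (expected_exists, other_matches), where other_matches lists any
    other *-Server-ScaleMin.xml file names found in temp_dir.
    """
    temp_dir = Path(temp_dir)
    expected = temp_dir / f"{computer_name}-Server-ScaleMin.xml"
    others = sorted(
        path.name
        for path in temp_dir.glob("*-Server-ScaleMin.xml")
        if path.name != expected.name
    )
    return expected.is_file(), others


# Example usage on Windows:
# import os, tempfile
# print(scalemin_status(tempfile.gettempdir(), os.environ["COMPUTERNAME"]))
```

A leftover file named after a previous computer name would show up in the second element of the result, which is relevant here because one of the attempted fixes involved renaming the machine.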

I can't think of anything else to try to debug this, but I think the Service Fabric team has enough to at least investigate this bug... Please feel free to email/ping me (emmurra) if you need more logs/info or want to run a screen-share session.

sowenzhang commented 3 years ago

ah, I actually forgot I left a comment here. My issue was resolved after finding this post: https://stackoverflow.com/a/38073418/598562

It worked twice on my machine. But I don't understand why we have to do that. lol

walkerrandolphsmith commented 2 years ago

@sowenzhang this worked for me as well.