microsoft / service-fabric-issues

This repo is for the reporting of issues found with Azure Service Fabric.

AddNode.ps1 fails with on-premise cluster using Windows Security (not gMSA) #974

Closed hkuizenga closed 4 years ago

hkuizenga commented 6 years ago

I am attempting to add a node to an existing cluster secured with Windows Security (but not using gMSA). When I run AddNode.ps1, I get the following warnings:

WARNING: Failed to contact Naming Service. Attempting to contact Failover Manager Service...
WARNING: Failed to contact Failover Manager Service, Attempting to contact FMM...

... followed by an error that begins with...

Runtime package cannot be downloaded. Check you internet connectivity.

Looking at the implementation of AddNode.ps1, the Connect-ServiceFabricCluster call is inside the try block, so a failure to connect results in the "runtime package" error. Because that Connect-ServiceFabricCluster call does not pass the -WindowsCredential flag, I believe the connection is failing and causing the errors above.

So the question is, how do I add a node to a cluster secured with Windows Security, but not using gMSA? The instructions at https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-windows-server-add-remove-nodes do not seem to cover this.

hkuizenga commented 6 years ago

Any feedback on the above question? As I understand, our configuration is still supported and we have production clusters that we'll need to add nodes to shortly. Thanks.

rakshitatandon commented 6 years ago

AddNode.ps1 is just a helper script. You can take a look at the commands being invoked in the script and it should be a small modification if you want to run it within a secure cluster. For adding nodes you basically need to invoke Add-ServiceFabricNode cmdlet. The script wraps it up for you.

hkuizenga commented 6 years ago

If I modify AddNode.ps1 to have -WindowsCredential on the Connect-ServiceFabricCluster, should that be enough?

rakshitatandon commented 6 years ago

Yes, that should work. Please make sure you have internet access, or explicitly provide the $FabricRuntimePackagePath.
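The modification discussed above could be sketched roughly as follows. This is a minimal illustration, not the actual contents of AddNode.ps1: the helper name `Get-ClusterConnectionArgs`, its `-UseWindowsCredential` switch, and the endpoint value are hypothetical. The only cmdlet behavior assumed is that Connect-ServiceFabricCluster accepts `-ConnectionEndpoint` and the `-WindowsCredential` switch.

```powershell
# Hypothetical helper (not part of AddNode.ps1): build the arguments for
# Connect-ServiceFabricCluster, adding -WindowsCredential only when requested.
function Get-ClusterConnectionArgs {
    param(
        [Parameter(Mandatory = $true)]
        [string] $ConnectionEndpoint,    # placeholder, e.g. 'existing-node:19000'
        [switch] $UseWindowsCredential   # hypothetical switch for this sketch
    )

    $connectArgs = @{ ConnectionEndpoint = $ConnectionEndpoint }
    if ($UseWindowsCredential) {
        # -WindowsCredential is a switch parameter, so splat it as $true.
        $connectArgs['WindowsCredential'] = $true
    }
    return $connectArgs
}

# Usage (requires the ServiceFabric module and a reachable cluster):
# $connectArgs = Get-ClusterConnectionArgs -ConnectionEndpoint 'existing-node:19000' -UseWindowsCredential
# Connect-ServiceFabricCluster @connectArgs
```

Building the arguments as a hashtable and splatting them keeps the unsecured path unchanged while letting a Windows-secured cluster opt in to the extra switch.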

hkuizenga commented 6 years ago

That made progress but ultimately failed. The PowerShell output is attached below. Any insight on what we need to do to get past this?

AddNode Failure.txt

rakshitatandon commented 6 years ago

Is this machine part of the machine list/machine group whitelisted under "ClusterIdentity"? Can you also make sure there is no pre-existing Fabric installation on this machine? You might want to run CleanFabric.ps1 on this machine.

hkuizenga commented 6 years ago

Yes it is. I ran CleanFabric.ps1, re-tried and got the same result. Could the following two lines in the error be a clue?

4/11/2018 11:31:25 AM - PROD-IN-4 - Error - Config for Hosting.FirewallPolicyEnabled changed but this entry can not be dynamically updated
4/11/2018 11:31:25 AM - PROD-IN-4 - Error - Configuration can not be updated dynamically for Hosting/FirewallPolicyEnabled

What else can we look at?

hkuizenga commented 6 years ago

I may be making progress. We realized that the machine group we have defined in "ClusterIdentity" wasn't a member of the new node's Administrators group. We resolved that, and the firewall messages at least seem to have gone away. Now we just get a lot of these:

4/11/2018 11:35:36 AM - PROD-IN-4 - Error - communication open failed with FABRIC_E_INVALID_CONFIGURATION
4/11/2018 11:35:36 AM - PROD-IN-4 - Error - Fabric Node open failed with error code = FABRIC_E_INVALID_CONFIGURATION

... followed eventually by (in the FabricInstallerService... trace file):

2018-04-13 22:27:04.424,Info ,852,General.FabricInstallerServiceImpl,service stopping (shutdown = false)
...
2018-04-13 22:27:04.424,Info ,852,General.FabricInstallerServiceImpl,Stop FabricUpgradeManager called
2018-04-13 22:27:04.424,Info ,1044,General.FabricInstallerServiceImpl,Close FabricUpgradeManager, with timeout 5:00.000

... and ultimately followed by (also in FabricInstallerService trace file):

2018-04-13 22:27:38.575,Error ,4788,FabricInstallerService.FabricUpgradeManager,Target information file exists. This would indicate that Fabric node open or Fabric uninstall didn't happen successfully. Rolling back..
2018-04-13 22:27:38.575,Warning ,4788,FabricInstallerService.FabricUpgradeManager,Rollback cannot be performed since the current installation is not present or invalid
2018-04-13 22:27:38.575,Warning ,4788,FabricInstallerService.FabricUpgradeManager,Upgrade finished with error FABRIC_E_UPGRADE_FAILED

The new node must get its configuration information from the existing nodes, so I ran Test-ServiceFabricConfiguration on my ClusterConfig file from one of the existing seed nodes. The attached file has the output.

I do see errors regarding a previous installation. How do I resolve these? Could this be causing my AddNode problem? Anything else here that might lead me to a resolution?

Thanks!

BPA Trace Output.txt

hkuizenga commented 6 years ago

I suspect that the Test-ServiceFabricConfiguration results above came from the fact that Service Fabric is already installed on those nodes. I re-ran TestConfiguration.ps1 on the node I'm trying to add, using a config file with the new node defined and the current config as the "old" parameter, and it all passed.

In addition, after each attempt, SF explorer shows a new node named nodeid:c995cea53f8e67e98b84c58185b53e10 (marked as Down), which I have to explicitly remove using Remove Node State. The node name is exactly the same after each attempt. Perhaps this is a clue to the problem?

My cluster is currently at v6.1.472.9494 and I'm using the CAB file for that version with each AddNode attempt I make.

Any additional thoughts would be much appreciated.

hkuizenga commented 6 years ago

During my last attempt, I watched Event Viewer to see if there were any hints about what went wrong and when. The first three warnings I saw were as follows:

Copy of Powershell from C:\Program Files\Microsoft Service Fabric\bin\WindowsFabric to C:\Windows\system32\WindowsPowerShell\v1.0\Modules\WindowsFabric Failed

CreateFileW failed: file=\\?\C:\ProgramData\SF\FabricHostSettings.xml error=32

ParseConfigSettings: ErrorCode=E_FAIL, FileName=C:\ProgramData\SF\FabricHostSettings.xml

I've verified that I have administrator rights on the new node and on all existing nodes in the cluster, and that I'm running PowerShell as Administrator.

Any thoughts?

mimckitt commented 6 years ago

Linking the duplicate issue from MSDN: https://social.msdn.microsoft.com/Forums/azure/en-US/b5cbaa3b-2d4a-4817-b3c1-8515f570fd56/unable-to-add-node-to-onpremise-cluster?forum=AzureServiceFabric

@rakshitatandon, @dkkapur, @rishirsinha any other suggestions on this? Not sure if we should just remove the node and start again with AddNode.ps1, including -WindowsCredential from the beginning.

hkuizenga commented 6 years ago

To clarify, I've removed the partially installed node (nodeid:c995cea53f8e67e98b84c58185b53e10) a number of times and re-installed with AddNode.ps1 w/ -WindowsCredential. Each time it produces the same results documented above.

Thanks for linking the issue above. I submitted it there as well in an attempt to get more eyes on the issue.

rakshitatandon commented 6 years ago

From the above, it seems the Fabric node is failing to open. My guess is that this is most likely an authentication failure. Can you share the JSON you are using? We'll also need access to the traces from the cluster to determine why authentication is failing. Is the machine you are adding domain-joined, and is that domain whitelisted in your configuration?

hkuizenga commented 6 years ago

Thanks @rakshitatandon. I'll get that information for you tomorrow. Is there a way I can get you that information without posting it on the forum?

rakshitatandon commented 6 years ago

Yes please mail it to ratando@microsoft.com.

hkuizenga commented 6 years ago

Thank you @rakshitatandon for your help! A review of the trace files revealed that we hadn't installed and ACL'ed the cert associated with our reverse proxy on our new node. After resolving this, AddNode.ps1 succeeded. However, running the ClusterConfig upgrade produced the following error on one of our other nodes.

[image: screenshot of the error]

We're working through this error over e-mail and I'll post the resolution once we have it. In the meantime, there are three improvements I would suggest so far:

  1. AddNode.ps1 needs a parameter to allow it to connect using Windows Credentials
  2. TestConfiguration.ps1 should produce an error if necessary certs are not installed
  3. Trace file content needs to be accessible w/o Microsoft Support

Thanks!

hkuizenga commented 6 years ago

It looks like the issue above is unrelated to adding a node. I re-ran the cluster upgrade with -MaxPercentUnhealthyNodes set to 20 and we made it through the upgrade successfully, completing the add node process. As such, I'm going to close out this thread. Thank you to the MS folks for all of your help!
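For readers who want to repeat that last step, a rough sketch follows. It assumes the upgrade was started with Start-ServiceFabricClusterConfigurationUpgrade, which the thread does not state explicitly; the wrapper name `Start-TolerantConfigUpgrade` and the config path are hypothetical.

```powershell
# Hypothetical wrapper: start a cluster configuration upgrade while tolerating
# a percentage of unhealthy nodes (20 matches the value reported in this thread).
function Start-TolerantConfigUpgrade {
    param(
        [Parameter(Mandatory = $true)]
        [string] $ClusterConfigPath,            # path to the updated ClusterConfig JSON
        [ValidateRange(0, 100)]
        [int] $MaxPercentUnhealthyNodes = 20
    )

    # Assumes the ServiceFabric module is loaded and
    # Connect-ServiceFabricCluster has already succeeded in this session.
    Start-ServiceFabricClusterConfigurationUpgrade `
        -ClusterConfigPath $ClusterConfigPath `
        -MaxPercentUnhealthyNodes $MaxPercentUnhealthyNodes
}

# Usage (placeholder path):
# Start-TolerantConfigUpgrade -ClusterConfigPath 'C:\path\to\ClusterConfig.json'
```

Relaxing the health policy lets the upgrade proceed past an unhealthy node, so it is worth confirming that the unhealthy node is understood before relying on it.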

artisticcheese commented 6 years ago

Please reopen this case. AddNode.ps1 should have an option to use Windows authentication out of the box.

phineas-in commented 6 years ago

Where should I add -WindowsAuthentication in the AddNode.ps1 script? Kindly help me with that.

jkochhar commented 6 years ago

In the AddNode.ps1 script, there is a call to Connect-ServiceFabricCluster. You should modify AddNode.ps1 to add -WindowsCredential to that call.

jkochhar commented 6 years ago

@artisticcheese Am assuming @Phineas019 wants to use that :)

phineas-in commented 6 years ago

Thanks, guys, for your wonderful responses. I managed to find a solution by making all the existing parameters mandatory. It no longer throws any error: it connects to the existing SF cluster and gets the fabric runtime from it. Now it's running without error.

artisticcheese commented 6 years ago

This thread is really for Microsoft folks to fix the problem and not close the issue as it was before.

jkochhar commented 6 years ago

@Phineas019 Thanks for confirming. I will see if we can add a param on AddNode.ps1 to take in 'WindowsCredential' and conditionalize it.

jkochhar commented 6 years ago

A fix for the script is checked in. It will ship with the 6.4 release.

jkochhar commented 4 years ago

Closing this issue, as the flag was added and shipped as part of the last release.