hkuizenga closed 4 years ago
Any feedback on the above question? As I understand, our configuration is still supported and we have production clusters that we'll need to add nodes to shortly. Thanks.
AddNode.ps1 is just a helper script. You can take a look at the commands being invoked in the script, and it should be a small modification if you want to run it within a secure cluster. For adding nodes you basically need to invoke the Add-ServiceFabricNode cmdlet. The script wraps it up for you.
If I modify AddNode.ps1 to have -WindowsCredential on the Connect-ServiceFabricCluster call, should that be enough?
Yes that should work. Please make sure you have internet access or explicitly provide the $FabricRuntimePackagePath.
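A minimal sketch of the two calls discussed above, as they might be run on the machine being added to a Windows-secured cluster. The endpoint, node name, node type, IP address, upgrade/fault domains, and package path are all placeholder assumptions, and this is not the actual content of AddNode.ps1.

```powershell
# Hypothetical values throughout; substitute your cluster's own.
$endpoint = "existing-node.contoso.local:19000"

# -WindowsCredential tells the client to authenticate with Windows credentials.
Connect-ServiceFabricCluster -ConnectionEndpoint $endpoint -WindowsCredential

# Add the local machine as a node, pointing at a locally available runtime CAB
# so that no internet access is needed.
Add-ServiceFabricNode -NodeName "NewNode01" `
    -NodeType "NodeType0" `
    -IpAddressOrFQDN "10.0.0.14" `
    -UpgradeDomain "UD4" `
    -FaultDomain "fd:/dc1/r4" `
    -FabricRuntimePackagePath "C:\Temp\MicrosoftAzureServiceFabric.cab"
```

The runtime package should match the version the cluster is currently running.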
That made progress but ultimately failed. Powershell output is attached below. Any insight on what we need to do to get past this?
Is this machine part of the machine list or machine group whitelisted under "ClusterIdentity"? Can you also make sure there is no pre-existing Fabric installation on this machine? You might want to run CleanFabric.ps1 on this machine.
Yes it is. I ran CleanFabric.ps1, re-tried and got the same result. Could the following two lines in the error be a clue?
4/11/2018 11:31:25 AM - PROD-IN-4 - Error - Config for Hosting.FirewallPolicyEnabled changed but this entry can not be dynamically updated
4/11/2018 11:31:25 AM - PROD-IN-4 - Error - Configuration can not be updated dynamically for Hosting/FirewallPolicyEnabled
What else can we look at?
I may be making progress. We realized that the machine group we have defined in "ClusterIdentity" wasn't a member of the new node's Administrators group. We resolved that, and the firewall messages at least seem to have gone away. Now we just get a lot of these:
4/11/2018 11:35:36 AM - PROD-IN-4 - Error - communication open failed with FABRIC_E_INVALID_CONFIGURATION
4/11/2018 11:35:36 AM - PROD-IN-4 - Error - Fabric Node open failed with error code = FABRIC_E_INVALID_CONFIGURATION
... followed eventually by (in the FabricInstallerService... trace file):
2018-04-13 22:27:04.424,Info ,852,General.FabricInstallerServiceImpl,service stopping (shutdown = false)
...
2018-04-13 22:27:04.424,Info ,852,General.FabricInstallerServiceImpl,Stop FabricUpgradeManager called
2018-04-13 22:27:04.424,Info ,1044,General.FabricInstallerServiceImpl,Close FabricUpgradeManager, with timeout 5:00.000
... and ultimately followed by (also in FabricInstallerService trace file):
2018-04-13 22:27:38.575,Error ,4788,FabricInstallerService.FabricUpgradeManager,Target information file exists. This would indicate that Fabric node open or Fabric uninstall didn't happen successfully. Rolling back..
2018-04-13 22:27:38.575,Warning ,4788,FabricInstallerService.FabricUpgradeManager,Rollback cannot be performed since the current installation is not present or invalid
2018-04-13 22:27:38.575,Warning ,4788,FabricInstallerService.FabricUpgradeManager,Upgrade finished with error FABRIC_E_UPGRADE_FAILED
The new node must get its configuration information from the existing nodes, so I ran Test-ServiceFabricConfiguration on my ClusterConfig file from one of the existing seed nodes. The attached file has the output.
I do see errors regarding a previous installation. How do I resolve these? Could this be causing my AddNode problem? Anything else here that might lead me to a resolution?
Thanks!
So, I suspect that the errors from Test-ServiceFabricConfiguration above appeared because Service Fabric is already installed on those nodes. I re-ran TestConfiguration.ps1 using a config file with the new node defined, as well as the current config as the "old" parameter, and it all passed. I ran this on the node I'm trying to add.
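A rough sketch of the validation run described above, executed from the standalone package folder on the node being added. Both file names are hypothetical, and the parameter names are recalled from the standalone package's TestConfiguration.ps1, so check them against your copy of the script before relying on them.

```powershell
# Hypothetical file names: the proposed config includes the new node,
# the "old" config is the one the cluster is currently running.
.\TestConfiguration.ps1 `
    -ClusterConfigFilePath ".\ClusterConfig.WithNewNode.json" `
    -OldClusterConfigFilePath ".\ClusterConfig.Current.json"
```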
In addition, after each attempt, SF explorer shows a new node named nodeid:c995cea53f8e67e98b84c58185b53e10 (marked as Down), which I have to explicitly remove using Remove Node State. The node name is exactly the same after each attempt. Perhaps this is a clue to the problem?
My cluster is currently at v6.1.472.9494 and I'm using the CAB file for that version with each AddNode attempt I make.
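The same cleanup can also be done from PowerShell rather than SF explorer. This is a sketch that assumes an existing connection to the cluster (for example via Connect-ServiceFabricCluster with -WindowsCredential); the node name is the one reported in this thread.

```powershell
# List nodes the cluster currently considers Down.
Get-ServiceFabricNode |
    Where-Object { $_.NodeStatus -eq 'Down' } |
    Select-Object NodeName, NodeStatus

# Tell the cluster that the state on the failed node is gone for good.
# Only do this for a node that will not come back with its old state.
Remove-ServiceFabricNodeState -NodeName "nodeid:c995cea53f8e67e98b84c58185b53e10" -Force
```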
Any additional thoughts would be much appreciated.
During my last attempt, I watched Event Viewer to see if there were any hints about what went wrong and when. The first three warnings I saw were as follows:
Copy of Powershell from C:\Program Files\Microsoft Service Fabric\bin\WindowsFabric to C:\Windows\system32\WindowsPowerShell\v1.0\Modules\WindowsFabric Failed
CreateFileW failed: file=\\?\C:\ProgramData\SF\FabricHostSettings.xml error=32
ParseConfigSettings: ErrorCode=E_FAIL, FileName=C:\ProgramData\SF\FabricHostSettings.xml
I've verified that I have administrator rights on the new node and on all existing nodes in the cluster, and that I'm running PowerShell as Administrator.
Any thoughts?
Linking the duplicate issue from MSDN: https://social.msdn.microsoft.com/Forums/azure/en-US/b5cbaa3b-2d4a-4817-b3c1-8515f570fd56/unable-to-add-node-to-onpremise-cluster?forum=AzureServiceFabric
@rakshitatandon, @dkkapur, @rishirsinha any other suggestions on this? Not sure if we should just remove the node and start again with AddNode.ps1, including -WindowsCredential from the beginning.
To clarify, I've removed the partially installed node (nodeid:c995cea53f8e67e98b84c58185b53e10) a number of times and re-installed with AddNode.ps1 w/ -WindowsCredential. Each time it produces the same results documented above.
Thanks for linking the issue above. I submitted it there as well in an attempt to get more eyes on the issue.
From the above, it seems like the Fabric node is failing to open. My guess is that this is most likely an authentication failure. Can you share the JSON you are using? We'll also need access to the traces from the cluster to determine why authentication is failing. Is the machine you are adding domain joined, and is that domain whitelisted in your configuration?
Thanks @rakshitatandon. I'll get that information for you tomorrow. Is there a way I can get you that information without posting it on the forum?
Yes please mail it to ratando@microsoft.com.
Thank you @rakshitatandon for your help! A review of the trace files revealed that we hadn't installed and ACL'ed the cert associated with our reverse proxy on our new node. After resolving this, AddNode.ps1 succeeded. However, running the ClusterConfig upgrade produced the following error on one of our other nodes.
We're working through this error over e-mail and I'll post the resolution once we have it. In the meantime, there are three improvements I would suggest so far:
Thanks!
It looks like the issue above is unrelated to adding a node. I re-ran the cluster upgrade with -MaxPercentUnhealthyNodes = 20 and we made it through the upgrade successfully, completing the add node process. As such, I'm going to close out this thread. Thank you to the MS folks for all of your help!
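For reference, a sketch of how the configuration upgrade with a relaxed node health threshold might be invoked. The config path is a placeholder, and the session is assumed to be already connected to the cluster. In PowerShell the parameter value follows the name without an equals sign.

```powershell
# Hypothetical path to the updated cluster config that includes the new node.
Start-ServiceFabricClusterConfigurationUpgrade `
    -ClusterConfigPath ".\ClusterConfig.WithNewNode.json" `
    -MaxPercentUnhealthyNodes 20

# Poll the progress of the configuration upgrade.
Get-ServiceFabricClusterConfigurationUpgradeStatus
```

Raising the threshold lets the upgrade proceed while some nodes report unhealthy, so it is worth confirming why they are unhealthy before relying on it.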
Please reopen this case. AddNode.ps1 should have an option to use Windows authentication out of the box.
Where do I add -WindowsAuthentication in the AddNode.ps1 script? Kindly help me with that.
In the AddNode.ps1 script, there is a call to Connect-ServiceFabricCluster. You should modify AddNode.ps1 to add -WindowsCredential to that call.
@artisticcheese Am assuming @Phineas019 wants to use that :)
Thanks, guys, for your wonderful response. I managed to find a solution by making all the existing parameters mandatory. It no longer throws any error: it connects to the existing SF cluster and gets the fabric runtime from it. Now it's running without error.
This thread is really for Microsoft folks to fix the problem and not close the issue as it was before.
@Phineas019 Thanks for confirming. I will see if we can add a param on AddNode.ps1 to take in 'WindowsCredential' and conditionalize it.
A fix for the script is checked in. It will ship with the 6.4 release.
Closing this issue as the flag was added and shipped as part of last release.
I am attempting to add a node to an existing cluster secured with Windows Security (but not using gMSA). When I run AddNode.ps1, I get the following warnings:
... followed by an error that begins with...
Looking at the implementation of AddNode.ps1, the Connect-ServiceFabricCluster call is inside the try block, so a failure to connect results in the "runtime package" error. Because the Connect-ServiceFabricCluster call does not pass the -WindowsCredential flag, I believe the connection is failing and is causing the errors above.
So the question is, how do I add a node to a cluster secured with Windows Security, but not using gMSA? The instructions at https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-windows-server-add-remove-nodes do not seem to cover this.
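To illustrate the masking described above, here is a sketch (not the actual contents of AddNode.ps1) of connecting in its own try block so that a connection failure is reported as such, rather than surfacing later as a runtime package error. The endpoint is a placeholder assumption.

```powershell
# Sketch only; the endpoint is hypothetical.
$endpoint = "existing-node.contoso.local:19000"

try {
    # -ErrorAction Stop turns a failed connection into a terminating error
    # so that the catch block below handles it.
    Connect-ServiceFabricCluster -ConnectionEndpoint $endpoint `
        -WindowsCredential -ErrorAction Stop | Out-Null
}
catch {
    Write-Error "Could not connect to ${endpoint}: $($_.Exception.Message)"
    return
}

# Only after a successful connection: resolve the runtime package
# and call Add-ServiceFabricNode.
```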