microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/
MIT License

Azure SF fails to start when "NodeType" placement property is set. #707

Open oskarm93 opened 6 years ago

oskarm93 commented 6 years ago

I created an Azure SF cluster with 3 node types. On each node I added a placement property named "NodeType" with some values. I wasn't aware it was one of the built-in properties.

Now the cluster fails to start up with error:

System.Fabric.FabricDeployer.ClusterManifestValidationException: Cluster manifest validation failed with exception System.ArgumentException: Invalid property name NodeType found under PlacementProperties; it is duplicating one of the system defined placement property.
   at System.Fabric.Management.WindowsFabricValidator.FabricSettingsValidator.WriteError(String format, Object[] args)
   at System.Fabric.Management.WindowsFabricValidator.FabricSettingsValidator.CheckForSystemDefinedPlacementPropertyInNodeTypeInfo(KeyValuePairType[] keyValuePairs, String[] systemDefinedPlacementProperty, String sectionName)
   at System.Fabric.Management.WindowsFabricValidator.FabricSettingsValidator.VerifyNodeTypes()
   at System.Fabric.Management.WindowsFabricValidator.FabricSettingsValidator.ValidateSettings()
   at System.Fabric.Management.WindowsFabricValidator.FabricValidator.Validate()
   at System.Fabric.FabricDeployer.FabricValidatorWrapper.ValidateAndEnsureDefaultImageStore()
   at System.Fabric.FabricDeployer.FabricValidatorWrapper.ValidateAndEnsureDefaultImageStore()
   at System.Fabric.FabricDeployer.UpdateNodeStateOperation.OnExecuteOperation(DeploymentParameters parameters, ClusterManifestType clusterManifest, Infrastructure infrastructure)
   at System.Fabric.FabricDeployer.CreateorUpdateOperation.OnExecuteOperation(DeploymentParameters parameters, ClusterManifestType clusterManifest, Infrastructure infrastructure)
   at System.Fabric.FabricDeployer.DeploymentOperation.ExecuteOperationPrivate(DeploymentParameters parameters)
   at System.Fabric.FabricDeployer.DeploymentOperation.ExecuteOperation(DeploymentParameters parameters, Boolean disableFileTrace)
   at System.Fabric.FabricDeployer.Program.Main(String[] args)
Error: FabricDeployer.exe failed

If I try to remove the properties via the Azure Portal, the cluster will restart, but the change will not have any effect. The crash counter just keeps incrementing:

2018-09-03 20:07:00.2383|INFO|ServiceFabricNodeBootstrapAgent|Current fabric.exe crash count : 13

So far the only way I see of recovering this cluster is to trash it and rebuild a new one.

ashishnegi commented 6 years ago

If I try to remove the properties via the Azure Portal, the cluster will restart, but the change will not have any effect. The crash counter just keeps incrementing:

So you mean it is still crashing after you removed the property via the Azure Portal? Can you log in to the machine and check whether ClusterManifest.*.xml was updated?
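
One way to run that check is sketched below. It assumes the manifest follows the usual cluster manifest layout, with Property elements (each carrying a Name attribute) nested under a PlacementProperties element. The function names are hypothetical, and the directory to scan is left as a parameter because the thread does not say where the file lives on the node.

```python
import glob
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def placement_property_names(manifest_path):
    """Return the Name of every Property found under a PlacementProperties element.

    Assumption: the manifest uses <PlacementProperties><Property Name=... Value=.../>.
    """
    names = []
    for elem in ET.parse(manifest_path).getroot().iter():
        if local_name(elem.tag) == "PlacementProperties":
            for child in elem:
                if local_name(child.tag) == "Property":
                    names.append(child.get("Name"))
    return names


def scan_manifests(directory):
    """Map each ClusterManifest.*.xml in `directory` to its placement property names."""
    pattern = os.path.join(directory, "ClusterManifest.*.xml")
    return {path: placement_property_names(path) for path in sorted(glob.glob(pattern))}
```

If "NodeType" still shows up in the result after the portal change, the manifest on the node was not updated.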

oskarm93 commented 6 years ago

@ashishnegi Sorry, I had to delete the cluster and recreate it without the placement properties. When I tried to remove the placement property from the Azure portal, the nodes would try to start up again, but to no effect. The placement property would remain in the Azure portal after refreshing the page.

maburlik commented 6 years ago

Can you provide a sample of the way to set up these placement properties, so we can try to duplicate it on our end?

oskarm93 commented 6 years ago

@maburlik This is all you need. All other settings can be left as default: [screenshot]

After deploying, the VM will contain lots of FabricDeployer-*.trace files on the D: drive:

2018/09/19-07:02:29.914,Info,5616,FabricDeployer.FabricDeployer,Running deployer with Configure /fabricBinRoot: /fabricDataRoot:D:\SvcFab /fabricLogRoot:D:\SvcFab\Log /cm:C:\WindowsAzure\Logs\Plugins\Microsoft.Azure.ServiceFabric.ServiceFabricNode\1.1.0.2\TempClusterManifest.xml /oldClusterManifestString: /im:C:\WindowsAzure\Logs\Plugins\Microsoft.Azure.ServiceFabric.ServiceFabricNode\1.1.0.2\InfrastructureManifest.xml /instanceId: /targetVersion: /nodeName: /nodeTypeName: /runAsType: /runAsAccountName: /runAsPassword: /serviceStartupType: /output: /currentVersion: /error: /bootstrapMSIPath: /machineName: /fabricPackageRoot: /jsonClusterConfigLocation: /enableCircularTraceSession:False /continueIfContainersFeatureNotInstalled: /skipDeleteData:
2018/09/19-07:02:29.923,Info,5616,ImageStoreClient.ManagedFileLock,Obtained writer lock for D:\SvcFab\lock
2018/09/19-07:02:29.927,Info,5616,FabricDeployer.FabricDeployer,Executing Configure /fabricBinRoot: /fabricDataRoot:D:\SvcFab /fabricLogRoot:D:\SvcFab\Log /cm:C:\WindowsAzure\Logs\Plugins\Microsoft.Azure.ServiceFabric.ServiceFabricNode\1.1.0.2\TempClusterManifest.xml /oldClusterManifestString: /im:C:\WindowsAzure\Logs\Plugins\Microsoft.Azure.ServiceFabric.ServiceFabricNode\1.1.0.2\InfrastructureManifest.xml /instanceId: /targetVersion: /nodeName: /nodeTypeName: /runAsType: /runAsAccountName: /runAsPassword: /serviceStartupType: /output: /currentVersion: /error: /bootstrapMSIPath: /machineName: /fabricPackageRoot: /jsonClusterConfigLocation: /enableCircularTraceSession:False /continueIfContainersFeatureNotInstalled: /skipDeleteData:
2018/09/19-07:02:29.998,Info,5616,ImageBuilder.FabricDeployer,Host name: type1000000.
2018/09/19-07:02:30.015,Info,5616,ImageBuilder.FabricDeployer,Network interface ip address: fe80::a4a9:76fe:3742:3336%13.
2018/09/19-07:02:30.015,Info,5616,ImageBuilder.FabricDeployer,Network interface ip address: 10.0.0.4.
2018/09/19-07:02:30.015,Info,5616,ImageBuilder.FabricDeployer,Network interface ip address: fe80::58f6:6a78:3738:4b59%6.
2018/09/19-07:02:30.015,Info,5616,ImageBuilder.FabricDeployer,Network interface ip address: 172.22.64.1.
2018/09/19-07:02:30.015,Info,5616,ImageBuilder.FabricDeployer,Network interface ip address: ::1.
2018/09/19-07:02:30.015,Info,5616,ImageBuilder.FabricDeployer,Network interface ip address: 127.0.0.1.
2018/09/19-07:02:30.015,Info,5616,ImageBuilder.FabricDeployer,Network interface ip address: fe80::5efe:172.22.64.1%5.
2018/09/19-07:02:30.015,Info,5616,ImageBuilder.FabricDeployer,Network interface ip address: fe80::5efe:10.0.0.4%17.
2018/09/19-07:02:30.016,Info,5616,FabricDeployer.FabricDeployer,Running operation System.Fabric.FabricDeployer.ConfigureOperation
2018/09/19-07:02:30.027,Info,5616,FabricDeployer.FabricDeployer,Creating FabricDataRoot D:\SvcFab, if it doesn't exist on machine 
2018/09/19-07:02:30.027,Info,5616,FabricDeployer.FabricDeployer,Creating FabricLogRoot D:\SvcFab\Log, if it doesn't exist on machine 
2018/09/19-07:02:30.031,Info,5616,FabricDeployer.FabricDeployer,The current machine IP addresses are: fe80::a4a9:76fe:3742:3336%13, fe80::58f6:6a78:3738:4b59%6, 10.0.0.4, 172.22.64.1, 127.0.0.1, ::1
2018/09/19-07:02:30.155,Warning,5616,ImageBuilder.FabricDeployer,Deprecated configuration is used in a cluster manifest, with SectionName: Security and SettingName: ClientAuthAllowedCommonNames
2018/09/19-07:02:30.188,Info,5616,ImageStoreClient.ManagedFileLock,Released writer lock on D:\SvcFab\lock
2018/09/19-07:02:30.192,Error,5616,FabricDeployer.FabricDeployer,System.Fabric.FabricDeployer.ClusterManifestValidationException: Cluster manifest validation failed with exception System.ArgumentException: Invalid property name NodeType found under PlacementProperties; it is duplicating one of the system defined placement property.
       at System.Fabric.Management.WindowsFabricValidator.FabricSettingsValidator.WriteError(String format, Object[] args)
       at System.Fabric.Management.WindowsFabricValidator.FabricSettingsValidator.CheckForSystemDefinedPlacementPropertyInNodeTypeInfo(KeyValuePairType[] keyValuePairs, String[] systemDefinedPlacementProperty, String sectionName)
       at System.Fabric.Management.WindowsFabricValidator.FabricSettingsValidator.VerifyNodeTypes()
       at System.Fabric.Management.WindowsFabricValidator.FabricSettingsValidator.ValidateSettings()
       at System.Fabric.Management.WindowsFabricValidator.FabricValidator.Validate()
       at System.Fabric.FabricDeployer.FabricValidatorWrapper.ValidateAndEnsureDefaultImageStore()
       at System.Fabric.FabricDeployer.FabricValidatorWrapper.ValidateAndEnsureDefaultImageStore()
       at System.Fabric.FabricDeployer.ConfigureOperation.OnExecuteOperation(DeploymentParameters parameters, ClusterManifestType clusterManifest, Infrastructure infrastructure)
       at System.Fabric.FabricDeployer.DeploymentOperation.ExecuteOperationPrivate(DeploymentParameters parameters)
       at System.Fabric.FabricDeployer.DeploymentOperation.ExecuteOperation(DeploymentParameters parameters, Boolean disableFileTrace)
       at System.Fabric.FabricDeployer.Program.Main(String[] args)
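
With many FabricDeployer-*.trace files on disk, it can help to pull out only the Error entries. The following is a minimal sketch that assumes every entry has the comma-separated shape seen above (timestamp, level, a numeric id, source, message) and that indented stack frames belong to the preceding entry. The names are hypothetical, and the meaning of the third field is not stated in the thread.

```python
import re
from typing import NamedTuple

# A new entry starts with a timestamp such as 2018/09/19-07:02:30.192
ENTRY_START = re.compile(r"^\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3},")


class TraceEntry(NamedTuple):
    timestamp: str
    level: str
    numeric_id: str  # third field; its exact meaning is an assumption-free placeholder
    source: str
    message: str


def parse_trace(lines):
    """Group raw trace lines into entries, folding continuation lines into the message."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if ENTRY_START.match(line):
            # Split on the first four commas only; the message itself may contain commas.
            parts = line.split(",", 4)
            if len(parts) == 5:
                entries.append(TraceEntry(*parts))
                continue
        if entries and line.strip():
            last = entries[-1]
            entries[-1] = last._replace(message=last.message + "\n" + line.strip())
    return entries


def errors(entries):
    """Keep only the entries logged at Error level."""
    return [entry for entry in entries if entry.level == "Error"]
```

Reading a file with open(path) and passing the file object to parse_trace is enough, since it only iterates over lines.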

You can try to remove the property from the Azure portal, but I received an error: [screenshot]

Failed to save node type 'type1' for cluster 'github'. Error: There was an error processing your request. Try again in a few moments.

linmeng08 commented 6 years ago

Thanks for reporting. There is a list of strings (NodeType, NodeName, UpgradeDomain, FaultDomain) that can't be used as node type properties. We need to add validation on both the Azure portal side and the Service Fabric RP backend to block the deployment if any of those are entered. I will follow up.
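
Until that validation ships, a pre-deployment check along these lines could catch the problem early. This is only a sketch: the function names are hypothetical, the reserved list is the four names mentioned above, and the case-insensitive comparison is a conservative assumption rather than documented Service Fabric behavior.

```python
# Names listed in this thread as system-defined placement properties.
RESERVED_PLACEMENT_PROPERTIES = ("NodeType", "NodeName", "UpgradeDomain", "FaultDomain")


def find_reserved_names(placement_properties):
    """Return the user-supplied property names that clash with a reserved name.

    Comparison ignores case and surrounding whitespace (an assumption made to be safe).
    """
    reserved = {name.lower() for name in RESERVED_PLACEMENT_PROPERTIES}
    return [name for name in placement_properties if name.strip().lower() in reserved]


def validate_node_type(node_type_name, placement_properties):
    """Raise ValueError if any placement property on the node type uses a reserved name."""
    clashes = find_reserved_names(placement_properties)
    if clashes:
        raise ValueError(
            "Node type %r uses reserved placement property name(s): %s"
            % (node_type_name, ", ".join(clashes))
        )
```

Calling validate_node_type("type1", {"NodeType": "frontend"}) raises before anything is deployed, while a custom name such as "Tier" passes.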