microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/
MIT License

On premise Service Fabric Cluster - Reliability Level Model Validation preventing Node removal #619

Open Adebeer opened 5 years ago

Adebeer commented 5 years ago

Our production SF cluster recently experienced critical failure of 6 nodes - unfortunately, this production cluster became pretty much useless due to a bug that prevents further config upgrades.

Background: we have an 18-node cluster that spans 3 DCs. In each DC, I configured 3 nodes as primary seed nodes, giving a total of 9 seed nodes. The cluster uses gMSA security, and the SF version is 6.4.622.9590.
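For reference, the layout I'm describing looks roughly like this in the standalone ClusterConfig.json (node names, IPs, and DC names here are illustrative, not my actual values):

```json
{
  "nodes": [
    {
      "nodeName": "vm1",
      "iPAddress": "10.0.1.4",
      "nodeTypeRef": "NodeType0",
      "faultDomain": "fd:/dc1/rack1",
      "upgradeDomain": "UD0"
    },
    {
      "nodeName": "vm7",
      "iPAddress": "10.0.2.4",
      "nodeTypeRef": "NodeType0",
      "faultDomain": "fd:/dc2/rack1",
      "upgradeDomain": "UD1"
    }
  ]
}
```

i.e. the 18 nodes are spread across fault domains rooted at the 3 DCs, with 3 primary nodes per DC.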

The problem is that I couldn't remove these nodes because the number of seed nodes (and thus the reliability level) dropped from 9 (Platinum) to 6 (Silver). Worse still: because these nodes are invalid, I can't do any config upgrades until I remove them!

Start-ServiceFabricClusterConfigurationUpgrade : System.Runtime.InteropServices.COMException (-2147017627)
ValidationException: Model validation error. Removing a non-seed node and changing reliability level in the same upgrade is not supported. Initiate an upgrade to remove node first and then change the reliability level.
At line:1 char:1
+ Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath "AL ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (Microsoft.Servi...usterConnection:ClusterConnection) [Start-ServiceFa...gurationUpgrade], FabricException
    + FullyQualifiedErrorId : StartClusterConfigurationUpgradeErrorId,Microsoft.ServiceFabric.Powershell.StartClusterConfigurationUpgrade

The bug is causing another (serious) related problem in cluster planning. The reliability-level thresholds below 9 seed nodes are small and close together. In practice this means I must maintain an unreasonable number of redundant nodes above the 9-seed-node threshold to be sure I end up with the same reliability level after a failure.

In my case - having a cluster that spans 3 DCs (the recommended minimum for a multi-DC deployment) - it means:

For completeness' sake:

a) Get-ServiceFabricClusterConfiguration already does not return the nodes I want to remove.

b) If I make no change to the json file other than incrementing the config version, validation fails with NodesToBeRemoved not being specified.

c) If I add 1 node to NodesToBeRemoved, I still get the above validation error.

d) If I add all the nodes, I get the reliability level upgrade validation error.

e) I've already removed node states and uninstalled the SF SDK, leaving all 6 nodes in the "Invalid" state. Get-ServiceFabricClusterConfiguration does not return these 6 nodes, but they are still shown in SF Explorer and listed in the cluster manifest XML file.

f) I've tried re-adding the same nodes (but with different node names), but then I get a validation error saying the IP addresses for new nodes have to be the same. Fair enough... but I'm also not convinced you can add more than one node to a gMSA cluster via config upgrade - MSDocs seem to imply 1.

g) I can't see any way to change the reliability level. I recall this being configurable in earlier versions of SF, but it has since been changed to be dynamically calculated, and I can't see any reference to reliability level in the json or xml cluster configuration. I'm guessing that since I used to have 9 seed nodes the cluster was Platinum, and now, going to 6, it's changing to Silver. In any case, this doesn't help me because I can't control it.
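For anyone trying to reproduce (b)/(c): as I understand it from the validation messages, the NodesToBeRemoved list lives in the Setup section under fabricSettings, something like the fragment below (node names illustrative):

```json
"fabricSettings": [
  {
    "name": "Setup",
    "parameters": [
      {
        "parameterName": "NodesToBeRemoved",
        "parameterValue": "vm4,vm5"
      }
    ]
  }
]
```

Even with the section shaped like this, the reliability-level validation above still blocks the upgrade in my scenario.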

Could you please fix this issue ASAP! Honestly, this particular issue is making us consider moving away from SF.

Note: I have also mentioned this issue in #772 and #10813, but I think this particular scenario is a bit different and warrants its own discussion - apologies for the duplication.

Adebeer commented 5 years ago

Any help/insight on the above scenario?

I've reconfigured a new cluster with 18 nodes - however, I made them all primary for fear of the above issue. That said, SF currently has only 9 of these 18 as seed nodes - I presume that if there were some node failures, SF would auto-compensate by turning some of the other nodes into seed nodes in order to retain the minimum of 9.

The current configuration is also not optimal, though, because the last 6 nodes I added are in a different DC to the first 12 nodes - so this last DC has no seed nodes. I'm not sure if I can change/control this (other than perhaps manually/temporarily deactivating some nodes)?
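If temporarily deactivating nodes is the only lever, I'd expect it to look something like this (node name illustrative - and I haven't confirmed this actually causes SF to move seed node status between DCs):

```powershell
# Deactivate a node with Restart intent, hoping SF re-places its seed node role
Disable-ServiceFabricNode -NodeName "vm12" -Intent Restart

# Check where the seed nodes ended up
Get-ServiceFabricNode | Select-Object NodeName, NodeStatus, IsSeedNode

# Re-activate the node afterwards
Enable-ServiceFabricNode -NodeName "vm12"
```

It would be good to get confirmation from MS whether this is a supported way to influence seed node placement.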

My preference is to go back to what I originally had - make 3 nodes per DC primary - but I'm not comfortable doing this until I get some clarity from MS about the above situation.

Thanks in advance!

Adebeer commented 5 years ago

Ok, so with this new cluster, I did a configuration update to remove 1 seed node.

This worked as advertised - and as expected, one of my other nodes became a seed node.

So... I went through the exact same process to remove a 2nd node... but for whatever reason, this failed.

Essentially, this node is shown as in Error.

I used Get-ServiceFabricClusterConfiguration, and I can see this node still listed in the Nodes section as well as in NodesToBeRemoved.

However, I can't do any more config upgrades due to

Start-ServiceFabricClusterConfigurationUpgrade : System.Runtime.InteropServices.COMException (-2147017627)
ValidationException: Model validation error. NodesToBeRemoved parameter contains inconsistent information. To remove a node, remove it from the nodes list in the JSON config and
specify the correct node name in NodesToBeRemoved parameter.
At line:1 char:1
+ Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath "AL ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (Microsoft.Servi...usterConnection:ClusterConnection) [Start-ServiceFa...gurationUpgrade], FabricException
    + FullyQualifiedErrorId : StartClusterConfigurationUpgradeErrorId,Microsoft.ServiceFabric.Powershell.StartClusterConfigurationUpgrade

I tried removing the node from NodesToBeRemoved vs. just removing it from the Nodes section... but, similar to my original post, I'm now stuck!

So, to me it seems like one cannot safely remove nodes via pure config upgrades either.

Please help! This seems like a serious bug to me.

Adebeer commented 5 years ago

So, to resolve the above issue, here's what I had to do:

a) Use SF Explorer to remove node state - this changed the node state from Error to Invalid.

b) Get the latest json config via Get-ServiceFabricClusterConfiguration.

c) Remove the node from the Nodes section.

d) Completely remove the NodesToBeRemoved json section (you'll get the inconsistent-information error if you have an empty list of nodes to be removed, so just remove the containing json block).

e) Do a config upgrade.

This successfully removed the node. Note: Initially I tried just doing (b)-(e) above, but it didn't work and the node remained in the Error state.
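In cmdlet form, the workaround above is roughly the following (node name and file path are illustrative):

```powershell
# (a) Clear the failed node's state (SF Explorer's "Remove node state" does the same)
Remove-ServiceFabricNodeState -NodeName "vm7" -Force

# (b) Pull the latest cluster configuration
Get-ServiceFabricClusterConfiguration | Out-File "ClusterConfig.json"

# (c)/(d) Edit ClusterConfig.json by hand: delete the node from the Nodes section,
#         delete the entire NodesToBeRemoved block, and bump the config version

# (e) Run the configuration upgrade and watch its progress
Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath "ClusterConfig.json"
Get-ServiceFabricClusterConfigurationUpgradeStatus
```

The hand-editing in (c)/(d) is where it's easy to go wrong, which is why I think the tooling should handle this.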

I think this process is quite confusing and error-prone - and the existing MSDocs documentation doesn't really help, primarily because this is, in my view, a bug in how node removal works.

But at least for now I seem to have a workaround...

Adebeer commented 5 years ago

Some more info on the above. I did quite a bit of testing last week, adding and removing several nodes. I also removed enough nodes to drop the seed node count from 9 to 6.

All in all, the good news is that I was able to do config upgrades successfully to restore the cluster.

That said, from my experience, please note the following when removing nodes:

a) You can remove multiple seed nodes at once (I wanted to do this to try to replicate the above scenario).

b) You can add multiple nodes at once too - just be aware you may not see any activity/indication via the SF config upgrade status tooling that anything is happening... be prepared to wait at least 15 minutes (it depends on how many nodes you're adding... after all, SF is copying installation files to the nodes).

c) Sometimes, when removing one or more nodes, a node won't be successfully removed but is instead left in an Error status. If this is the case, use SF Explorer (or PowerShell) to remove node state. The status will change to Invalid. At this point, do another config upgrade, making sure the json is consistent as per the workaround in my previous comment.

That latter part really is the confusing bit that tripped me up last time. The thing to also remember is that, once you successfully remove nodes, Get-ServiceFabricClusterConfiguration will STILL return the removed nodes in the NodesToBeRemoved parameter. This will likely trip you up on any subsequent attempt to do a config upgrade. As such, I recommend you do a final config upgrade with this section completely removed.
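A quick way to check whether stale NodesToBeRemoved entries are going to trip up the next upgrade - assuming the Setup-section shape I've been seeing in the returned config:

```powershell
# Parse the returned cluster configuration and look for a lingering
# NodesToBeRemoved parameter in the Setup section
$json  = Get-ServiceFabricClusterConfiguration | ConvertFrom-Json
$setup = $json.fabricSettings | Where-Object { $_.name -eq "Setup" }
$setup.parameters | Where-Object { $_.parameterName -eq "NodesToBeRemoved" }
```

If that returns anything after the nodes are already gone, do the extra cleanup upgrade before attempting further changes.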

As a final note - if you re-add a node that has previously been removed, it may come back in a Deactivated status. Simply activate this node and all should be fine.
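i.e. (node name illustrative):

```powershell
# A re-added node may come back Deactivated; re-enable it and confirm its status
Enable-ServiceFabricNode -NodeName "vm7"
Get-ServiceFabricNode -NodeName "vm7" | Select-Object NodeName, NodeStatus
```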

masnider commented 5 years ago

@Adebeer - sorry for the delay in responding here! We should have tagged @dkkapur on this, so I missed it in my scrub until just now. Deep is best positioned to help out with these cluster topologies and layouts for standalone. What you're trying to do is definitely supported; however, we can probably make some improvements per your last message here. Overall, glad you got most of this worked out, though sorry about the difficulty.

I think we've recently been looking at some documentation and product improvements, particularly around seed node management, but I'll let Deep comment on that.

Adebeer commented 5 years ago

Ok, just an update on the above issue. Unfortunately, I've hit the same problem as before, and this time I can't recover from it.

As per my comments above, I did some testing previously and was able to remove multiple nodes without problems - even when it affected the reliability level; however, this seems to have worked only because I was adding/removing nodes from a fully functional cluster whose nodes weren't actually in some sort of SF error state.

This time, however (as per my original post that started this thread), we already had 2 nodes in error (only 1 of which is a seed node). I've tried the above approaches to remove the nodes - including manually installing SF on the nodes and removing node state (so the nodes are in the Invalid state). Still, I can't get a configuration upgrade to succeed.

The current-configuration PS command doesn't return the nodes, and the config has no "NodesToBeRemoved" section... yet:

ValidationException: Model validation error. Nodes have been removed from the JSON config but NodesToBeRemoved parameter does not contain NodeName for all removed nodes. Please ensure NodesToBeRemoved parameter is specified in Setup section inside FabricSettings and node names of all nodes which need to be removed from the cluster are mentioned.

Start-ServiceFabricClusterConfigurationUpgrade : System.Runtime.InteropServices.COMException (-2147017627)
ValidationException: Model validation error. Removing a non-seed node and changing reliability level in the same upgrade is not supported. Initiate an upgrade to remove node first and then change the reliability level.
At line:1 char:1

Adebeer commented 5 years ago

Ok... good news... I was able to get around the above... mostly luck, though, as I still believe there's a bug in the tooling.

To summarize, this is what I had to do: