microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/
MIT License

On-premises Service Fabric cluster remove node problem. #836

Open nembo81 opened 6 years ago

nembo81 commented 6 years ago

Hi, we have an 8-node, bronze-reliability, on-premises SF cluster (version 6.0.232.9494; the originally installed version was 5.4.164.9494). I just added a node without any issue, but when I try to remove a node (JSON config + "NodesToBeRemoved" section), the command completes and after a few minutes the node is in Error state (Down). I executed Remove-ServiceFabricNodeState and now the node is in Unknown state. Now when I run Get-ServiceFabricClusterConfiguration, the "Nodes" section is updated without the removed node, but the "NodesToBeRemoved" section is still present with the node name, and if I try to change the configuration (e.g. removing another node) it outputs the following error:

Start-ServiceFabricClusterConfigurationUpgrade : System.Runtime.InteropServices.COMException (-2147017627)
ValidationException: Model validation error. Nodes present in the cluster must exactly match the nodes in JSON config when new nodes have been added using AddNode.ps1 script. Run Get-ServiceFabricClusterConfiguration to get the most recent node list.
At line:1 char:1
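For reference, the sequence I ran was roughly this (a sketch; the node name is a placeholder):

Connect-ServiceFabricCluster
# Config upgrade with the node dropped from "Nodes" and listed under "NodesToBeRemoved":
Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath ".\ClusterConfig.json"
# After the node went into Error (Down):
Remove-ServiceFabricNodeState -NodeName "<removed node>" -Force
# Then inspected the result:
Get-ServiceFabricClusterConfiguration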

I already tried removing the "NodesToBeRemoved" section, and leaving the section with both the "old" removed node and the latest one, but the issue persists (the JSON file is correct and updated). Did I do it the right way? What should I expect from Get-ServiceFabricClusterConfiguration after a node removal? Thanks a lot.

Liphi commented 6 years ago

How did you add that node? Did you run a config upgrade, or just AddNode.ps1? I need to take a look at several JSON files: v1: before adding the node. v2: target of adding the node. v3: target of removing the node. Please remove sensitive info from the JSON before sharing it.

nembo81 commented 6 years ago

Hi there, I used AddNode.ps1, but I didn't try to remove that new node; instead I removed one of the 6 original cluster nodes. I'm sending you 2 files: 1) the JSON with all the nodes and the cluster running normally, 2) the JSON target of removing the node. Keep in mind that after a node removal the JSON is still the same as the second JSON (the node has been removed, but the "NodesToBeRemoved" section is still present).

json.zip

Liphi commented 6 years ago

The JSONs look good. I need to take a look at the trace logs. For compliance reasons, could you contact Microsoft support to upload the related trace logs covering the time range of the several upgrades?

nembo81 commented 6 years ago

Thanks for your reply. It is not a production environment, so it is not vital. Just to know: what's the correct JSON behaviour? Does the old node have to remain in the "NodesToBeRemoved" section, or does it have to vanish? Last one: was my procedure correct (JSON config + Remove-ServiceFabricNodeState)? Thanks.

Liphi commented 6 years ago

The remove-node procedure should only involve a JSON config upgrade; Remove-ServiceFabricNodeState is not needed. For the correct AddNode and RemoveNode instructions, refer to this article: https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-windows-server-add-remove-nodes
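In short, the documented flow looks like this (a sketch; the config path is a placeholder, and I'm assuming the standalone layout where NodesToBeRemoved lives under the "setup" section of fabricSettings):

# 1. Edit the config: delete the node from "Nodes", list it under "NodesToBeRemoved",
#    and increment ClusterConfigurationVersion.
# 2. Run the config upgrade and monitor it until it completes:
Connect-ServiceFabricCluster
Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath ".\ClusterConfig.json"
Get-ServiceFabricClusterConfigurationUpgradeStatus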

iandrennan commented 6 years ago

We've been working with on-premises clusters for quite some time, and our production systems are hitting this exact same issue (even after staying up to date with the releases). We are stuck until we can get this resolved; rebuilding our entire prod cluster isn't really a viable option.

rakshitatandon commented 6 years ago

@iandrennan Could you explain what issue you are facing? Is the node down after you initiated the upgrade? Did the upgrade complete? Is the node in Unknown state?

iandrennan commented 6 years ago

The nodes are uninstalled (if I log in to each of them, SF is gone, as expected), but the dashboard shows the 5 removed nodes in an "Invalid" state with a ? next to them.

rakshitatandon commented 6 years ago

@iandrennan This can happen if a node was removed out-of-band without the upgrade completing successfully. In the internal cluster manifest files you should still be able to see these nodes (even though they were actually deleted). Could you explain the exact set of commands that were used? You can raise a ticket with us, since we'd most likely need access to the logs to help mitigate the issue.
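As a quick check of what the cluster still believes exists, something like this helps (a sketch; the manifest layout shown is what a standalone cluster typically has):

Connect-ServiceFabricCluster
Get-ServiceFabricNode | Format-Table NodeName, NodeStatus, HealthState
# The cluster manifest comes back as an XML string; inspect its node list:
[xml]$manifest = Get-ServiceFabricClusterManifest
$manifest.ClusterManifest.Infrastructure.WindowsServer.NodeList.Node |
    Select-Object NodeName, IPAddressOrFQDN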

kms254 commented 6 years ago

@iandrennan I had the same problem as you. I figured out a fix that I was willing to try on our QA cluster and it paid off.

Initial removal: I removed a node, and that node no longer showed up in the ClusterConfig.json or the ClusterManifest.xml, but it was still present in the Explorer interface. It was listed as Invalid.

The Fix: What I ended up doing was digging into every single active node's InfrastructureManifest.xml file.

Navigate to your FabricDataRoot directory and then go to the following path: nodename\Fabric\Fabric.Data

In that directory there is a file called InfrastructureManifest.xml. I made a backup and then turned off read-only. Then I edited the file as an admin (I don't think they really want you editing this file). This file was the only place I could find that still had a reference to the node I removed.

I removed that node from the XML. After removing it on every active node, it was gone from the Explorer.
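Per node, the edit looks roughly like this (a sketch; the FabricDataRoot path and node names are examples, adjust them for your install):

$path = "D:\SF\<nodename>\Fabric\Fabric.Data\InfrastructureManifest.xml"
# Back up first, then clear the read-only flag so the file can be saved:
Copy-Item $path "$path.bak"
Set-ItemProperty -Path $path -Name IsReadOnly -Value $false
[xml]$xml = Get-Content $path
$stale = $xml.InfrastructureInformation.NodeList.Node | Where-Object { $_.NodeName -eq "<removed node>" }
if ($stale) { [void]$xml.InfrastructureInformation.NodeList.RemoveChild($stale); $xml.Save($path) }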

Angelicvorian commented 5 years ago

What @kms254 has described works fine. However, for managing large clusters this is simply not practical. We are in the early stages of implementing this on-premises using automation, and being able to reliably add and remove nodes is essential. We had the same issue mentioned above on non-seed-node-type nodes.

This urgently needs looking at, and reliable, repeatable documentation needs creating around it.

dkkapur commented 5 years ago

@Angelicvorian fair - the documentation currently has steps that should work if followed explicitly. I'll take a stab at making it clearer this week and share an update. If you ran into the same issue as others did earlier, could you walk me through the steps you took to remove the node?

Angelicvorian commented 5 years ago

@dkkapur Sorry it's been a while since my last response, I've been a little busy. So here's the process that works to remove the node. I've just run through this, so it's fresh.

These are the steps that seem to work:

  1. As per the original MS article, remove the node from the JSON file, add the "NodesToBeRemoved" section to the config file, and increment the version.

  2. Run a cluster configuration upgrade using the newly updated config file.

  3. Wait for it to roll back (which it will, as it sees the removal failing). Note that this can take some time, 15-20 minutes in the case of my small 8-node cluster. It still removes SF from the node in question, so you can see it's completing some tasks.

  4. Remove the entry for the node in question from the InfrastructureManifest.xml file that resides on every remaining node in the cluster. It's a read-only file by default, so permissions need changing before editing it. Once that's done the cluster updates and all remaining nodes go green, but the removed node still shows as red.

  5. Run Remove-ServiceFabricNodeState -NodeName <NodeName> -Force. That removes the last of the node info from the cluster. The cluster will then go green and you can resume normal operations.

This is the current workaround. I've submitted the logs MS support requested today, so we'll see if they can make any recommendations.
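Condensed, the command side of the above looks like this (file and node names are placeholders):

Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath ".\ClusterConfig.v2.json"
# Watch it roll back (15-20 minutes on my 8-node cluster):
Get-ServiceFabricClusterConfigurationUpgradeStatus
# ...edit InfrastructureManifest.xml on every remaining node...
Remove-ServiceFabricNodeState -NodeName "<removed node>" -Force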

jkochhar commented 5 years ago

@Angelicvorian When you tried removing the node, was the node reachable (can you ping it)? Also, how many total nodes did you have when you removed the node?

Angelicvorian commented 5 years ago

So many clusters since this issue :) I think it was a 6- or 8-node cluster. The node is reachable afterwards (ping, RDP, everything else is fine), but the Service Fabric installation is no longer on the machine, hence why it's failing. So the removal of the node removed the install, service, and config for SF, but it didn't remove it from the cluster manifest.

jkochhar commented 5 years ago

Thanks. The reason I asked is that it helps narrow things down. The thread is long and can contain multiple issues. For your case, I believe it's an issue I looked at recently, so if you can try the following, that will help -

The likely issue is that the upgrade fails because of a health check failure (a code bug), and by passing the MaxPercent* values you can circumvent that.
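For example (a sketch; the values are deliberately permissive and should be tuned for your cluster):

Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath ".\ClusterConfig.json" `
    -MaxPercentUnhealthyNodes 100 `
    -MaxPercentUnhealthyApplications 100 `
    -MaxPercentDeltaUnhealthyNodes 100 `
    -MaxPercentUpgradeDomainDeltaUnhealthyNodes 100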

julioas09 commented 5 years ago

I had the same problem today. Indeed @Angelicvorian's solution worked, but it's veeery tedious. In my case I tried to remove a node (JSON config + "NodesToBeRemoved" section); the command completed, and after a few minutes the node was in Error state (Down).

If I do a Get-ServiceFabricNode, the removed node shows as:

NodeName             : V-xxx-PRESFBE09
NodeId               : 38879cd3c5xxxxxxxxd1db26a0506
NodeInstanceId       : 131872xxxxxx109
NodeType             : Backend
NodeStatus           : Down
NodeDownTime         : 00:02:24
NodeDownAt           : 21/11/2018 17:10:58
HealthState          : Error
CodeVersion          : 6.3.187.9494
ConfigVersion        : 9
IsSeedNode           : False
IpAddressOrFQDN      : V-xxx-PRESFBE09
FaultDomain          : fd:/rack1
UpgradeDomain        : UD3
NodeDeactivationInfo : EffectiveIntent : RemoveNode
                       Status : Completed

                       TaskType : Client
                       TaskId : 38879cd3c53xxxxxxd6d1db26a0506
                       Intent : RemoveNode

IsStopped            : False

After applying @Angelicvorian's workaround it is no longer showing in the Explorer...

Regards!

jkochhar commented 5 years ago

@julioas09 Was your node reachable when you attempted the node removal?

julioas09 commented 5 years ago

Yeah, it was reachable.

jkochhar commented 5 years ago

Thanks. Are you able to share logs from the time the issue happened? What is the number of nodes in your cluster?

Adebeer commented 5 years ago

I've got an issue with an on-premises gMSA Service Fabric cluster (6.4.622.9590): I cannot remove invalid nodes due to a model validation error. Specifically, we have an 18-node cluster with 9 seed nodes, and I'm having to remove 6 nodes (3 of them seed nodes).

I'm trying to follow the MS docs recommendation of doing a config upgrade to remove nodes; however, this didn't work due to model validation errors, basically failing with the ValidationException stating that nodes had been removed without updating NodesToBeRemoved. To get around this error, I ended up removing node state from all offline nodes. I also uninstalled SF on all affected nodes.
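Roughly, clearing the node state looked like this:

Get-ServiceFabricNode |
    Where-Object { $_.NodeStatus -eq "Down" } |
    ForEach-Object { Remove-ServiceFabricNodeState -NodeName $_.NodeName -Force }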

So, after getting all 6 nodes into the Invalid state, I was able to resolve the above validation exception... only to be presented with the next validation error:

Start-ServiceFabricClusterConfigurationUpgrade : System.Runtime.InteropServices.COMException (-2147017627)
ValidationException: Model validation error. Removing a non-seed node and changing reliability level in the same upgrade is not supported. Initiate an upgrade to remove node first and then change the reliability level.
At line:1 char:1
+ Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath "AL ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (Microsoft.Servi...usterConnection:ClusterConnection) [Start-ServiceFabricClusterConfigurationUpgrade], FabricException
    + FullyQualifiedErrorId : StartClusterConfigurationUpgradeErrorId,Microsoft.ServiceFabric.Powershell.StartClusterConfigurationUpgrade

I can't see any way to change the reliability level. I recall this being configurable in earlier versions of SF, but it has since been changed and is dynamically calculated. I can't see any reference to the reliability level in the JSON or XML cluster configuration. I'm guessing that since I used to have 9 seed nodes the cluster was Gold, and now, going to 6, it's changing to Silver. In any case, this doesn't help me because I can't control it.

How do I get around this? It is blocking me from doing any configuration upgrades on the SF cluster.

Adebeer commented 5 years ago

So just to be clear - I'm stuck unable to do config upgrades because of validation errors.

Specifically:

a) Get-ServiceFabricClusterConfiguration already does not return the nodes I want to remove.
b) If I make no change to the JSON file other than incrementing the config version, validation fails with NodesToBeRemoved not being specified.
c) If I add 1 node to NodesToBeRemoved, I still get the above validation error.
d) If I add all the nodes, I get the reliability-level upgrade validation error.

gperrego commented 5 years ago

This is definitely a bug somewhere; we spent many hours on this problem yesterday. In my case I was removing nodes and re-adding them to change the node type.

When the new node is being added, something on the nodes is not updated. The workaround is to run Start-ServiceFabricClusterConfigurationUpgrade with no changes. By that I mean that if you are trying to remove 3 nodes and re-add them, you need to:

  1. Remove the node
    • JSON setup: add the "remove nodes" section and remove the node from the nodes list
    • PowerShell: Start-ServiceFabricClusterConfigurationUpgrade
  2. Add the new node
    • Run AddNode.ps1
  3. Run Start-ServiceFabricClusterConfigurationUpgrade a second time to get the cluster stable again
    • JSON setup: run Get-ServiceFabricClusterConfiguration and copy/paste the nodes into the JSON
    • JSON setup: remove the "remove nodes" section from the config
    • PowerShell: Start-ServiceFabricClusterConfigurationUpgrade
  4. Remove the next node by repeating step 1

The fact is that you must run step 3. I don't know exactly what is cleaned up when it runs, beyond the fact that the "remove nodes" section no longer appears in the results of Get-ServiceFabricClusterConfiguration. I'm not sure whether that is the problem or whether it's something else that gets fixed. I did look at the InfrastructureManifest.xml, and at least once it was off; after running step 3 it was correct.
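In command form, step 3 is roughly this (file names are placeholders):

# Capture the cluster's current view of the nodes:
Get-ServiceFabricClusterConfiguration | Out-File ".\CurrentConfig.json"
# Paste that "Nodes" list into your config, delete the "remove nodes" section,
# bump ClusterConfigurationVersion, then run the no-op upgrade:
Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath ".\ClusterConfig.v3.json"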

I hope this helps someone,

Thanks, Greg

steve-lu commented 5 years ago

I also ran into the same problem when removing a node from an existing cluster. I had to run a few upgrades to get the node completely removed, without manually editing the InfrastructureManifest.xml file on each node as @kms254 and @Angelicvorian did to work around the issue. Here are my steps:

  1. Follow the instructions (https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-windows-server-add-remove-nodes) to remove the node from the "Nodes" section and also add it to the NodesToBeRemoved setting. Run Start-ServiceFabricClusterConfigurationUpgrade. Sometimes it goes well, but sometimes it ends up with the node Down in Error state in the Explorer UI console. If so, continue with the following steps to remove the Down node.

  2. If the node is Down in Error state, you can select "Remove node state" from the Explorer UI console; this will place the node into the Invalid state.

  3. Use Get-ServiceFabricClusterConfiguration to get the latest settings and save them as a new JSON file. You will see the errored node still in the Nodes section, and the NodesToBeRemoved setting also contains it. Remove the NodesToBeRemoved setting but keep the node in the Nodes section, and increase the ClusterConfigurationVersion. Then run Start-ServiceFabricClusterConfigurationUpgrade again. This time the upgrade should go through, which brings the cluster settings in sync.

  4. Put the node back into the NodesToBeRemoved setting, remove the node from the Nodes section, and increase the version. Run Start-ServiceFabricClusterConfigurationUpgrade again; this time the upgrade should succeed and the node will be removed from the Explorer UI console.
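In command form, steps 3 and 4 are two back-to-back upgrades (file names are placeholders):

# Step 3: keep the node in "Nodes", drop "NodesToBeRemoved", bump the version:
Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath ".\config.resync.json"
# Step 4: drop the node from "Nodes", restore "NodesToBeRemoved", bump the version again:
Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath ".\config.remove.json"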

Adebeer commented 5 years ago

Have a look at #1469.

From my personal experience, most of the confusion has been due to misleading error/warning messages that stem from the NodesToBeRemoved setting.

anastasiosyal commented 4 years ago

I can confirm that the workaround by @kms254 works.

Below is a script that can get you going if you want to automate this on a large cluster.

As long as all your servers keep their Fabric.Data folder in a location where you can swap in the node name to build the path, and that path is reachable via UNC from your account, you should be able to modify the script below to suit your needs.

$nodeToRemove = "webserver1"

# Walk every node in the current cluster config and scrub the removed node
# from that node's InfrastructureManifest.xml.
$sfConf = Get-ServiceFabricClusterConfiguration | ConvertFrom-Json
$sfConf.Nodes | ForEach-Object {
    $targetNode = $_.IPAddress
    $nodeName   = $_.NodeName
    $targetPath = "\\$targetNode\e$\SF\$nodeName\Fabric\Fabric.Data"
    $InfraXmlPath = "$targetPath\InfrastructureManifest.xml"
    $InfraXmlBackupPath = "$targetPath\InfrastructureManifest.backup.xml"

    # Back up the manifest before touching it.
    echo "Saving backup on $nodeName"
    Copy-Item $InfraXmlPath $InfraXmlBackupPath -Force

    [xml]$xdoc = Get-Content $InfraXmlPath

    $node = $xdoc.InfrastructureInformation.NodeList.Node | Where-Object { $_.NodeName -eq $nodeToRemove }
    if ($node -ne $null)
    {
        # Drop the stale entry, clear the read-only flag, save, then restore the flag.
        [void]$xdoc.InfrastructureInformation.NodeList.RemoveChild($node)
        echo "Saving Manifest on $nodeName"
        Set-ItemProperty -Path $InfraXmlPath -Name IsReadOnly -Value $false
        $xdoc.Save($InfraXmlPath)
        Set-ItemProperty -Path $InfraXmlPath -Name IsReadOnly -Value $true
    }
    else
    {
        echo "$nodeToRemove not found in Manifest on $nodeName"
    }
}

julioas09 commented 4 years ago

Just had this problem again today.

Still, the original problem is that removing a node through the standard process does not go smoothly :(