nembo81 opened 6 years ago
How did you add that node? Did you run a config upgrade, or just AddNode.ps1? I need to take a look at three JSON files: v1, from before adding the node; v2, the target of adding the node; and v3, the target of removing the node. Please remove sensitive info from the JSON before sharing it.
Hi there, I used AddNode.ps1, but I didn't try to remove that new node; instead, I removed one of the 6 original cluster nodes. I'm sending you 2 files: 1) the JSON with all the nodes and the cluster running normally, and 2) the JSON target of removing the node. Keep in mind that after a node removal, the JSON is still the same as the second JSON (the node has been removed, but the "NodesToBeRemoved" section is still present).
The JSONs look good. I need to take a look at the trace logs. Due to compliance concerns, could you contact Microsoft support to upload the related trace logs covering the time range of the several upgrades?
Thanks for your reply. It is not a production environment, so it isn't vital. Just so I know, what's the correct JSON behaviour? Does the old node have to remain in the "NodesToBeRemoved" section, or does it have to vanish? Last one: was my procedure correct? (JSON config upgrade + Remove-ServiceFabricNodeState). Thanks.
The remove-node procedure should only involve a JSON config upgrade; Remove-ServiceFabricNodeState is not needed. For the correct AddNode and RemoveNode instructions, refer to this article: https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-windows-server-add-remove-nodes
We've been working with On-Prem for quite some time and our production systems are having this exact same issue (even after staying up to date with the releases). We are stuck until we can get this resolved. Rebuilding our entire prod cluster isn't really a viable option.
@iandrennan Could you explain what issue you are facing? Is the node down after you initiated the upgrade? Did the upgrade complete? Is the node in the Unknown state?
The nodes are uninstalled (if I log in to each of them, SF is gone as expected), but the dashboard shows the 5 removed nodes in the "Invalid" state with a ? next to them.
@iandrennan This could happen if the node was removed out-of-band without the upgrade successfully completing. In the internal cluster manifest files you should still be able to see these nodes (even though they were actually deleted). Could you explain the exact set of commands that were used? You can raise a ticket with us, since we'd most likely need access to the logs to help mitigate the issue.
@iandrennan I had the same problem as you. I figured out a fix that I was willing to try on our QA cluster and it paid off.
Initial removal: I removed a node, and that node no longer showed up in the clusterconfig.json or the clustermanifest.xml, but it was still present in the Explorer interface. It was listed as Invalid.
The Fix: What I ended up doing was digging into every single active node's InfrastructureManifest.xml file.
Navigate to your FabricDataRoot directory and then go to the following path: nodename\Fabric\Fabric.Data
In that directory there is a file called InfrastructureManifest.xml. I made a backup and then turned off read-only. Then I edited the file as an admin (I don't think they really want you editing this file). This file was the only place I could find that had a reference to the node I removed.
I removed that node from the xml. After removing it on every active node it was gone from the explorer.
What @kms254 has described works fine. However, for managing large clusters this is simply not practical. We are in the early stages of implementing this on-premises using automation, and being able to reliably add and remove nodes is essential. I had the same issue mentioned above on non-seed-node type nodes.
This needs looking at urgently, with reliable and repeatable documentation created around it.
@Angelicvorian fair point; the documentation currently has steps that should work if followed explicitly. I'll take a stab at making it clearer this week and share an update. If you ran into the same issue as others did earlier, could you walk me through the steps you took to remove the node?
@dkkapur Sorry it's been a while since my last response; I've been a little busy. So here's the process that works to remove the node. I've just run through this, so it's fresh.
These are the steps that seem to work.
As per the original MS article, remove the node from the JSON file, create the NodesToBeRemoved section in the config file, and increment the version.
Run a cluster configuration upgrade using the newly updated config file.
Wait for it to roll back (which it will do, as it sees the removal failing). Note that this can take some time, 15-20 minutes in the case of my small 8-node cluster. It still removes SF from the node in question, so you can see it's completing some tasks.
Next, we need to remove the entry for the node in question from the InfrastructureManifest.xml file that resides on every remaining node in the cluster. It's a read-only file by default, so permissions need changing before editing it.
Once that's done, the cluster updates and all remaining nodes go green, but the removed node still shows as red.
Run Remove-ServiceFabricNodeState -NodeName, passing the name of the removed node.
That will then remove the last of the node info in the cluster. The cluster will now go green and you can resume normal operations.
This is the current work around. I've submitted logs requested by MS support today so we'll see if they can make any recommendations.
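The steps above can be sketched roughly as follows. This is a hedged sketch, not an official procedure: the config path and the node name "N6" are placeholders, and it assumes the ClusterConfig.json has already been edited as in step 1 (node removed from "Nodes", listed under NodesToBeRemoved, version incremented) and that InfrastructureManifest.xml has been cleaned up on the remaining nodes as in step 4.

# Sketch only -- assumes an already-edited ClusterConfig.json and a reachable cluster.
Connect-ServiceFabricCluster -ConnectionEndpoint "localhost:19000"

# Step 2: kick off the configuration upgrade with the edited config file.
Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath ".\ClusterConfig.json"

# Step 3: poll until the upgrade finishes (it may roll back, per the notes above).
Get-ServiceFabricClusterUpgrade

# Step 6 (after editing InfrastructureManifest.xml on every remaining node):
# clear the stale node record. "N6" is a placeholder node name.
Remove-ServiceFabricNodeState -NodeName "N6"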
@Angelicvorian When you tried removing the node, was the node reachable (can you ping it)? Also, how many total nodes did you have when you removed the node?
So many clusters since this issue :) I think it was a 6- or 8-node cluster. The node is reachable afterwards (ping, RDP, everything else is fine), but the Service Fabric installation is no longer on the machine, hence why it's failing. So the removal of the node removed the install, service, and config for SF, but it didn't remove it from the cluster manifest.
Thanks. The reason I asked is that it helps narrow things down; the thread is long and can contain multiple issues. For your case, I believe it's an issue I looked at recently, so if you can try the following, that will help:
The possible issue is that the upgrade fails because of a health check failure (a code bug), and by passing the MaxPercent* values you can circumvent that.
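For reference, a hedged example of what passing the MaxPercent* health-check knobs might look like. The config path is a placeholder, and the values of 100 are purely illustrative (they effectively disable those health checks), not a recommendation:

# Illustrative only: relax the health checks so a transient health-check
# failure does not roll back the remove-node configuration upgrade.
Start-ServiceFabricClusterConfigurationUpgrade `
    -ClusterConfigPath ".\ClusterConfig.json" `
    -MaxPercentUnhealthyNodes 100 `
    -MaxPercentDeltaUnhealthyNodes 100 `
    -MaxPercentUpgradeDomainDeltaUnhealthyNodes 100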
I had the same problem today. Indeed, @Angelicvorian's solution worked, but it's veeery tedious. In my case, I tried to remove a node (JSON config + "NodesToBeRemoved" section); the command completes, and after a few minutes the node is in the Error state (Down).
If I do a Get-ServiceFabricNode, the removed node shows as:
NodeName : V-xxx-PRESFBE09
NodeId : 38879cd3c5xxxxxxxxd1db26a0506
NodeInstanceId : 131872xxxxxx109
NodeType : Backend
NodeStatus : Down
NodeDownTime : 00:02:24
NodeDownAt : 21/11/2018 17:10:58
HealthState : Error
CodeVersion : 6.3.187.9494
ConfigVersion : 9
IsSeedNode : False
IpAddressOrFQDN : V-xxx-PRESFBE09
FaultDomain : fd:/rack1
UpgradeDomain : UD3
NodeDeactivationInfo : EffectiveIntent : RemoveNode
Status : Completed
TaskType : Client
TaskId : 38879cd3c53xxxxxxd6d1db26a0506
Intent : RemoveNode
IsStopped : False
After applying @Angelicvorian's workaround, it is no longer showing in the explorer...
Regards!
@julioas09 Was your node reachable when you attempted a remove node?
Yeah, it was reachable.
Thanks. Are you able to share logs from the time the issue happened? What is the number of nodes in your cluster?
I've got an issue with an on-premises gMSA Service Fabric cluster (6.4.622.9590) in that I cannot remove invalid nodes due to a model validation error. Specifically, we have an 18-node cluster with 9 seed nodes, and I'm having to remove 6 nodes (3 of them seed nodes).
I'm trying to follow the MS docs' recommendation of doing a config upgrade to remove nodes. However, this didn't work due to model validation errors, basically failing with a ValidationException stating that nodes had been removed without updating NodesToBeRemoved. To get around this error, I ended up removing node state from all offline nodes. I also uninstalled SF on all affected nodes.
So, after getting all 6 nodes into the Invalid state, I was able to resolve the above validation exception... only to be presented with the next validation error:
Start-ServiceFabricClusterConfigurationUpgrade : System.Runtime.InteropServices.COMException (-2147017627)
ValidationException: Model validation error. Removing a non-seed node and changing reliability level in the same
upgrade is not supported. Initiate an upgrade to remove node first and then change the reliability level.
At line:1 char:1
+ Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath "AL ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (Microsoft.Servi...usterConnection:ClusterConnection) [Start-ServiceFa...gurationUpgrade], FabricException
+ FullyQualifiedErrorId : StartClusterConfigurationUpgradeErrorId,Microsoft.ServiceFabric.Powershell.StartClusterConfigurationUpgrade
I can't see any way to change the reliability level. I recall this being configurable in earlier versions of SF, but it has since been changed to be dynamically calculated. I can't see any reference to the reliability level in the JSON or XML cluster configuration. I'm guessing that since I used to have 9 seed nodes the cluster was Gold, and now, going to 6, it's changing to Silver. In any case, this doesn't help me, because I can't control it.
How do I get around this?? This is blocking me from doing any configuration upgrades on SF cluster.
So just to be clear - I'm stuck unable to do config upgrades because of validation errors.
Specifically:
a) Get-ServiceFabricClusterConfiguration
already does not return the nodes I want to remove
b) If I make no change to the JSON file other than incrementing the config version, validation fails with NodesToBeRemoved not being specified
c) If I add 1 node to NodesToBeRemoved, I still get above validation error
d) If I add all nodes - I get the reliability level upgrade validation error
This is definitely a bug somewhere; we spent many hours on this problem yesterday. In my case, I was removing nodes and re-adding them to change the node type.
When the new node is being added, it is not updating something on the nodes. The workaround is to run Start-ServiceFabricClusterConfigurationUpgrade with no changes. By that I mean that if you are trying to remove 3 nodes and re-add them, you need to: 1) run a config upgrade to remove the nodes, 2) run a config upgrade to re-add them, and 3) run one more config upgrade with no changes other than a new version.
The fact is, you must run step 3. I don't know exactly what is cleaned up when it runs, beyond the fact that the "remove nodes" section is no longer in the results of Get-ServiceFabricClusterConfiguration. I'm not sure whether that is the problem or whether it's something else that gets fixed. I did look at the InfrastructureManifest.xml, and at least once it was off; after running step 3 it was correct.
I hope this helps someone,
Thanks, Greg
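The "no-change" upgrade Greg describes could be sketched like this. This is an assumption-laden sketch: the version string "2.0.1" and the output file name are placeholders, and it assumes the current config is fetched fresh and pushed back with only the version bumped.

# Sketch: pull the live config, bump only the version, and push it back
# otherwise unchanged so the cluster reconciles its internal state.
$config = Get-ServiceFabricClusterConfiguration | ConvertFrom-Json
$config.ClusterConfigurationVersion = "2.0.1"   # placeholder: any new, higher version string
$config | ConvertTo-Json -Depth 32 | Out-File ".\ClusterConfig.noop.json" -Encoding utf8
Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath ".\ClusterConfig.noop.json"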
I also hit the same problem when removing a node from an existing cluster. I had to run a few upgrades to get the node completely removed, but without manually editing the InfrastructureManifest.xml file on each node as @kms254 and @Angelicvorian did to work around the issue. Here are my steps.
Follow the instructions (https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-windows-server-add-remove-nodes) to remove the node from the "Nodes" section, and also add it to the NodesToBeRemoved setting. Run Start-ServiceFabricClusterConfigurationUpgrade. Sometimes it goes well, but sometimes it ends up with the node Down in the Error state in the Explorer UI console. If so, continue with the following steps to remove the Down node.
If the node is in the Down/Error state, you can select "Remove node state" from the Explorer UI console; this will place the node into the Invalid state.
Use Get-ServiceFabricClusterConfiguration to get the latest settings and save them as a new JSON file. You will see the errored node still in the Nodes section, and the NodesToBeRemoved setting also contains it. Remove the NodesToBeRemoved setting but keep the node in the Nodes section, and increase the ClusterConfigurationVersion. Then run Start-ServiceFabricClusterConfigurationUpgrade again. This time the upgrade should go fine and will bring the cluster settings back in sync.
Put the node back into the NodesToBeRemoved setting, remove it from the Nodes section, and increase the version. Run Start-ServiceFabricClusterConfigurationUpgrade again; this time the upgrade should be successful, and the node will be removed from the Explorer UI console.
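For reference, the NodesToBeRemoved setting lives under the "Setup" section of fabricSettings in the standalone ClusterConfig.json. A heavily trimmed fragment, with placeholder node names and values, looks roughly like this:

{
  "clusterConfigurationVersion": "2.0.0",
  "nodes": [
    {
      "nodeName": "vm0",
      "iPAddress": "10.0.0.4",
      "nodeTypeRef": "NodeType0",
      "faultDomain": "fd:/dc1/r0",
      "upgradeDomain": "UD0"
    }
  ],
  "properties": {
    "fabricSettings": [
      {
        "name": "Setup",
        "parameters": [
          { "name": "NodesToBeRemoved", "value": "vm5" }
        ]
      }
    ]
  }
}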
Have a look at 1469
From my personal experience, most of the confusion has been due to misleading error/warning messages that stem from the NodesToBeRemoved setting.
I can confirm that the workaround by @kms254 works.
Below is a script that can get you going if you want to automate this on a large cluster.
As long as all your servers keep their Fabric.Data folder in a location where you can swap in the node name to build the path, and that path can be reached via UNC from your account, you should be able to modify the script below to suit your needs.
$nodeToRemove = "webserver1"
$sfConf = Get-ServiceFabricClusterConfiguration | ConvertFrom-Json
$sfConf.Nodes | ForEach-Object {
    $targetNode = $_.IPAddress
    $nodeName = $_.NodeName
    $targetPath = "\\$targetNode\e$\SF\$nodeName\Fabric\Fabric.Data"
    $InfraXmlPath = "$targetPath\InfrastructureManifest.xml"
    $InfraXmlBackupPath = "$targetPath\InfrastructureManifest.backup.xml"

    # Back up the manifest before touching it.
    Write-Output "Saving backup on $nodeName"
    Copy-Item $InfraXmlPath $InfraXmlBackupPath -Force

    [xml]$xdoc = Get-Content $InfraXmlPath
    $node = $xdoc.InfrastructureInformation.NodeList.Node | Where-Object { $_.NodeName -eq $nodeToRemove }
    if ($null -ne $node) {
        # Drop the stale node entry and save (the file is read-only by default).
        [void]$xdoc.InfrastructureInformation.NodeList.RemoveChild($node)
        Write-Output "Saving manifest on $nodeName"
        Set-ItemProperty -Path $InfraXmlPath -Name IsReadOnly -Value $false
        $xdoc.Save($InfraXmlPath)
        Set-ItemProperty -Path $InfraXmlPath -Name IsReadOnly -Value $true
    }
    else {
        Write-Output "$nodeToRemove not found in manifest on $nodeName"
    }
}
Just had this problem again today:
Still, the original problem was that the removal of the node through the standard process didn't go smoothly :(
Hi, we have an 8-node Bronze-level on-premises SF cluster (version 6.0.232.9494; the originally installed version was 5.4.164.9494). I just added a node without any issue, but when I try to remove a node (JSON config + "NodesToBeRemoved" section), the command completes and after a few minutes the node is in the Error state (Down). I executed Remove-ServiceFabricNodeState, and now the node is in the Unknown state. Now when I run Get-ServiceFabricClusterConfiguration, the Nodes section is updated without the removed node, but the "NodesToBeRemoved" section is still present with the node name, and if I try to change the configuration (e.g. removing another node) it outputs the following error:
Start-ServiceFabricClusterConfigurationUpgrade : System.Runtime.InteropServices.COMException (-2147017627)
ValidationException: Model validation error. Nodes present in the cluster must exactly match the nodes in JSON config when new nodes have been added using AddNode.ps1 script. Run Get-ServiceFabricClusterConfiguration to get the most recent node list.
At line:1 char:1
I already tried removing the "NodesToBeRemoved" section, and also leaving the section in place with the "old" removed node, but the issue persists (the JSON file is correct and updated). Did I do it the right way? What should I expect from Get-ServiceFabricClusterConfiguration after a node removal? Thanks a lot.