Open JeffreyDevloo opened 7 years ago
@JeffreyDevloo
We are a bit clueless. All nodes were up, but it says:
Node with IP 10.100.69.121 is unreachable
Node with IP 10.100.69.122 is unreachable
Can we reproduce this when all nodes are reachable?
That is the point. The nodes were online, but the removal says they were not. The client could have been cached.
@JeffreyDevloo is this somehow related to the other SSH timeout issues you are seeing?
@wimpers I can't tell. A command failed, which means the SSHClient could connect to the node, so I suppose not. Either way, what went wrong still has to be verified.
Might this be related to the SSH caching of the connections?
No, the execution of the command actually worked. If a cached service had been killed, we would see an EOFError instead.
Problem description
Removing a master node from my cluster was successful; however, some issues got logged along the way. The one most unfamiliar to me: Failed to forget RabbitMQ cluster node
What happened
All nodes were online during the removal
Potential root of the problem
As @khenderick suggested, the SSHClient may be given a storagerouter object whose property has not been updated yet, so it assumes that the node is down.
Proposed solution
Start the removal with the IP, so that the client is rebuilt.
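A minimal sketch of the suspected failure mode and the proposed workaround. It is an illustration only and does not use the framework's real classes: `NodeRecord`, `build_client`, `probe`, `UnreachableError` and the 60-second heartbeat threshold are all hypothetical names and values. The assumption is that a client built from a model object trusts a property on that object (here a heartbeat timestamp), while a client built from a bare IP has to attempt a fresh connection.

```python
class UnreachableError(Exception):
    """Hypothetical error raised when a node is considered offline."""


class NodeRecord(object):
    """Hypothetical stand-in for a storagerouter model object."""

    def __init__(self, ip, last_heartbeat):
        self.ip = ip
        self.last_heartbeat = last_heartbeat  # may be stale if not refreshed

    def is_online(self, now, max_age=60):
        # Assumed rule: the node counts as online if its heartbeat is recent.
        return now - self.last_heartbeat <= max_age


def probe(ip, live_ips):
    """Stand-in for a real connection attempt to the node."""
    return ip in live_ips


def build_client(endpoint, live_ips, now):
    """Build a 'client' from either a NodeRecord or a plain IP string."""
    if isinstance(endpoint, NodeRecord):
        # The object's (possibly outdated) property decides, without probing.
        if not endpoint.is_online(now):
            raise UnreachableError('Node with IP {0} is unreachable'.format(endpoint.ip))
        ip = endpoint.ip
    else:
        ip = endpoint
    # A bare IP forces an actual connection attempt.
    if not probe(ip, live_ips):
        raise UnreachableError('Node with IP {0} is unreachable'.format(ip))
    return {'ip': ip}


now = 1000.0
live = {'10.100.69.121'}  # the node is really up
stale = NodeRecord('10.100.69.121', last_heartbeat=now - 300)  # property not updated

try:
    build_client(stale, live, now)
    stale_result = 'connected'
except UnreachableError as ex:
    stale_result = str(ex)

ip_result = build_client(stale.ip, live, now)
print(stale_result)
print(ip_result)
```

In this sketch the node is reported as unreachable when the stale object is passed in, while passing the IP of the same node connects. Whether the real SSHClient behaves this way still has to be verified.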
stdout output of my test:
Certain lines, such as the replying, are my own logs and can thus be ignored for this part. They indicate which steps I took during the removal.
lib.log
lib.log on the node that initiated the removal