yongshengma opened 6 years ago
That's weird. I tested ovs remove node and hit the error again:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ An unexpected error occurred: +++
+++ Command line: ['/usr/bin/ssh', 'root@192.168.2.22', 'cd', '/root', '&&', +++
+++ '/usr/bin/python2.7', '/tmp/tmp.DAUobc1IaY/deployed-rpyc.py'] +++
+++ Exit code: 255 +++
+++ Stderr: | ssh_askpass: exec(/usr/bin/ssh-askpass): No such file or directory +++
+++ | Permission denied, please try again. +++
+++ | ssh_askpass: exec(/usr/bin/ssh-askpass): No such file or directory +++
+++ | Permission denied, please try again. +++
+++ | ssh_askpass: exec(/usr/bin/ssh-askpass): No such file or directory +++
+++ | Permission denied (publickey,password). +++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
I find the node can't SSH to itself via public key authentication after I ran ovs remove node ... on it.
Before this error, I had already failed to run ovs remove node once; its output gave no details.
root@NODE-22:~# ovs remove node 192.168.2.21
+++++++++++++++++++++
+++ Remove node +++
+++++++++++++++++++++
WARNING: Some of these steps may take a very long time, please check the logs for more information
Creating SSH connections to remaining master nodes
Node with IP 192.168.2.21 successfully connected to
Node with IP 192.168.2.22 successfully connected to
Node with IP 192.168.2.24 successfully connected to
+++ Running "noderemoval - validate_removal" hooks +++
Executing alba._validate_removal
Are you sure you want to remove node NODE-21? (y/[n]): y
Do you also want to remove the ASD manager and related ASDs? (y/[n]): y
Starting removal of node NODE-21 - 192.168.2.21
Removing vPools from node
Removing vPool pool-1 from node
root@192.168.2.22's password:
+++++++++++++++++++++++++++++++++++++++
+++ An unexpected error occurred: +++
+++++++++++++++++++++++++++++++++++++++
Hi yongshengma
Let's talk about the first snippet: I have a slight suspicion that the ssh-askpass package was removed (ssh_askpass: exec(/usr/bin/ssh-askpass): No such file or directory). This is a dependency of the rpyc library that we are using. We use it to connect to a master node during the setup to distribute all public SSH keys.
Last snippet:
I've noticed root@192.168.2.22's password: in the last snippet. This indicates that the current host (NODE-22) might not be able to connect to itself for some reason, while previously it could (Node with IP 192.168.2.22 successfully connected to).
This is a curious case...
Failures during node removal and node installation are logged to /var/log/ovs/lib.log. Could you provide us the logging from /var/log/ovs/lib.log so we can get a sense of what went wrong?
Best regards
Hi JeffreyDevloo
The test env is gone and I need to set up a new one. I will replicate it and give you updates.
Best regards
I have just finished a bunch of tests with 10+ snippets. For the failed remove node operations, there are different errors in different snippets, for example in /var/log/ovs/lib.log:
- AttributeError: 'PyrakoonClient' object has no attribute 'identifier'
- No connection available to node at '192.168.2.22:26400': Unable to query node "4SC9Ub2Lx1tZJjsH" to look up master
- CalledProcessError: Command ''rabbitmq-server' '-detached'' returned non-zero exit status 1
- ArakoonNotConnected: No connection available to node at '192.168.2.22:26400'
- RuntimeError: Not all memcache nodes can be reached which is required for promoting a node.
More lines can be provided. I have to mention one point here: the tests above were run on an incomplete cluster, meaning there were only 2 nodes, or an incomplete 3rd one.
I also did some tests with properly set up 3-node and 4-node clusters. The ovs remove node runs were all successful, so it may be unnecessary to put effort into scenarios that don't meet the minimal requirements at all.
However, I'd like to go back to the snippet in the first post. It is quite fatal if the setup of the second or a later node is interrupted or aborted for some reason. As a result, a new node can't join, the troubled node can't be removed, or the whole cluster has to be reinstalled and set up again.
I opened this as a new issue because I think its scenario is different from #2149.
I finished installing and setting up 192.168.2.181 as the first storage router of the cluster. Then I installed 192.168.2.182 as the second router but had not set it up yet. Another guy was doing something weird on 192.168.2.182. Then I ran ovs setup, but it failed with an error. I found it was caused by setting mode 777 on the /root directory. This is not ovs' fault, and it was corrected later.
However, the second router's info has obviously been stored, as it shows up in the UI. I retried ovs setup on the second router (192.168.2.182), but it said this node already exists. I also tried to run ovs remove node 192.168.2.182 on the first node, but it failed with no details. So far the second router looks dangling in this cluster, and it might prevent the next router from joining.