can't remove a router if its setup aborts unexpectedly

yongshengma commented 6 years ago

I am opening this as a new issue because I think the scenario is different from #2149.

I finished installing and setting up 192.168.2.181 as the first storage router of the cluster. Then I installed 192.168.2.182 as the second router but had not set it up yet. In the meantime, someone else made some unexpected changes on 192.168.2.182. When I then ran ovs setup, it failed with this error:

Configuring/updating model
root@192.168.2.182's password: 
root@192.168.2.182's password: 
root@192.168.2.182's password: 
ERROR: Failed to setup extra node
ERROR: Command line: [u'/usr/bin/ssh', u'root@192.168.2.182', u'cd', u'/root', u'&&', u'/usr/bin/python2.7', u'/root/tmp.NMakYxqDhg/deployed-rpyc.py']
Exit code: 255
Stderr:  | ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
         | Permission denied, please try again.
         | ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
         | Permission denied, please try again.
         | ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
         | Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++  An unexpected error occurred:                                                    +++
+++  Command line: [u'/usr/bin/ssh', u'root@192.168.2.182', u'cd', u'/root', u'&&',   +++
+++  u'/usr/bin/python2.7', u'/root/tmp.NMakYxqDhg/deployed-rpyc.py']                 +++
+++  Exit code: 255                                                                   +++
+++  Stderr:  | ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or  +++
+++  directory                                                                        +++
+++           | Permission denied, please try again.                                  +++
+++  | ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or           +++
+++  directory                                                                        +++
+++           | Permission denied, please try again.                                  +++
+++  | ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or           +++
+++  directory                                                                        +++
+++           | Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).  +++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

I found that it was caused by someone setting mode 777 on the /root directory. This is not ovs' fault, and the permissions were corrected later.
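For context: with sshd's default `StrictModes yes` setting, public key authentication is refused when the user's home directory or key files are writable by group or others, which would be consistent with the password prompts above. A minimal sketch for checking this, assuming that rule is what applied here. The helper names `is_too_permissive` and `check_paths` are hypothetical and not part of ovs, and the sketch checks write bits only, not file ownership:

```python
import os
import stat


def is_too_permissive(path):
    """Return True if path is writable by group or others (mode & 0o022)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & (stat.S_IWGRP | stat.S_IWOTH))


def check_paths(paths):
    """Map each existing path to whether its mode is group/world-writable."""
    return {p: is_too_permissive(p) for p in paths if os.path.exists(p)}


def ssh_related_paths(home):
    """Paths under a home directory whose modes matter for key-based login."""
    return [
        home,
        os.path.join(home, ".ssh"),
        os.path.join(home, ".ssh", "authorized_keys"),
    ]
```

Calling `check_paths(ssh_related_paths("/root"))` on the affected node would flag `/root` while it still has mode 777.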

However, the second router's info has obviously been stored already, because it shows up in the UI. I retried ovs setup on the second router (192.168.2.182), but it said this node already exists. I also tried to run ovs remove node 192.168.2.182 on the first node, but it failed without giving any details:

[root@test-1 ~]# ovs remove node 192.168.2.182
+++++++++++++++++++++
+++  Remove node  +++
+++++++++++++++++++++
WARNING: Some of these steps may take a very long time, please check the logs for more information

Creating SSH connections to remaining master nodes
  * Node with IP 192.168.2.181  - Successfully connected
  * Node with IP 192.168.2.182  - Successfully connected

+++ Running "noderemoval - validate_removal" hooks +++

Executing alba._validate_removal
Are you sure you want to remove node test-2? (y/[n]): y
Starting removal of node test-2 - 192.168.2.182
  Removing vPools from node
Stopping and removing services
Removing services
Removing service workers
Removing service support-agent
Removing service watcher-framework
Removing service watcher-config

+++ Running "noderemoval - remove" hooks +++

Executing storagedriver._on_remove
Executing alba._on_remove
Removing node from model
  [192.168.2.181] watcher-framework stopped
  [192.168.2.181] memcached restarted
  [192.168.2.181] watcher-framework started
  [192.168.2.181] support-agent restarted

+++++++++++++++++++++++++++++++++++++++
+++  An unexpected error occurred:  +++
+++++++++++++++++++++++++++++++++++++++

At this point the second router is left dangling in the cluster, and it might prevent the next router from joining.

yongshengma commented 6 years ago

That's weird. I tested ovs remove node and hit the error again:

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++  An unexpected error occurred:                                                  +++
+++  Command line: ['/usr/bin/ssh', 'root@192.168.2.22', 'cd', '/root', '&&',       +++
+++  '/usr/bin/python2.7', '/tmp/tmp.DAUobc1IaY/deployed-rpyc.py']                  +++
+++  Exit code: 255                                                                 +++
+++  Stderr:  | ssh_askpass: exec(/usr/bin/ssh-askpass): No such file or directory  +++
+++           | Permission denied, please try again.                                +++
+++           | ssh_askpass: exec(/usr/bin/ssh-askpass): No such file or directory  +++
+++           | Permission denied, please try again.                                +++
+++           | ssh_askpass: exec(/usr/bin/ssh-askpass): No such file or directory  +++
+++           | Permission denied (publickey,password).                             +++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

I found that the node can no longer ssh to itself with public key authentication after I ran ovs remove node ... on it.
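One way to confirm this symptom without being asked for a password is to run ssh in batch mode, restricted to public key authentication. A minimal sketch, assuming the OpenSSH client is installed; the function names are hypothetical and not part of ovs:

```python
import subprocess


def build_key_only_ssh_command(host, user="root"):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",  # never prompt interactively
        "-o", "PreferredAuthentications=publickey",
        "-o", "ConnectTimeout=5",
        "%s@%s" % (user, host),
        "true",  # trivial remote command; only the exit code matters
    ]


def can_ssh_with_key(host, user="root"):
    """Return True if key-based login to host succeeds, False otherwise."""
    try:
        return subprocess.call(build_key_only_ssh_command(host, user)) == 0
    except OSError:
        # The ssh binary could not be executed at all.
        return False
```

Running `can_ssh_with_key("192.168.2.22")` on NODE-22 itself would return False in the state described above, since ssh exits with a non-zero code when key authentication is denied.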

Before this error, ovs remove node had already failed once, and that time the output gave no details:

root@NODE-22:~# ovs remove node 192.168.2.21
+++++++++++++++++++++
+++  Remove node  +++
+++++++++++++++++++++
WARNING: Some of these steps may take a very long time, please check the logs for more information

Creating SSH connections to remaining master nodes
  Node with IP 192.168.2.21    successfully connected to
  Node with IP 192.168.2.22    successfully connected to
  Node with IP 192.168.2.24    successfully connected to

+++ Running "noderemoval - validate_removal" hooks +++

Executing alba._validate_removal
Are you sure you want to remove node NODE-21? (y/[n]): y
Do you also want to remove the ASD manager and related ASDs? (y/[n]): y
Starting removal of node NODE-21 - 192.168.2.21
  Removing vPools from node
    Removing vPool pool-1 from node
root@192.168.2.22's password: 

+++++++++++++++++++++++++++++++++++++++
+++  An unexpected error occurred:  +++
+++++++++++++++++++++++++++++++++++++++

JeffreyDevloo commented 6 years ago

Hi yongshengma

Let's talk about the first snippet: I have a slight suspicion that the ssh-askpass package was removed (ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory). This is a dependency of the rpyc library that we are using. We use it to connect to a master node during the setup to distribute all public ssh keys.

Last snippet: I've noticed root@192.168.2.22's password: in the last snippet. This indicates that the current host (NODE-22) might not be able to connect to itself for some reason while previously it could (Node with IP 192.168.2.22 successfully connected to). This is a curious case...

Failures during node removal and node installation are logged in /var/log/ovs/lib.log. Could you provide us with the logging from /var/log/ovs/lib.log so we can get a sense of what went wrong?

Best regards

yongshengma commented 6 years ago

Hi JeffreyDevloo

The test environment is gone, so I need to set up a new one. I will reproduce the problem and give you an update.

Best regards

yongshengma commented 6 years ago

I have just finished a batch of tests, producing 10+ snippets. The failed remove node operations show different errors from snippet to snippet, for example in /var/log/ovs/lib.log:


AttributeError: 'PyrakoonClient' object has no attribute 'identifier'

- No connection available to node at '192.168.2.22:26400': Unable to query node "4SC9Ub2Lx1tZJjsH" to look up master

CalledProcessError: Command ''rabbitmq-server' '-detached'' returned non-zero exit status 1

ArakoonNotConnected: No connection available to node at '192.168.2.22:26400'

RuntimeError: Not all memcache nodes can be reached which is required for promoting a node.
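Entries like these can be tallied by exception name to see which failures dominate across runs. A minimal sketch, assuming each relevant log line contains an exception class name followed by a colon, as in the excerpts above; the real layout of lib.log may differ, and `summarize_errors` and `summarize_log` are hypothetical helpers, not part of ovs:

```python
import collections
import re

# Assumption: exception names end in Error, Exception or NotConnected
# and are immediately followed by a colon, as in the excerpts above.
ERROR_PATTERN = re.compile(r"\b([A-Z]\w*(?:Error|Exception|NotConnected))\b:")


def summarize_errors(lines):
    """Count occurrences of each exception name found in the given lines."""
    counts = collections.Counter()
    for line in lines:
        match = ERROR_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def summarize_log(path="/var/log/ovs/lib.log"):
    """Summarize exception names in a log file, read line by line."""
    with open(path) as handle:
        return summarize_errors(handle)
```

Lines that carry no exception name, such as the "No connection available" message on its own, are simply skipped.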

I can provide more lines if needed. One point I have to mention: the snippets above came from an incomplete cluster, that is, one with only 2 nodes or with a partially set up 3rd node.

I also ran some tests with properly set up 3-node and 4-node clusters. There, ovs remove node succeeded every time. So it may be unnecessary to spend effort on snippets from clusters that do not meet the minimal requirements at all.

However, I'd like to go back to the snippet in the first post. It is quite serious if the setup on the second or a later node is interrupted or aborted for some reason. As a result, a new node can't join, the troubled node can't be removed, or the whole cluster has to be reinstalled and set up again.

openvstorage / framework

can't remove a router if its setup aborts unexpectedly #2152