openvstorage / framework

The Framework is a set of components and tools which brings the user an interface (GUI / API) to setup, extend and manage an Open vStorage platform.
Other
27 stars 23 forks source link

Attempting to remove an offline node from the cluster raises RuntimeError when there's a vPool running exclusively on said node #1905

Open ovstmoens opened 6 years ago

ovstmoens commented 6 years ago

On DEV-4N-199-211 environment I was trying to reproduce the error mentioned in https://github.com/openvstorage/operations/issues/25.

After powering off one of the nodes, I've marked the node offline in the GUI and tried to remove the node from the cluster via CLI:

root@DEV-4N-199-211:~# ovs remove node 10.100.199.212
No handlers could be found for logger "ovs_extensions.db.arakoon.pyrakoon.client"
+++++++++++++++++++++
+++  Remove node  +++
+++++++++++++++++++++
WARNING: Some of these steps may take a very long time, please check the logs for more information

Creating SSH connections to remaining master nodes
  * Node with IP 10.100.199.211 - Successfully connected
  * Node with IP 10.100.199.212 - Unable to connect
  * Node with IP 10.100.199.213 - Successfully connected
  * Node with IP 10.100.199.214 - Successfully connected

+++ Running "noderemoval - validate_removal" hooks +++

Executing alba._validate_removal
Certain nodes appear to be offline. These will not fully removed and will cause issues if they are not really offline.
Offline nodes: 
  * 10.100.199.212 - Unable to connect.
Continue the removal with these being presumably offline? (y/[n]): y
Are you sure you want to remove node DEV-4N-199-212? (y/[n]): y
The removal of this StorageRouter will cause data loss on Backend mybackend
The removal of this StorageRouter will cause data loss on Backend mybackend02
Are you sure you want to continue? (y/[n]): y
Starting removal of node DEV-4N-199-212 - 10.100.199.212
  Marking all Storage Drivers served by Storage Router 10.100.199.212 as offline
  Removing vPools from node
    Removing vPool myvpool02 from node

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++  An unexpected error occurred:                      +++
+++  Could not find any responsive node in the cluster  +++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

/var/log/ovs/lib.log shows the following information:

2017-12-20 10:31:05 06300 +0100 - DEV-4N-199-211 - 10846/139969271932672 - lib/node-removal - 45 - INFO - Remove node
2017-12-20 10:31:05 06300 +0100 - DEV-4N-199-211 - 10846/139969271932672 - lib/node-removal - 46 - INFO - WARNING: Some of these steps may take a very long time, please check the logs for more information

2017-12-20 10:31:17 57300 +0100 - DEV-4N-199-211 - 10846/139969271932672 - lib/node-removal - 57 - INFO - Creating SSH connections to remaining master nodes
2017-12-20 10:31:17 60100 +0100 - DEV-4N-199-211 - 10846/139969271932672 - lib/node-removal - 58 - INFO -   * Node with IP 10.100.199.211 - Successfully connected
2017-12-20 10:31:17 60200 +0100 - DEV-4N-199-211 - 10846/139969271932672 - lib/node-removal - 59 - INFO -   * Node with IP 10.100.199.212 - Unable to connect
2017-12-20 10:31:17 83900 +0100 - DEV-4N-199-211 - 10846/139969271932672 - lib/node-removal - 60 - INFO -   * Node with IP 10.100.199.213 - Successfully connected
2017-12-20 10:31:18 04800 +0100 - DEV-4N-199-211 - 10846/139969271932672 - lib/node-removal - 61 - INFO -   * Node with IP 10.100.199.214 - Successfully connected
2017-12-20 10:32:17 14100 +0100 - DEV-4N-199-211 - 10846/139969271932672 - lib/node-removal - 112 - INFO - Running "noderemoval - validate_removal" hooks
2017-12-20 10:32:17 14100 +0100 - DEV-4N-199-211 - 10846/139969271932672 - lib/node-removal - 113 - INFO - Executing alba._validate_removal
2017-12-20 10:32:17 76800 +0100 - DEV-4N-199-211 - 10846/139969271932672 - lib/node-removal - 114 - INFO - Certain nodes appear to be offline. These will not fully removed and will cause issues if they are not really offline.
2017-12-20 10:32:17 76800 +0100 - DEV-4N-199-211 - 10846/139969271932672 - lib/node-removal - 115 - INFO - Offline nodes: 
  * 10.100.199.212 - Unable to connect.
2017-12-20 10:33:16 32400 +0100 - DEV-4N-199-211 - 10846/139969271932672 - lib/node-removal - 140 - INFO - Starting removal of node DEV-4N-199-212 - 10.100.199.212
2017-12-20 10:33:16 32400 +0100 - DEV-4N-199-211 - 10846/139969271932672 - lib/node-removal - 141 - INFO -   Marking all Storage Drivers served by Storage Router 10.100.199.212 as offline
2017-12-20 10:33:16 32900 +0100 - DEV-4N-199-211 - 10846/139969271932672 - lib/node-removal - 142 - INFO -   Removing vPools from node
2017-12-20 10:33:16 33000 +0100 - DEV-4N-199-211 - 10846/139969271932672 - lib/node-removal - 143 - INFO -     Removing vPool myvpool02 from node
2017-12-20 10:33:16 33100 +0100 - DEV-4N-199-211 - 10846/139969271932672 - lib/storagerouter - 144 - INFO - Remove StorageDriver - Guid a232cb14-bf9c-4591-a624-c24a1e718607 - Deleting StorageDriver myvpool02wIM5srbjm5g8MPfO
2017-12-20 10:33:17 21500 +0100 - DEV-4N-199-211 - 10846/139969271932672 - lib/storagerouter - 145 - INFO - Remove StorageDriver - Guid a232cb14-bf9c-4591-a624-c24a1e718607 - Checking availability of related StorageRouters
2017-12-20 10:33:17 22200 +0100 - DEV-4N-199-211 - 10846/139969271932672 - lib/storagerouter - 146 - WARNING - Remove StorageDriver - Guid a232cb14-bf9c-4591-a624-c24a1e718607 - StorageRouter DEV-4N-199-212 with IP 10.100.199.212 is offline
2017-12-20 10:33:17 22300 +0100 - DEV-4N-199-211 - 10846/139969271932672 - lib/node-removal - 147 - INFO - 

2017-12-20 10:33:17 22300 +0100 - DEV-4N-199-211 - 10846/139969271932672 - lib/node-removal - 148 - ERROR - An unexpected error occurred:
Traceback (most recent call last):
  File "ovs/lib/noderemoval.py", line 214, in remove_node
    offline_storage_router_guids=storage_routers_offline_guids)
  File "/usr/lib/python2.7/dist-packages/celery/local.py", line 188, in __call__
    return self._get_current_object()(*a, **kw)
  File "/usr/lib/python2.7/dist-packages/celery/app/task.py", line 420, in __call__
    return self.run(*args, **kwargs)
  File "ovs/lib/helpers/..//storagerouter.py", line 1080, in remove_storagedriver
    raise RuntimeError('Could not find any responsive node in the cluster')
RuntimeError: Could not find any responsive node in the cluster
2017-12-20 10:33:17 22400 +0100 - DEV-4N-199-211 - 10846/139969271932672 - lib/node-removal - 149 - ERROR - Could not find any responsive node in the cluster
Traceback (most recent call last):
  File "ovs/lib/noderemoval.py", line 214, in remove_node
    offline_storage_router_guids=storage_routers_offline_guids)
  File "/usr/lib/python2.7/dist-packages/celery/local.py", line 188, in __call__
    return self._get_current_object()(*a, **kw)
  File "/usr/lib/python2.7/dist-packages/celery/app/task.py", line 420, in __call__
    return self.run(*args, **kwargs)
  File "ovs/lib/helpers/..//storagerouter.py", line 1080, in remove_storagedriver
    raise RuntimeError('Could not find any responsive node in the cluster')
RuntimeError: Could not find any responsive node in the cluster
ovstmoens commented 6 years ago

I don't know if this is a side effect from previous attempt to remove the node or not, but I've noticed there was a vPool running only on this particular node. After deleting the vPool the demote fails again further along the process with a different RuntimeError.

Output:

No handlers could be found for logger "ovs_extensions.db.arakoon.pyrakoon.client"
+++++++++++++++++++++
+++  Remove node  +++
+++++++++++++++++++++
WARNING: Some of these steps may take a very long time, please check the logs for more information

Creating SSH connections to remaining master nodes
  * Node with IP 10.100.199.211 - Successfully connected
  * Node with IP 10.100.199.212 - Unable to connect
  * Node with IP 10.100.199.213 - Successfully connected
  * Node with IP 10.100.199.214 - Successfully connected

+++ Running "noderemoval - validate_removal" hooks +++

Executing alba._validate_removal
Certain nodes appear to be offline. These will not fully removed and will cause issues if they are not really offline.
Offline nodes: 
  * 10.100.199.212 - Unable to connect.
Continue the removal with these being presumably offline? (y/[n]): y
Are you sure you want to remove node DEV-4N-199-212? (y/[n]): y
The removal of this StorageRouter will cause data loss on Backend mybackend
The removal of this StorageRouter will cause data loss on Backend mybackend02
Are you sure you want to continue? (y/[n]): y
Starting removal of node DEV-4N-199-212 - 10.100.199.212
  Marking all Storage Drivers served by Storage Router 10.100.199.212 as offline
  Removing vPools from node

+++ Demoting node +++

Leaving Arakoon ovsdb cluster
Leaving Arakoon config cluster
Update configurations
Restarting master node services
Removing/unconfiguring offline RabbitMQ node
Update existing vPools
Restarting services
  [10.100.199.211] watcher-framework stopped
  [10.100.199.213] watcher-framework stopped
  [10.100.199.214] watcher-framework stopped
  [10.100.199.211] memcached restarted
  [10.100.199.213] memcached restarted
  [10.100.199.211] watcher-framework started
  [10.100.199.211] support-agent restarted
  [10.100.199.213] watcher-framework started
  [10.100.199.213] support-agent restarted
  [10.100.199.214] watcher-framework started
  [10.100.199.214] support-agent restarted

+++ Running "nodetype - demote" hooks +++

Executing alba._on_demote

+++++++++++++++++++++++++++++++++++++++
+++  An unexpected error occurred:  +++
+++  Config not yet loaded          +++
+++++++++++++++++++++++++++++++++++++++

/var/log/ovs/lib.log:

2017-12-20 12:11:09 25300 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-removal - 50 - INFO - Remove node
2017-12-20 12:11:09 25400 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-removal - 51 - INFO - WARNING: Some of these steps may take a very long time, please check the logs for more information

2017-12-20 12:11:27 80200 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-removal - 67 - INFO - Creating SSH connections to remaining master nodes
2017-12-20 12:11:27 82500 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-removal - 68 - INFO -   * Node with IP 10.100.199.211 - Successfully connected
2017-12-20 12:11:27 82700 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-removal - 69 - INFO -   * Node with IP 10.100.199.212 - Unable to connect
2017-12-20 12:11:28 08000 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-removal - 70 - INFO -   * Node with IP 10.100.199.213 - Successfully connected
2017-12-20 12:11:28 31100 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-removal - 71 - INFO -   * Node with IP 10.100.199.214 - Successfully connected
2017-12-20 12:12:21 29400 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-removal - 117 - INFO - Running "noderemoval - validate_removal" hooks
2017-12-20 12:12:21 29400 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-removal - 118 - INFO - Executing alba._validate_removal
2017-12-20 12:12:21 89700 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-removal - 119 - INFO - Certain nodes appear to be offline. These will not fully removed and will cause issues if they are not really offline.
2017-12-20 12:12:21 89700 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-removal - 120 - INFO - Offline nodes: 
  * 10.100.199.212 - Unable to connect.
2017-12-20 12:13:11 55800 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-removal - 138 - INFO - Starting removal of node DEV-4N-199-212 - 10.100.199.212
2017-12-20 12:13:11 55800 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-removal - 139 - INFO -   Marking all Storage Drivers served by Storage Router 10.100.199.212 as offline
2017-12-20 12:13:11 56100 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-removal - 140 - INFO -   Removing vPools from node
2017-12-20 12:13:11 56200 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-type - 141 - INFO - Demoting node
2017-12-20 12:13:21 37600 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-type - 152 - INFO - Leaving Arakoon ovsdb cluster
2017-12-20 12:13:42 47400 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-type - 188 - INFO - Leaving Arakoon config cluster
2017-12-20 12:13:48 28500 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-type - 209 - INFO - Update configurations
2017-12-20 12:13:54 54600 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-type - 220 - INFO - Restarting master node services
2017-12-20 12:13:54 61100 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-type - 221 - INFO - Removing/unconfiguring offline RabbitMQ node
2017-12-20 12:13:55 37300 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-type - 222 - INFO - Update existing vPools
2017-12-20 12:13:55 40800 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/storagedriver - 223 - INFO - Applying local storagedriver configuration changes
2017-12-20 12:13:57 53700 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/storagedriver - 224 - INFO - Changed events_amqp_uris from "[amqp://ovs:0penv5tor4ge@10.100.199.212:5672, amqp://ovs:0penv5tor4ge@10.100.199.211:5672, amqp://ovs:0penv5tor4ge@10.100.199.213:5672]" to "[amqp://ovs:0penv5tor4ge@10.100.199.211:5672, amqp://ovs:0penv5tor4ge@10.100.199.213:5672]"
2017-12-20 12:13:57 53800 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/storagedriver - 225 - INFO - Changes applied
2017-12-20 12:13:57 57400 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/storagedriver - 226 - INFO - Applying local storagedriver configuration changes
2017-12-20 12:14:00 53000 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/storagedriver - 227 - INFO - Changed events_amqp_uris from "[amqp://ovs:0penv5tor4ge@10.100.199.212:5672, amqp://ovs:0penv5tor4ge@10.100.199.211:5672, amqp://ovs:0penv5tor4ge@10.100.199.213:5672]" to "[amqp://ovs:0penv5tor4ge@10.100.199.211:5672, amqp://ovs:0penv5tor4ge@10.100.199.213:5672]"
2017-12-20 12:14:00 53100 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/storagedriver - 228 - INFO - Changes applied
2017-12-20 12:14:00 54100 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/storagedriver - 229 - INFO - Applying local storagedriver configuration changes
2017-12-20 12:14:02 78500 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/storagedriver - 230 - INFO - Changed events_amqp_uris from "[amqp://ovs:0penv5tor4ge@10.100.199.212:5672, amqp://ovs:0penv5tor4ge@10.100.199.211:5672, amqp://ovs:0penv5tor4ge@10.100.199.213:5672]" to "[amqp://ovs:0penv5tor4ge@10.100.199.211:5672, amqp://ovs:0penv5tor4ge@10.100.199.213:5672]"
2017-12-20 12:14:02 78600 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/storagedriver - 231 - INFO - Changes applied
2017-12-20 12:14:02 79100 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-type - 232 - INFO - Restarting services
2017-12-20 12:14:19 81700 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-type - 255 - INFO - Running "nodetype - demote" hooks
2017-12-20 12:14:19 81700 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-type - 256 - INFO - Executing alba._on_demote
2017-12-20 12:14:19 85600 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/alba - 257 - INFO - Shrinking ABM for Backend mybackend
2017-12-20 12:14:19 87000 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/alba - 258 - INFO - * Shrink ABM cluster
2017-12-20 12:14:23 58800 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/alba - 279 - INFO - * Updating ABM client config
2017-12-20 12:14:24 02800 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/alba - 281 - INFO - * Remove old ABM node from model
2017-12-20 12:14:24 04800 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/alba - 282 - INFO - Shrinking NSM for Backend mybackend
2017-12-20 12:14:24 06200 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/alba - 283 - INFO - * Shrink NSM cluster mybackend-nsm_0
2017-12-20 12:14:24 06200 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-removal - 284 - INFO - 

2017-12-20 12:14:24 06300 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-removal - 285 - ERROR - An unexpected error occurred:
Traceback (most recent call last):
  File "ovs/lib/noderemoval.py", line 224, in remove_node
    offline_nodes=storage_routers_offline)
  File "ovs/lib/nodetype.py", line 501, in demote_node
    offline_node_ips=offline_node_ips):
  File "ovs/lib/helpers/toolbox.py", line 104, in run_hooks
    fct(**kwargs)
  File "ovs/lib/helpers/..//alba.py", line 659, in _on_demote
    offline_nodes=offline_node_ips)
  File "/usr/lib/python2.7/dist-packages/ovs_extensions/db/arakoon/arakooninstaller.py", line 588, in shrink_cluster
    raise RuntimeError('Config not yet loaded')
RuntimeError: Config not yet loaded
2017-12-20 12:14:24 06300 +0100 - DEV-4N-199-211 - 12259/140046373598976 - lib/node-removal - 286 - ERROR - Config not yet loaded
Traceback (most recent call last):
  File "ovs/lib/noderemoval.py", line 224, in remove_node
    offline_nodes=storage_routers_offline)
  File "ovs/lib/nodetype.py", line 501, in demote_node
    offline_node_ips=offline_node_ips):
  File "ovs/lib/helpers/toolbox.py", line 104, in run_hooks
    fct(**kwargs)
  File "ovs/lib/helpers/..//alba.py", line 659, in _on_demote
    offline_nodes=offline_node_ips)
  File "/usr/lib/python2.7/dist-packages/ovs_extensions/db/arakoon/arakooninstaller.py", line 588, in shrink_cluster
    raise RuntimeError('Config not yet loaded')
RuntimeError: Config not yet loaded
ovstmoens commented 6 years ago

The node where this issue occurred was the only node for vPool "myvpool02" which probably caused this error, on other nodes in this environment the issue did not occur. It also doesn't occur when you remove the vPool manually first.

vPools running on a single node aren't exactly a real life use case so maybe it's not worth the time to make it possible but perhaps we could make the output a bit clearer.

yongshengma commented 6 years ago

The node where this issue occurred was the only node for vPool "myvpool02" which probably caused this error, on other nodes in this environment the issue did not occur. It also doesn't occur when you remove the vPool manually first.

I got 'similar' issue even if I removed the vPool manually first. I didn't keep the trackback, so that's why i say similar here.