openvstorage / framework-alba-plugin

The Framework ALBA plugin extends the OpenvStorage GUI with functionality to manage ASDs (Alternate Storage Daemon) and Seagate Kinetic drives.
Other
2 stars 3 forks source link

Cachebackend ASD removal failed #539

Open matthiasdeblock opened 6 years ago

matthiasdeblock commented 6 years ago

The removal of an ASD in the cachebackend failed with the current safety is not as expected.

As this is an ASD in the cachebackend we should be able to remove the disk without any issues.

gui_error_message

Dec  8 02:28:20 SV4SRV0015 celery[21626]: 2017-12-08 02:28:20 19300 -0500 - SV4SRV0015 - 21626/140037488994048 - celery/celery.worker.strategy - 3597855 - INFO - Received task: albanode.reset_asd[f9e9c3a6-40c3-41d2-bc98-227be57c0eca]
Dec  8 02:28:20 SV4SRV0015 celery[21626]: 2017-12-08 02:28:20 19400 -0500 - SV4SRV0015 - 21626/140037488994048 - celery/celery.pool - 3597856 - DEBUG - TaskPool: Apply <function _fast_trace_task at 0x7f5d00a437d0> (args:('albanode.reset_asd', 'f9e9c3a6-40c3-41d2-bc98-227
be57c0eca', ('9e8e8f0f-cca7-4b4d-b211-636a45c284e0', u'YdX0sSC1Fh4QutBx2hr4dh4RQXD9sbIS', {u'good': 0, u'critical': 50, u'lost': 287}), {}, {'utc': True, u'is_eager': False, 'chord': None, u'group': None, 'args': ('9e8e8f0f-cca7-4b4d-b211-636a45c284e0', u'YdX0sSC1Fh4QutB
x2hr4dh4RQXD9sbIS', {u'good': 0, u'critical': 50, u'lost': 287}), 'retries': 0, u'delivery_info': {u'priority': 0, u'redelivered': False, u'routing_key': u'generic.#', u'exchange': u'generic'}, 'expires': None, u'hostname': 'celery@SV4SRV0015', 'task': 'albanode.reset_as
d', 'callbacks': None, u'correlation_id': u'f9e9c3a6-40c3-41d2-bc98-227be57c0eca', 'errbacks': None, 'timelimit': (None, None), 'taskset': None, 'kwargs': {}, 'eta': None, u'reply_to': u'a72fb530-deb9-3e35-9544-bc005731da79', 'id': 'f9e9c3a6-40c3-41d2-bc98-227be57c0eca',
 u'headers': {}}) kwargs:{})
Dec  8 02:28:20 SV4SRV0015 celery[21626]: 2017-12-08 02:28:20 19400 -0500 - SV4SRV0015 - 21626/140037488994048 - celery/celery.worker.job - 3597857 - DEBUG - Task accepted: albanode.reset_asd[f9e9c3a6-40c3-41d2-bc98-227be57c0eca] pid:426
Dec  8 02:28:20 SV4SRV0015 celery[21626]: 2017-12-08 02:28:20 20400 -0500 - SV4SRV0015 - 426/140037488994048 - lib/ensure single - 3597803 - INFO - Ensure single CHAINED mode - ID 1512718100_8wPE6IHE7y - New task albanode.remove_asd with params {'node_guid': '9e8e8f0f-cc
a7-4b4d-b211-636a45c284e0', 'expected_safety': {u'good': 0, u'critical': 50, u'lost': 287}, 'asd_id': u'YdX0sSC1Fh4QutBx2hr4dh4RQXD9sbIS'} scheduled for execution
Dec  8 02:28:20 SV4SRV0015 celery[21626]: 2017-12-08 02:28:20 20700 -0500 - SV4SRV0015 - 426/140037488994048 - lib/albanode - 3597804 - DEBUG - Removing ASD YdX0sSC1Fh4QutBx2hr4dh4RQXD9sbIS at node 172.26.16.54
Dec  8 02:28:20 SV4SRV0015 celery[21626]: 2017-12-08 02:28:20 83600 -0500 - SV4SRV0015 - 21626/140037488994048 - celery/celery.worker.job - 3597858 - ERROR - Task albanode.reset_asd[f9e9c3a6-40c3-41d2-bc98-227be57c0eca] raised unexpected: RuntimeError("Cannot remove ASD 
YdX0sSC1Fh4QutBx2hr4dh4RQXD9sbIS as the current safety is not as expected ({'good': 0, 'critical': 70, 'lost': 267} vs {u'good': 0, u'critical': 50, u'lost': 287})",)
Dec  8 02:28:20 SV4SRV0015 celery[21626]: Traceback (most recent call last):
Dec  8 02:28:20 SV4SRV0015 celery[21626]:   File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 240, in trace_task
Dec  8 02:28:20 SV4SRV0015 celery[21626]:     R = retval = fun(*args, **kwargs)
Dec  8 02:28:20 SV4SRV0015 celery[21626]:   File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 438, in __protected_call__
Dec  8 02:28:20 SV4SRV0015 celery[21626]:     return self.run(*args, **kwargs)
Dec  8 02:28:20 SV4SRV0015 celery[21626]:   File "/opt/OpenvStorage/ovs/lib/albanode.py", line 370, in reset_asd
Dec  8 02:28:20 SV4SRV0015 celery[21626]:     expected_safety=expected_safety)
Dec  8 02:28:20 SV4SRV0015 celery[21626]:   File "/usr/lib/python2.7/dist-packages/celery/local.py", line 188, in __call__
Dec  8 02:28:20 SV4SRV0015 celery[21626]:     return self._get_current_object()(*a, **kw)
Dec  8 02:28:20 SV4SRV0015 celery[21626]:   File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 439, in __protected_call__
Dec  8 02:28:20 SV4SRV0015 celery[21626]:     return orig(self, *args, **kwargs)
Dec  8 02:28:20 SV4SRV0015 celery[21626]:   File "/usr/lib/python2.7/dist-packages/celery/app/task.py", line 420, in __call__
Dec  8 02:28:20 SV4SRV0015 celery[21626]:     return self.run(*args, **kwargs)
Dec  8 02:28:20 SV4SRV0015 celery[21626]:   File "/opt/OpenvStorage/ovs/lib/helpers/decorators.py", line 460, in new_function
Dec  8 02:28:20 SV4SRV0015 celery[21626]:     output = function(*args, **kwargs)
Dec  8 02:28:20 SV4SRV0015 celery[21626]:   File "/opt/OpenvStorage/ovs/lib/albanode.py", line 312, in remove_asd
Dec  8 02:28:20 SV4SRV0015 celery[21626]:     raise RuntimeError('Cannot remove ASD {0} as the current safety is not as expected ({1} vs {2})'.format(asd_id, final_safety, expected_safety))
Dec  8 02:28:20 SV4SRV0015 celery[21626]: RuntimeError: Cannot remove ASD YdX0sSC1Fh4QutBx2hr4dh4RQXD9sbIS as the current safety is not as expected ({'good': 0, 'critical': 70, 'lost': 267} vs {u'good': 0, u'critical': 50, u'lost': 287})
wimpers commented 6 years ago

@matthiasdeblock do I get it correct that the use case was the following.

We wanted to remove all ASD from a (cache) backend. This means no namespace should be on the backend otherwise you will hit the error as above as the policy can't be met.

@JeffreyDevloo Would https://github.com/openvstorage/framework/issues/1656 not solve this issue. First re-configure the backend to not use that cache backend (which removes the namespaces) and next you should be able to remove all ASDs.

JeffreyDevloo commented 6 years ago

No, this is to be resolved within the alba plugin. The plugin should determine whether or not the backend is storage or caching. If the backend is only used for caching, allow an unsafe safety

wimpers commented 6 years ago

Instead of the plugin checking allow a force remove (checkbox) of an ASD.

The issue with the plugin checking is that in case this is an external backend, you don't have the context if it is use for cache or storage backend.

wimpers commented 6 years ago

@matthiasdeblock is there a workaround for this?

matthiasdeblock commented 6 years ago

As I had to remove all disks on the entire VM, the only possibility here was to stop the VM and then remove the disks.

matthiasdeblock commented 6 years ago

It might be a problem if you have to remove a single disk on a node for replacement.

matthiasdeblock commented 6 years ago

Just had the chance to retry this one and had the same issue while removing an SSD from a VM on Nuvolat. (@wimpers , it's not removing the entire backend but just an SSD).

Strange thing I had now was that I first had these error messages: i_understand_remove_ssd

i_understand_remove_ssd_failed_removing

Logs:

Jan 10 02:47:38 SV4SRV0015 celery[29138]: 2018-01-10 02:47:38 53900 -0500 - SV4SRV0015 - 29138/139654189233920 - celery/celery.worker.strategy - 61129 - INFO - Received task: albanode.reset_asd[5469bcbe-abbb-4a8f-b1ee-560f50ed165a]
Jan 10 02:47:38 SV4SRV0015 celery[29138]: 2018-01-10 02:47:38 55200 -0500 - SV4SRV0015 - 6074/139654189233920 - lib/ensure single - 61120 - INFO - Ensure single CHAINED mode - ID 1515570458_OYignFByD4 - Setting initial value for key ovs_ensure_single_albanode.remove_asd
Jan 10 02:47:38 SV4SRV0015 celery[29138]: 2018-01-10 02:47:38 55300 -0500 - SV4SRV0015 - 6074/139654189233920 - lib/ensure single - 61121 - INFO - Ensure single CHAINED mode - ID 1515570458_OYignFByD4 - New task albanode.remove_asd with params {'node_guid': '6c61e174-f42
e-4260-87be-11cdf0ca8407', 'expected_safety': {u'good': 0, u'critical': 378, u'lost': 0}, 'asd_id': u'dIlfVBdoeLNFvljyqz45C7mkkujpUSKp'} scheduled for execution
Jan 10 02:47:40 SV4SRV0015 celery[29138]: 2018-01-10 02:47:40 11400 -0500 - SV4SRV0015 - 6074/139654189233920 - extensions/ovs_extensions.generic.volatilemutex - 61123 - WARNING - Waited 1.46413493156 sec for lock ovs_lock_ovs_data_albabackend_f2ee0cd9-7cd0-43fd-b980-c57
e96e23e6e_local_stack
Jan 10 02:47:44 SV4SRV0015 celery[29138]: 2018-01-10 02:47:44 01000 -0500 - SV4SRV0015 - 29138/139654189233920 - celery/celery.worker.strategy - 61130 - INFO - Received task: ovs.storagerouter.get_support_info[a2daf5dc-6057-4be4-a023-95475838d5f3]
Jan 10 02:47:44 SV4SRV0015 celery[29138]: 2018-01-10 02:47:44 01100 -0500 - SV4SRV0015 - 29138/139654189233920 - celery/celery.worker.autoscale - 61131 - INFO - Scaling up 1 processes.
Jan 10 02:47:44 SV4SRV0015 celery[29138]: 2018-01-10 02:47:44 08400 -0500 - SV4SRV0015 - 29138/139654189233920 - celery/celery.worker.job - 61132 - INFO - Task ovs.storagerouter.get_support_info[a2daf5dc-6057-4be4-a023-95475838d5f3] succeeded in 0.0260210689157s: {'enabled': True, 'enablesupport': False, 'clusterid': u'JoQXf3AYDGVMBZtb', 'nodeid': 'khZcV9WLvfgQKHbP',...
Jan 10 02:48:02 SV4SRV0015 celery[29138]: 2018-01-10 02:48:02 29000 -0500 - SV4SRV0015 - 29138/139654189233920 - celery/celery.worker.strategy - 61133 - INFO - Received task: ovs.storagerouter.ping[48c3d831-20a6-4b3f-84f4-67054aa0339d]
Jan 10 02:48:02 SV4SRV0015 celery[29138]: 2018-01-10 02:48:02 33000 -0500 - SV4SRV0015 - 29138/139654189233920 - celery/celery.worker.job - 61134 - INFO - Task ovs.storagerouter.ping[48c3d831-20a6-4b3f-84f4-67054aa0339d] succeeded in 0.0388398002833s: None
Jan 10 02:48:22 SV4SRV0015 celery[29138]: 2018-01-10 02:48:22 14700 -0500 - SV4SRV0015 - 29138/139654189233920 - celery/celery.worker.autoscale - 61135 - INFO - Scaling down -1 processes.
Jan 10 02:48:41 SV4SRV0015 celery[29138]: 2018-01-10 02:48:41 27900 -0500 - SV4SRV0015 - 29138/139654189233920 - celery/celery.worker.job - 61136 - ERROR - Task albanode.reset_asd[5469bcbe-abbb-4a8f-b1ee-560f50ed165a] raised unexpected: ReadTimeout(ReadTimeoutError('None: None',),)
Jan 10 02:48:41 SV4SRV0015 celery[29138]: Traceback (most recent call last):
Jan 10 02:48:41 SV4SRV0015 celery[29138]:   File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 240, in trace_task
Jan 10 02:48:41 SV4SRV0015 celery[29138]:     R = retval = fun(*args, **kwargs)
Jan 10 02:48:41 SV4SRV0015 celery[29138]:   File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 438, in __protected_call__
Jan 10 02:48:41 SV4SRV0015 celery[29138]:     return self.run(*args, **kwargs)
Jan 10 02:48:41 SV4SRV0015 celery[29138]:   File "/opt/OpenvStorage/ovs/lib/albanode.py", line 368, in reset_asd
Jan 10 02:48:41 SV4SRV0015 celery[29138]:     expected_safety=expected_safety)
Jan 10 02:48:41 SV4SRV0015 celery[29138]:   File "/usr/lib/python2.7/dist-packages/celery/local.py", line 188, in __call__
Jan 10 02:48:41 SV4SRV0015 celery[29138]:     return self._get_current_object()(*a, **kw)
Jan 10 02:48:41 SV4SRV0015 celery[29138]:   File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 439, in __protected_call__
Jan 10 02:48:41 SV4SRV0015 celery[29138]:     return orig(self, *args, **kwargs)
Jan 10 02:48:41 SV4SRV0015 celery[29138]:   File "/usr/lib/python2.7/dist-packages/celery/app/task.py", line 420, in __call__
Jan 10 02:48:41 SV4SRV0015 celery[29138]:     return self.run(*args, **kwargs)
Jan 10 02:48:41 SV4SRV0015 celery[29138]:   File "/opt/OpenvStorage/ovs/lib/helpers/decorators.py", line 461, in new_function
Jan 10 02:48:41 SV4SRV0015 celery[29138]:     output = f(*args, **kwargs)
Jan 10 02:48:41 SV4SRV0015 celery[29138]:   File "/opt/OpenvStorage/ovs/lib/albanode.py", line 326, in remove_asd
Jan 10 02:48:41 SV4SRV0015 celery[29138]:     asd_id=asd_id)
Jan 10 02:48:41 SV4SRV0015 celery[29138]:   File "/opt/OpenvStorage/ovs/extensions/plugins/asdmanager.py", line 226, in delete_asd
Jan 10 02:48:41 SV4SRV0015 celery[29138]:     return self._call(requests.post, 'disks/{0}/asds/{1}/delete'.format(disk_id, asd_id), timeout=60)
Jan 10 02:48:41 SV4SRV0015 celery[29138]:   File "/opt/OpenvStorage/ovs/extensions/plugins/asdmanager.py", line 95, in _call
Jan 10 02:48:41 SV4SRV0015 celery[29138]:     response = method(**kwargs)
Jan 10 02:48:41 SV4SRV0015 celery[29138]:   File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 112, in post
Jan 10 02:48:41 SV4SRV0015 celery[29138]:     return request('post', url, data=data, json=json, **kwargs)
Jan 10 02:48:41 SV4SRV0015 celery[29138]:   File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 58, in request
Jan 10 02:48:41 SV4SRV0015 celery[29138]:     return session.request(method=method, url=url, **kwargs)
Jan 10 02:48:41 SV4SRV0015 celery[29138]:   File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 508, in request
Jan 10 02:48:41 SV4SRV0015 celery[29138]:     resp = self.send(prep, **send_kwargs)
Jan 10 02:48:41 SV4SRV0015 celery[29138]:   File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 618, in send
Jan 10 02:48:41 SV4SRV0015 celery[29138]:     r = adapter.send(request, **kwargs)
Jan 10 02:48:41 SV4SRV0015 celery[29138]:   File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 521, in send
Jan 10 02:48:41 SV4SRV0015 celery[29138]:     raise ReadTimeout(e, request=request)
Jan 10 02:48:41 SV4SRV0015 celery[29138]: ReadTimeout: None: None

After this, the disk got into failed state: i_understand_remove_ssd_failed_ssd

While trying to remove it again, I've received the message again that there is potential data loss: i_understand_remove_ssd_safety_check_negative

image

Logs:


Jan 10 03:01:07 SV4SRV0021 celery[17944]: 2018-01-10 03:01:07 59100 -0500 - SV4SRV0021 - 17944/139781428029184 - celery/celery.worker.strategy - 72368 - INFO - Received task: albanode.reset_asd[024a093b-71a8-4e81-bb55-06321e4c9a1e]
Jan 10 03:01:07 SV4SRV0021 celery[17944]: 2018-01-10 03:01:07 60400 -0500 - SV4SRV0021 - 8414/139781428029184 - lib/ensure single - 72353 - INFO - Ensure single CHAINED mode - ID 1515571267_K38LIfgET4 - New task albanode.remove_asd with params {'node_guid': '6c61e174-f42e-4260-87be-11cdf0ca8407', 'expected_safety': {u'good': 366, u'critical': 4, u'lost': 8}, 'asd_id': u'dIlfVBdoeLNFvljyqz45C7mkkujpUSKp'} scheduled for execution
Jan 10 03:01:08 SV4SRV0021 celery[17944]: 2018-01-10 03:01:08 72300 -0500 - SV4SRV0021 - 8414/139781428029184 - celery/celery.redirected - 72356 - WARNING - 2018-01-10 03:01:08 72300 -0500 - SV4SRV0021 - 8414/139781428029184 - extensions/alba-cli - 72355 - WARNING - AlbaCLI call get-disk-safety took 0.97s
Jan 10 03:01:08 SV4SRV0021 celery[17944]: 2018-01-10 03:01:08 74100 -0500 - SV4SRV0021 - 17944/139781428029184 - celery/celery.worker.job - 72369 - ERROR - Task albanode.reset_asd[024a093b-71a8-4e81-bb55-06321e4c9a1e] raised unexpected: RuntimeError("Cannot remove ASD dIlfVBdoeLNFvljyqz45C7mkkujpUSKp as the current safety is not as expected ({'good': 365, 'critical': 5, 'lost': 8} vs {u'good': 366, u'critical': 4, u'lost': 8})",)
Jan 10 03:01:08 SV4SRV0021 celery[17944]: Traceback (most recent call last):
Jan 10 03:01:08 SV4SRV0021 celery[17944]:   File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 240, in trace_task
Jan 10 03:01:08 SV4SRV0021 celery[17944]:     R = retval = fun(*args, **kwargs)
Jan 10 03:01:08 SV4SRV0021 celery[17944]:   File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 438, in __protected_call__
Jan 10 03:01:08 SV4SRV0021 celery[17944]:     return self.run(*args, **kwargs)
Jan 10 03:01:08 SV4SRV0021 celery[17944]:   File "/opt/OpenvStorage/ovs/lib/albanode.py", line 368, in reset_asd
Jan 10 03:01:08 SV4SRV0021 celery[17944]:     expected_safety=expected_safety)
Jan 10 03:01:08 SV4SRV0021 celery[17944]:   File "/usr/lib/python2.7/dist-packages/celery/local.py", line 188, in __call__
Jan 10 03:01:08 SV4SRV0021 celery[17944]:     return self._get_current_object()(*a, **kw)
Jan 10 03:01:08 SV4SRV0021 celery[17944]:   File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 439, in __protected_call__
Jan 10 03:01:08 SV4SRV0021 celery[17944]:     return orig(self, *args, **kwargs)
Jan 10 03:01:08 SV4SRV0021 celery[17944]:   File "/usr/lib/python2.7/dist-packages/celery/app/task.py", line 420, in __call__
Jan 10 03:01:08 SV4SRV0021 celery[17944]:     return self.run(*args, **kwargs)
Jan 10 03:01:08 SV4SRV0021 celery[17944]:   File "/opt/OpenvStorage/ovs/lib/helpers/decorators.py", line 461, in new_function
Jan 10 03:01:08 SV4SRV0021 celery[17944]:     output = f(*args, **kwargs)
Jan 10 03:01:08 SV4SRV0021 celery[17944]:   File "/opt/OpenvStorage/ovs/lib/albanode.py", line 310, in remove_asd
Jan 10 03:01:08 SV4SRV0021 celery[17944]:     raise RuntimeError('Cannot remove ASD {0} as the current safety is not as expected ({1} vs {2})'.format(asd_id, final_safety, expected_safety))
Jan 10 03:01:08 SV4SRV0021 celery[17944]: RuntimeError: Cannot remove ASD dIlfVBdoeLNFvljyqz45C7mkkujpUSKp as the current safety is not as expected ({'good': 365, 'critical': 5, 'lost': 8} vs {u'good': 366, u'critical': 4, u'lost': 8})

As this is a removal of the entire VM from the SSD storage, we can use the previous mentioned 'workaround'. As in shut down the VM and then remove the ASD's.

JeffreyDevloo commented 6 years ago

The problem is, like @wimpers mentioned is that there is no way to detect in what context these ASDs are used... If there is a disk replacement to be done, it should be possible to clear a slot at once which takes in the factor of losing all it's osds on it.