matthiasdeblock opened this issue 6 years ago
@matthiasdeblock Do I understand correctly that the use case was the following?
We wanted to remove all ASDs from a (cache) backend. This means no namespaces should remain on the backend; otherwise you will hit the error above, as the policy can't be met.
@JeffreyDevloo Would https://github.com/openvstorage/framework/issues/1656 not solve this issue? First re-configure the backend to not use that cache backend (which removes the namespaces), and then you should be able to remove all ASDs.
No, this is to be resolved within the alba plugin. The plugin should determine whether the backend is used for storage or caching. If the backend is only used for caching, allow an unsafe removal (i.e. skip the safety check).
Instead of having the plugin check, allow a force remove (checkbox) of an ASD.
The issue with the plugin checking is that, in the case of an external backend, you don't have the context to know whether it is used as a cache or a storage backend.
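The force-remove idea could be sketched as an extra flag on the removal call that skips the safety comparison entirely. This is a hypothetical sketch; `get_disk_safety` and `delete_asd` below are illustrative stubs, not the actual OpenvStorage API (the real `remove_asd` lives in `ovs/lib/albanode.py`):

```python
# Hypothetical sketch of a force-remove flag; all names are illustrative.
def get_disk_safety(asd_id):
    # Stub: in reality this would query alba via 'get-disk-safety'.
    return {'good': 365, 'critical': 5, 'lost': 8}

def delete_asd(asd_id):
    # Stub: in reality this calls the ASD manager to delete the ASD.
    return 'removed {0}'.format(asd_id)

def remove_asd(asd_id, expected_safety, force=False):
    """Remove an ASD, skipping the safety comparison when force=True."""
    current = get_disk_safety(asd_id)
    if not force and current != expected_safety:
        raise RuntimeError(
            'Cannot remove ASD {0} as the current safety is not as expected '
            '({1} vs {2})'.format(asd_id, current, expected_safety))
    return delete_asd(asd_id)
```

With `force=True` (the checkbox), the removal proceeds regardless of the recomputed safety, which is exactly the behaviour wanted for cache-only backends where losing fragments is acceptable.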
@matthiasdeblock is there a workaround for this?
As I had to remove all disks on the entire VM, the only option here was to stop the VM and then remove the disks.
It would be a problem, however, if you had to remove a single disk on a node for replacement.
I just had the chance to retry this one and hit the same issue while removing an SSD from a VM on Nuvolat (@wimpers, it's not removing the entire backend, just an SSD).
The strange thing this time was that I first got these error messages:
Logs:
Jan 10 02:47:38 SV4SRV0015 celery[29138]: 2018-01-10 02:47:38 53900 -0500 - SV4SRV0015 - 29138/139654189233920 - celery/celery.worker.strategy - 61129 - INFO - Received task: albanode.reset_asd[5469bcbe-abbb-4a8f-b1ee-560f50ed165a]
Jan 10 02:47:38 SV4SRV0015 celery[29138]: 2018-01-10 02:47:38 55200 -0500 - SV4SRV0015 - 6074/139654189233920 - lib/ensure single - 61120 - INFO - Ensure single CHAINED mode - ID 1515570458_OYignFByD4 - Setting initial value for key ovs_ensure_single_albanode.remove_asd
Jan 10 02:47:38 SV4SRV0015 celery[29138]: 2018-01-10 02:47:38 55300 -0500 - SV4SRV0015 - 6074/139654189233920 - lib/ensure single - 61121 - INFO - Ensure single CHAINED mode - ID 1515570458_OYignFByD4 - New task albanode.remove_asd with params {'node_guid': '6c61e174-f42e-4260-87be-11cdf0ca8407', 'expected_safety': {u'good': 0, u'critical': 378, u'lost': 0}, 'asd_id': u'dIlfVBdoeLNFvljyqz45C7mkkujpUSKp'} scheduled for execution
Jan 10 02:47:40 SV4SRV0015 celery[29138]: 2018-01-10 02:47:40 11400 -0500 - SV4SRV0015 - 6074/139654189233920 - extensions/ovs_extensions.generic.volatilemutex - 61123 - WARNING - Waited 1.46413493156 sec for lock ovs_lock_ovs_data_albabackend_f2ee0cd9-7cd0-43fd-b980-c57e96e23e6e_local_stack
Jan 10 02:47:44 SV4SRV0015 celery[29138]: 2018-01-10 02:47:44 01000 -0500 - SV4SRV0015 - 29138/139654189233920 - celery/celery.worker.strategy - 61130 - INFO - Received task: ovs.storagerouter.get_support_info[a2daf5dc-6057-4be4-a023-95475838d5f3]
Jan 10 02:47:44 SV4SRV0015 celery[29138]: 2018-01-10 02:47:44 01100 -0500 - SV4SRV0015 - 29138/139654189233920 - celery/celery.worker.autoscale - 61131 - INFO - Scaling up 1 processes.
Jan 10 02:47:44 SV4SRV0015 celery[29138]: 2018-01-10 02:47:44 08400 -0500 - SV4SRV0015 - 29138/139654189233920 - celery/celery.worker.job - 61132 - INFO - Task ovs.storagerouter.get_support_info[a2daf5dc-6057-4be4-a023-95475838d5f3] succeeded in 0.0260210689157s: {'enabled': True, 'enablesupport': False, 'clusterid': u'JoQXf3AYDGVMBZtb', 'nodeid': 'khZcV9WLvfgQKHbP',...
Jan 10 02:48:02 SV4SRV0015 celery[29138]: 2018-01-10 02:48:02 29000 -0500 - SV4SRV0015 - 29138/139654189233920 - celery/celery.worker.strategy - 61133 - INFO - Received task: ovs.storagerouter.ping[48c3d831-20a6-4b3f-84f4-67054aa0339d]
Jan 10 02:48:02 SV4SRV0015 celery[29138]: 2018-01-10 02:48:02 33000 -0500 - SV4SRV0015 - 29138/139654189233920 - celery/celery.worker.job - 61134 - INFO - Task ovs.storagerouter.ping[48c3d831-20a6-4b3f-84f4-67054aa0339d] succeeded in 0.0388398002833s: None
Jan 10 02:48:22 SV4SRV0015 celery[29138]: 2018-01-10 02:48:22 14700 -0500 - SV4SRV0015 - 29138/139654189233920 - celery/celery.worker.autoscale - 61135 - INFO - Scaling down -1 processes.
Jan 10 02:48:41 SV4SRV0015 celery[29138]: 2018-01-10 02:48:41 27900 -0500 - SV4SRV0015 - 29138/139654189233920 - celery/celery.worker.job - 61136 - ERROR - Task albanode.reset_asd[5469bcbe-abbb-4a8f-b1ee-560f50ed165a] raised unexpected: ReadTimeout(ReadTimeoutError('None: None',),)
Jan 10 02:48:41 SV4SRV0015 celery[29138]: Traceback (most recent call last):
Jan 10 02:48:41 SV4SRV0015 celery[29138]: File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 240, in trace_task
Jan 10 02:48:41 SV4SRV0015 celery[29138]: R = retval = fun(*args, **kwargs)
Jan 10 02:48:41 SV4SRV0015 celery[29138]: File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 438, in __protected_call__
Jan 10 02:48:41 SV4SRV0015 celery[29138]: return self.run(*args, **kwargs)
Jan 10 02:48:41 SV4SRV0015 celery[29138]: File "/opt/OpenvStorage/ovs/lib/albanode.py", line 368, in reset_asd
Jan 10 02:48:41 SV4SRV0015 celery[29138]: expected_safety=expected_safety)
Jan 10 02:48:41 SV4SRV0015 celery[29138]: File "/usr/lib/python2.7/dist-packages/celery/local.py", line 188, in __call__
Jan 10 02:48:41 SV4SRV0015 celery[29138]: return self._get_current_object()(*a, **kw)
Jan 10 02:48:41 SV4SRV0015 celery[29138]: File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 439, in __protected_call__
Jan 10 02:48:41 SV4SRV0015 celery[29138]: return orig(self, *args, **kwargs)
Jan 10 02:48:41 SV4SRV0015 celery[29138]: File "/usr/lib/python2.7/dist-packages/celery/app/task.py", line 420, in __call__
Jan 10 02:48:41 SV4SRV0015 celery[29138]: return self.run(*args, **kwargs)
Jan 10 02:48:41 SV4SRV0015 celery[29138]: File "/opt/OpenvStorage/ovs/lib/helpers/decorators.py", line 461, in new_function
Jan 10 02:48:41 SV4SRV0015 celery[29138]: output = f(*args, **kwargs)
Jan 10 02:48:41 SV4SRV0015 celery[29138]: File "/opt/OpenvStorage/ovs/lib/albanode.py", line 326, in remove_asd
Jan 10 02:48:41 SV4SRV0015 celery[29138]: asd_id=asd_id)
Jan 10 02:48:41 SV4SRV0015 celery[29138]: File "/opt/OpenvStorage/ovs/extensions/plugins/asdmanager.py", line 226, in delete_asd
Jan 10 02:48:41 SV4SRV0015 celery[29138]: return self._call(requests.post, 'disks/{0}/asds/{1}/delete'.format(disk_id, asd_id), timeout=60)
Jan 10 02:48:41 SV4SRV0015 celery[29138]: File "/opt/OpenvStorage/ovs/extensions/plugins/asdmanager.py", line 95, in _call
Jan 10 02:48:41 SV4SRV0015 celery[29138]: response = method(**kwargs)
Jan 10 02:48:41 SV4SRV0015 celery[29138]: File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 112, in post
Jan 10 02:48:41 SV4SRV0015 celery[29138]: return request('post', url, data=data, json=json, **kwargs)
Jan 10 02:48:41 SV4SRV0015 celery[29138]: File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 58, in request
Jan 10 02:48:41 SV4SRV0015 celery[29138]: return session.request(method=method, url=url, **kwargs)
Jan 10 02:48:41 SV4SRV0015 celery[29138]: File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 508, in request
Jan 10 02:48:41 SV4SRV0015 celery[29138]: resp = self.send(prep, **send_kwargs)
Jan 10 02:48:41 SV4SRV0015 celery[29138]: File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 618, in send
Jan 10 02:48:41 SV4SRV0015 celery[29138]: r = adapter.send(request, **kwargs)
Jan 10 02:48:41 SV4SRV0015 celery[29138]: File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 521, in send
Jan 10 02:48:41 SV4SRV0015 celery[29138]: raise ReadTimeout(e, request=request)
Jan 10 02:48:41 SV4SRV0015 celery[29138]: ReadTimeout: None: None
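The first failure above is a plain HTTP read timeout: `delete_asd` posts to `disks/{0}/asds/{1}/delete` with `timeout=60`, and when the ASD manager takes longer than that, the task dies mid-removal, leaving the disk half-deleted. One way such a call could be made more robust is a retry wrapper; this is a hypothetical sketch using a stand-in exception class rather than `requests.exceptions.ReadTimeout`:

```python
class ReadTimeout(Exception):
    """Stand-in for requests.exceptions.ReadTimeout in this sketch."""

def call_with_retry(func, retries=3):
    """Retry a flaky call a few times on read timeouts before giving up.

    The server may have accepted the request and simply replied too slowly,
    so callers still need the operation itself to be idempotent.
    """
    last_exc = None
    for _ in range(retries):
        try:
            return func()
        except ReadTimeout as exc:
            last_exc = exc
    raise last_exc
```

Retrying only helps if the delete endpoint is idempotent; as the second log below shows, the current safety check makes a blind retry fail for a different reason.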
After this, the disk went into a failed state.
When trying to remove it again, I once more received the message that there is potential data loss:
Logs:
Jan 10 03:01:07 SV4SRV0021 celery[17944]: 2018-01-10 03:01:07 59100 -0500 - SV4SRV0021 - 17944/139781428029184 - celery/celery.worker.strategy - 72368 - INFO - Received task: albanode.reset_asd[024a093b-71a8-4e81-bb55-06321e4c9a1e]
Jan 10 03:01:07 SV4SRV0021 celery[17944]: 2018-01-10 03:01:07 60400 -0500 - SV4SRV0021 - 8414/139781428029184 - lib/ensure single - 72353 - INFO - Ensure single CHAINED mode - ID 1515571267_K38LIfgET4 - New task albanode.remove_asd with params {'node_guid': '6c61e174-f42e-4260-87be-11cdf0ca8407', 'expected_safety': {u'good': 366, u'critical': 4, u'lost': 8}, 'asd_id': u'dIlfVBdoeLNFvljyqz45C7mkkujpUSKp'} scheduled for execution
Jan 10 03:01:08 SV4SRV0021 celery[17944]: 2018-01-10 03:01:08 72300 -0500 - SV4SRV0021 - 8414/139781428029184 - celery/celery.redirected - 72356 - WARNING - 2018-01-10 03:01:08 72300 -0500 - SV4SRV0021 - 8414/139781428029184 - extensions/alba-cli - 72355 - WARNING - AlbaCLI call get-disk-safety took 0.97s
Jan 10 03:01:08 SV4SRV0021 celery[17944]: 2018-01-10 03:01:08 74100 -0500 - SV4SRV0021 - 17944/139781428029184 - celery/celery.worker.job - 72369 - ERROR - Task albanode.reset_asd[024a093b-71a8-4e81-bb55-06321e4c9a1e] raised unexpected: RuntimeError("Cannot remove ASD dIlfVBdoeLNFvljyqz45C7mkkujpUSKp as the current safety is not as expected ({'good': 365, 'critical': 5, 'lost': 8} vs {u'good': 366, u'critical': 4, u'lost': 8})",)
Jan 10 03:01:08 SV4SRV0021 celery[17944]: Traceback (most recent call last):
Jan 10 03:01:08 SV4SRV0021 celery[17944]: File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 240, in trace_task
Jan 10 03:01:08 SV4SRV0021 celery[17944]: R = retval = fun(*args, **kwargs)
Jan 10 03:01:08 SV4SRV0021 celery[17944]: File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 438, in __protected_call__
Jan 10 03:01:08 SV4SRV0021 celery[17944]: return self.run(*args, **kwargs)
Jan 10 03:01:08 SV4SRV0021 celery[17944]: File "/opt/OpenvStorage/ovs/lib/albanode.py", line 368, in reset_asd
Jan 10 03:01:08 SV4SRV0021 celery[17944]: expected_safety=expected_safety)
Jan 10 03:01:08 SV4SRV0021 celery[17944]: File "/usr/lib/python2.7/dist-packages/celery/local.py", line 188, in __call__
Jan 10 03:01:08 SV4SRV0021 celery[17944]: return self._get_current_object()(*a, **kw)
Jan 10 03:01:08 SV4SRV0021 celery[17944]: File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 439, in __protected_call__
Jan 10 03:01:08 SV4SRV0021 celery[17944]: return orig(self, *args, **kwargs)
Jan 10 03:01:08 SV4SRV0021 celery[17944]: File "/usr/lib/python2.7/dist-packages/celery/app/task.py", line 420, in __call__
Jan 10 03:01:08 SV4SRV0021 celery[17944]: return self.run(*args, **kwargs)
Jan 10 03:01:08 SV4SRV0021 celery[17944]: File "/opt/OpenvStorage/ovs/lib/helpers/decorators.py", line 461, in new_function
Jan 10 03:01:08 SV4SRV0021 celery[17944]: output = f(*args, **kwargs)
Jan 10 03:01:08 SV4SRV0021 celery[17944]: File "/opt/OpenvStorage/ovs/lib/albanode.py", line 310, in remove_asd
Jan 10 03:01:08 SV4SRV0021 celery[17944]: raise RuntimeError('Cannot remove ASD {0} as the current safety is not as expected ({1} vs {2})'.format(asd_id, final_safety, expected_safety))
Jan 10 03:01:08 SV4SRV0021 celery[17944]: RuntimeError: Cannot remove ASD dIlfVBdoeLNFvljyqz45C7mkkujpUSKp as the current safety is not as expected ({'good': 365, 'critical': 5, 'lost': 8} vs {u'good': 366, u'critical': 4, u'lost': 8})
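The retry then trips the exact-equality safety guard in `remove_asd`: the GUI captured `expected_safety` before the first attempt, the backend recomputes the safety at removal time, and the half-finished, timed-out first attempt shifted the counts in between. A small sketch of that drift, using the values from the log above:

```python
def safety_delta(expected, final):
    """Per-category difference between the safety the user confirmed and
    the safety recomputed at removal time."""
    return {key: final[key] - expected[key] for key in expected}

# Values from the traceback above:
expected = {'good': 366, 'critical': 4, 'lost': 8}  # captured by the GUI
final = {'good': 365, 'critical': 5, 'lost': 8}     # recomputed at removal time

# One fragment moved from 'good' to 'critical' during the timed-out first
# attempt, so the exact-equality comparison refuses the retry even though
# nothing extra was lost.
```

This suggests the guard fails not because more data would be lost than the user accepted, but simply because any drift at all invalidates the stale `expected_safety`.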
As this is a removal of the entire VM from the SSD storage, we can use the previously mentioned 'workaround': shut down the VM and then remove the ASDs.
The problem, as @wimpers mentioned, is that there is no way to detect in what context these ASDs are used. If a disk replacement needs to be done, it should be possible to clear a slot at once, accepting the loss of all the OSDs on it.
The removal of an ASD in the cache backend failed with 'the current safety is not as expected'.
As this is an ASD in the cache backend, we should be able to remove the disk without any issues.