Hi Yongsheng
The MaxRedirectsExceededException indicates that the volumedriver doesn't know where the volume is living. Is this volume up and running?
You can check this with:
from ovs.dal.hybrids.vdisk import VDisk
vdisk = VDisk('f665903e-f487-42ff-9ce4-ff734437acb5')
vdisk.info['live_status']
If it is not running, you'll have to restart it:
vdisk.storagedriver_client.restart_object(str(vdisk.volume_id), False)
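Putting the two together, a minimal sketch (assuming the OVS IPython shell and only the calls shown above):
from ovs.dal.hybrids.vdisk import VDisk

vdisk = VDisk('f665903e-f487-42ff-9ce4-ff734437acb5')  # the vdisk guid from this thread
if vdisk.info['live_status'] == 'NON_RUNNING':
    # The boolean flag is passed as False here, matching the suggestion above
    vdisk.storagedriver_client.restart_object(str(vdisk.volume_id), False)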
Best regards
Hi @JeffreyDevloo
It is not running. It still cannot start.
In [1]: from ovs.dal.hybrids.vdisk import VDisk
In [2]: vdisk = VDisk('f665903e-f487-42ff-9ce4-ff734437acb5')
In [3]: vdisk.info['live_status']
Out[3]: 'NON_RUNNING'
In [4]: vdisk.storagedriver_client.restart_object(str(vdisk.volume_id), False)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-4-52b4bcc22742> in <module>()
----> 1 vdisk.storagedriver_client.restart_object(str(vdisk.volume_id), False)
RuntimeError: got fault response restartVolume
It cannot start either when I pass True to restart_object().
Could you provide us with the error shown in the volumedriver logs? You can check where the volume was last running using:
from ovs.dal.hybrids.storagerouter import StorageRouter
from ovs.dal.hybrids.vdisk import VDisk
vdisk = VDisk(...)
print StorageRouter(vdisk.storagerouter_guid).ip
print vdisk.vpool.name
On this host, you'll have to check for restartVolume using journalctl -u ovs-volumedriver_<vpool_name>.
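For convenience, a small sketch along the same lines that just prints where to look (the service name follows the ovs-volumedriver_<vpool_name> convention above):
from ovs.dal.hybrids.storagerouter import StorageRouter
from ovs.dal.hybrids.vdisk import VDisk

vdisk = VDisk('f665903e-f487-42ff-9ce4-ff734437acb5')  # the vdisk guid from this thread
host_ip = StorageRouter(vdisk.storagerouter_guid).ip
# Run the printed journalctl command on that host and look for restartVolume
print 'host %s: journalctl -u ovs-volumedriver_%s | grep restartVolume' % (host_ip, vdisk.vpool.name)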
Hi @JeffreyDevloo, OK. I found the related log:
Apr 28 21:34:03 N05 volumedriver_fs.sh[2822]: 2019-04-28 21:34:03 665253 +0800 - N05 - 2822/0x00007f08bce62700 - volumedriverfs/XMLRPCTimingWrapper - 0000000000d6d050 - error - execute: restartVolume Caught fungi::IOException: ZCOVetcher: No such namespace
Context lines:
Apr 28 21:34:03 N05 volumedriver_fs.sh[2822]: 2019-04-28 21:34:03 664882 +0800 - N05 - 2822/0x00007f08bce62700 - volumedriverfs/MDSMetaDataBackend - 0000000000d6d04a - info - ~MDSMetaDataBackend: faf6b95e-7ed4-4c9c-ac21-b596c58cb329: used clusters: 3872479
Apr 28 21:34:03 N05 volumedriver_fs.sh[2822]: 2019-04-28 21:34:03 664916 +0800 - N05 - 2822/0x00007f08bce62700 - volumedriverfs/MetaDataServerClientNG - 0000000000d6d04b - info - ~ClientNG: 0x7ef9e801c080: terminating
Apr 28 21:34:03 N05 volumedriver_fs.sh[2822]: 2019-04-28 21:34:03 664995 +0800 - N05 - 2822/0x00007f08bce62700 - volumedriverfs/LocORemConnection - 0000000000d6d04c - info - ~LocORemConnection: 0x7ef9e801df90: terminating
Apr 28 21:34:03 N05 volumedriver_fs.sh[2822]: 2019-04-28 21:34:03 665120 +0800 - N05 - 2822/0x00007f08bce62700 - volumedriverfs/VolumeFactory - 0000000000d6d04d - warning - local_restart_volume: Cannot restart volume for namespace faf6b95e-7ed4-4c9c-ac21-b596c58cb329 locally, reason should be in the logs: ZCOVetcher: No such namespace
Apr 28 21:34:03 N05 volumedriver_fs.sh[2822]: 2019-04-28 21:34:03 665190 +0800 - N05 - 2822/0x00007f08bce62700 - volumedriverfs/VolManager - 0000000000d6d04e - notice - Local Restart, Namespace faf6b95e-7ed4-4c9c-ac21-b596c58cb329, owner tag 173, EXCEPTION
Apr 28 21:34:03 N05 volumedriver_fs.sh[2822]: 2019-04-28 21:34:03 665206 +0800 - N05 - 2822/0x00007f08bce62700 - volumedriverfs/VFSLocalNode - 0000000000d6d04f - error - local_restart_volume_: faf6b95e-7ed4-4c9c-ac21-b596c58cb329: failed to restart volume: ZCOVetcher: No such namespace
Apr 28 21:34:03 N05 volumedriver_fs.sh[2822]: 2019-04-28 21:34:03 665253 +0800 - N05 - 2822/0x00007f08bce62700 - volumedriverfs/XMLRPCTimingWrapper - 0000000000d6d050 - error - execute: restartVolume Caught fungi::IOException: ZCOVetcher: No such namespace
Apr 28 21:34:03 N05 volumedriver_fs.sh[2822]: 2019-04-28 21:34:03 665316 +0800 - N05 - 2822/0x00007f08bce62700 - volumedriverfs/XMLRPCConnection - 0000000000d6d051 - info - operator(): Closing socket 208
Hi @yongshengma,
From the log snippet you provided, it looks as though the instance owning the volume still has metadata cached (which is possibly outdated), but the local write buffer does not know about the volume (that's where the "No such namespace" is coming from). The following steps should allow you to recover from this (assuming that the DTL was ok at the time the volumedriver went down):
1. Move the md and tlog directories for the volume aside on that node (see the sketch after this list), e.g. /mnt/ssd1/<vpool-name>-db_md_1/faf6b95e-7ed4-4c9c-ac21-b596c58cb329 -> /mnt/ssd1/<vpool-name>-db_md_1/faf6b95e-7ed4-4c9c-ac21-b596c58cb329.OLD and /mnt/ssd1/<vpool-name>-db_tlog_1/faf6b95e-7ed4-4c9c-ac21-b596c58cb329 -> /mnt/ssd1/<vpool-name>-db_tlog_1/faf6b95e-7ed4-4c9c-ac21-b596c58cb329.OLD
2. Use the migrate call instead of the restart_object call (this fetches the relevant metadata and data from the backend and the DTL). You need to specify the ID of the owning instance as node_id (in other words, migrate it to itself).
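For illustration, a hedged sketch of step 1; the base paths and namespace are just the ones from the example above (<vpool-name> is a placeholder), so verify the actual mountpoints on your node before running anything:
import os

namespace = 'faf6b95e-7ed4-4c9c-ac21-b596c58cb329'
for base in ['/mnt/ssd1/<vpool-name>-db_md_1', '/mnt/ssd1/<vpool-name>-db_tlog_1']:
    src = os.path.join(base, namespace)
    if os.path.isdir(src):
        # Move the directory aside; the .OLD copy can be kept for rollback
        os.rename(src, src + '.OLD')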
Hope this helps.
Edit from JeffreyDevloo: The owner instance can be easily fetched using the DAL:
from ovs.dal.lists.vdisklist import VDiskList
vdisk = VDiskList.get_vdisk_by_volume_id('faf6b95e-7ed4-4c9c-ac21-b596c58cb329')
print vdisk.storagedriver_id  # the ID of the owning instance (node_id)
Hi @redlicha
I moved the md and tlog directories according to your suggestion and then did:
In [7]: from ovs.dal.lists.vdisklist import VDiskList
In [8]: vdisk = VDiskList.get_vdisk_by_volume_id('faf6b95e-7ed4-4c9c-ac21-b596c58cb329')
In [9]: print vdisk.storagedriver_id
poolWqzfQLqWu7qSjW4D
In [10]: from ovs.lib.vdisk import VDiskController
In [11]: volume_id = 'faf6b95e-7ed4-4c9c-ac21-b596c58cb329'
In [12]: new_owner_id = 'poolWqzfQLqWu7qSjW4D'
In [13]: VDiskController.migrate_from_voldrv(volume_id, new_owner_id)
This printed many lines of logs and stopped here:
2019-05-05 12:06:51 34800 +0800 - N05 - 5700/140139130042176 - lib/mdsservice.py - _sync_vdisk_to_reality - 1056 - INFO - vDisk f992ee2b-b574-47c1-8b51-cc7d89e5dd1c - hybase-d7839243-e9f7-4c68-8ca8-9099d4e2c242: Syncing to reality
2019-05-05 12:06:51 37000 +0800 - N05 - 5700/140139130042176 - lib/mdsservice.py - _sync_vdisk_to_reality - 1057 - INFO - vDisk f992ee2b-b574-47c1-8b51-cc7d89e5dd1c - hybase-d7839243-e9f7-4c68-8ca8-9099d4e2c242: Synced to reality
2019-05-05 12:06:51 37000 +0800 - N05 - 5700/140139130042176 - lib/decorators.py - log_message - 1058 - INFO - Ensure single CHAINED mode - ID 1557029211_oSzENTowtl - Task ovs.mds.ensure_safety finished successfully
2019-05-05 12:06:51 37500 +0800 - N05 - 5700/140139130042176 - lib/decorators.py - log_message - 1059 - INFO - Ensure single CHAINED mode - ID 1557029211_XZvBTfxt45 - New task ovs.mds.ensure_safety with params {'vdisk_guid': 'fba3c3ed-51a5-4d48-a25d-c18b5bc68def'} scheduled for execution
2019-05-05 12:06:51 37800 +0800 - N05 - 5700/140139130042176 - lib/mdsservice.py - ensure_safety - 1060 - INFO - vDisk fba3c3ed-51a5-4d48-a25d-c18b5bc68def - Start checkup for vDisk prod_IPIAM_node02
2019-05-05 12:06:51 46500 +0800 - N05 - 5700/140139130042176 - lib/mdsservice.py - ensure_safety - 1061 - INFO - vDisk fba3c3ed-51a5-4d48-a25d-c18b5bc68def - Current configuration: [{'ip': '159.226.241.77', 'port': 26300}, {'ip': '159.226.241.47', 'port': 26300}, {'ip': '159.226.241.68', 'port': 26300}]
2019-05-05 12:06:51 46500 +0800 - N05 - 5700/140139130042176 - lib/mdsservice.py - ensure_safety - 1062 - INFO - vDisk fba3c3ed-51a5-4d48-a25d-c18b5bc68def - No reconfiguration required
2019-05-05 12:06:51 46500 +0800 - N05 - 5700/140139130042176 - lib/mdsservice.py - _sync_vdisk_to_reality - 1063 - INFO - vDisk fba3c3ed-51a5-4d48-a25d-c18b5bc68def - prod_IPIAM_node02: Syncing to reality
2019-05-05 12:06:51 48500 +0800 - N05 - 5700/140139130042176 - lib/mdsservice.py - _sync_vdisk_to_reality - 1064 - INFO - vDisk fba3c3ed-51a5-4d48-a25d-c18b5bc68def - prod_IPIAM_node02: Synced to reality
2019-05-05 12:06:51 48500 +0800 - N05 - 5700/140139130042176 - lib/decorators.py - log_message - 1065 - INFO - Ensure single CHAINED mode - ID 1557029211_XZvBTfxt45 - Task ovs.mds.ensure_safety finished successfully
2019-05-05 12:06:51 49000 +0800 - N05 - 5700/140139130042176 - lib/decorators.py - log_message - 1066 - INFO - Ensure single CHAINED mode - ID 1557029211_o0KOrqU9Yc - New task ovs.mds.ensure_safety with params {'vdisk_guid': 'ff1f42da-c75e-468b-956a-c4b613105471'} scheduled for execution
2019-05-05 12:06:51 49300 +0800 - N05 - 5700/140139130042176 - lib/mdsservice.py - ensure_safety - 1067 - INFO - vDisk ff1f42da-c75e-468b-956a-c4b613105471 - Start checkup for vDisk HR-WEB
2019-05-05 12:06:51 50500 +0800 - N05 - 5700/140139130042176 - lib/mdsservice.py - mds_checkup - 1068 - ERROR - Ensure safety for vDisk HR-WEB with guid ff1f42da-c75e-468b-956a-c4b613105471 failed
Traceback (most recent call last):
File "/opt/OpenvStorage/ovs/lib/mdsservice.py", line 410, in mds_checkup
MDSServiceController.ensure_safety(vdisk_guid=vdisk.guid)
File "/usr/lib/python2.7/site-packages/celery/local.py", line 188, in __call__
return self._get_current_object()(*a, **kw)
File "/usr/lib/python2.7/site-packages/celery/app/task.py", line 420, in __call__
return self.run(*args, **kwargs)
File "/opt/OpenvStorage/ovs/lib/helpers/decorators.py", line 461, in new_function
output = f(*args, **kwargs)
File "/opt/OpenvStorage/ovs/lib/mdsservice.py", line 499, in ensure_safety
raise RuntimeError('vDisk {0} is not {1}, cannot update MDS configuration'.format(vdisk.guid, VDisk.STATUSES.RUNNING))
RuntimeError: vDisk ff1f42da-c75e-468b-956a-c4b613105471 is not RUNNING, cannot update MDS configuration
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-4-66f458b53c1c> in <module>()
----> 1 VDiskController.migrate_from_voldrv(volume_id, new_owner_id)
/usr/lib/python2.7/site-packages/celery/local.pyc in __call__(self, *a, **kw)
186
187 def __call__(self, *a, **kw):
--> 188 return self._get_current_object()(*a, **kw)
189
190 def __len__(self):
/usr/lib/python2.7/site-packages/celery/app/task.pyc in __call__(self, *args, **kwargs)
418 if self.__self__ is not None:
419 return self.run(self.__self__, *args, **kwargs)
--> 420 return self.run(*args, **kwargs)
421 finally:
422 self.pop_request()
/opt/OpenvStorage/ovs/lib/helpers/decorators.pyc in new_function(*args, **kwargs)
86
87 # Call the function
---> 88 return f(*args, **kwargs)
89
90 new_function.__name__ = f.__name__
/opt/OpenvStorage/ovs/lib/vdisk.pyc in migrate_from_voldrv(volume_id, new_owner_id)
309 if sd is not None:
310 VDiskController._logger.info('Migration - Guid {0} - ID {1} - Storage Router {2} is the new owner of vDisk {3}'.format(vdisk.guid, vdisk.volume_id, sd.storagerouter.name, vdisk.name))
--> 311 MDSServiceController.mds_checkup()
312 VDiskController.dtl_checkup(vdisk_guid=vdisk.guid)
313
/usr/lib/python2.7/site-packages/celery/local.pyc in __call__(self, *a, **kw)
186
187 def __call__(self, *a, **kw):
--> 188 return self._get_current_object()(*a, **kw)
189
190 def __len__(self):
/usr/lib/python2.7/site-packages/celery/app/task.pyc in __call__(self, *args, **kwargs)
418 if self.__self__ is not None:
419 return self.run(self.__self__, *args, **kwargs)
--> 420 return self.run(*args, **kwargs)
421 finally:
422 self.pop_request()
/opt/OpenvStorage/ovs/lib/helpers/decorators.pyc in new_function(self, *args, **kwargs)
459 if unittest_mode is True:
460 Decorators.unittest_thread_info_by_name[thread_name] = ('EXECUTING', None)
--> 461 output = f(*args, **kwargs)
462 if unittest_mode is True:
463 Decorators.unittest_thread_info_by_name[thread_name] = ('FINISHED', None)
/opt/OpenvStorage/ovs/lib/mdsservice.pyc in mds_checkup()
414 failures.append(message)
415 if len(failures) > 0:
--> 416 raise RuntimeError('\n - ' + '\n - '.join(failures))
417 MDSServiceController._logger.info('Finished')
418
RuntimeError:
- Ensure safety for vDisk prod_IPESB_hsp_disk2 with guid 58ed93f5-39b7-4c91-b3e0-6575a7c7c8de failed
- Ensure safety for vDisk test-HR-BaseDB with guid 866d3df4-4cb2-42e2-837c-ddf668ba6d2c failed
- Ensure safety for vDisk ftp with guid f665903e-f487-42ff-9ce4-ff734437acb5 failed
- Ensure safety for vDisk HR-WEB with guid ff1f42da-c75e-468b-956a-c4b613105471 failed
There are indeed 4 vdisks that failed to restart; I reported just one of them in the previous posts. In VDiskController.migrate_from_voldrv(volume_id, new_owner_id), the new_owner_id I used is the storage driver that currently owns this volume.
Good news! The move call works instead. Moving the volume's md and tlog directories aside is still necessary before that.
In [1]: from ovs.dal.lists.vdisklist import VDiskList
In [2]: vdisk = VDiskList.get_vdisk_by_volume_id('faf6b95e-7ed4-4c9c-ac21-b596c58cb329')
In [3]: print vdisk.guid
58ed93f5-39b7-4c91-b3e0-6575a7c7c8de
In [4]: print vdisk.storagerouter_guid
b0f91fbd-29be-4321-9428-821fc10aa112
In [5]:
In [5]: from ovs.lib.vdisk import VDiskController
In [6]: vdisk_guid = '58ed93f5-39b7-4c91-b3e0-6575a7c7c8de'
In [7]: target_storagerouter_guid = 'b0f91fbd-29be-4321-9428-821fc10aa112'
In [8]: VDiskController.move(vdisk_guid, target_storagerouter_guid)
2019-05-05 14:27:39 01300 +0800 - N05 - 45082/140071580366656 - lib/decorators.py - log_message - 0 - INFO - Ensure single CHAINED mode - ID 1557037659_znxDWhUvAd - New task ovs.mds.ensure_safety with params {'vdisk_guid': '58ed93f5-39b7-4c91-b3e0-6575a7c7c8de'} scheduled for execution
2019-05-05 14:27:39 01800 +0800 - N05 - 45082/140071580366656 - lib/mdsservice.py - ensure_safety - 1 - INFO - vDisk 58ed93f5-39b7-4c91-b3e0-6575a7c7c8de - Start checkup for vDisk prod_IPESB_hsp_disk2
2019-05-05 14:27:39 14300 +0800 - N05 - 45082/140071580366656 - lib/mdsservice.py - ensure_safety - 2 - INFO - vDisk 58ed93f5-39b7-4c91-b3e0-6575a7c7c8de - Current configuration: [{'ip': '159.226.241.77', 'port': 26300}, {'ip': '159.226.241.49', 'port': 26300}, {'ip': '159.226.241.48', 'port': 26300}]
2019-05-05 14:27:39 14300 +0800 - N05 - 45082/140071580366656 - lib/mdsservice.py - ensure_safety - 3 - INFO - vDisk 58ed93f5-39b7-4c91-b3e0-6575a7c7c8de - Reconfiguration required. Reasons:
2019-05-05 14:27:39 14300 +0800 - N05 - 45082/140071580366656 - lib/mdsservice.py - ensure_safety - 4 - INFO - vDisk 58ed93f5-39b7-4c91-b3e0-6575a7c7c8de - * Master 159.226.241.77:26300 is not local - Current location: 159.226.241.77 - Expected location: 159.226.241.80
2019-05-05 14:27:39 63500 +0800 - N05 - 45082/140071580366656 - lib/mdsservice.py - _sync_vdisk_to_reality - 5 - INFO - vDisk 58ed93f5-39b7-4c91-b3e0-6575a7c7c8de - prod_IPESB_hsp_disk2: Syncing to reality
2019-05-05 14:27:39 72900 +0800 - N05 - 45082/140071580366656 - lib/mdsservice.py - _sync_vdisk_to_reality - 6 - INFO - vDisk 58ed93f5-39b7-4c91-b3e0-6575a7c7c8de - prod_IPESB_hsp_disk2: Synced to reality
2019-05-05 14:27:39 73500 +0800 - N05 - 45082/140071580366656 - lib/mdsservice.py - ensure_safety - 7 - INFO - vDisk 58ed93f5-39b7-4c91-b3e0-6575a7c7c8de: Completed
2019-05-05 14:27:39 73600 +0800 - N05 - 45082/140071580366656 - lib/decorators.py - log_message - 8 - INFO - Ensure single CHAINED mode - ID 1557037659_znxDWhUvAd - Task ovs.mds.ensure_safety finished successfully
Much appreciated! I got all 4 failed vdisks back this way, but I would still like to know a bit more about this part if possible.
Hi @yongshengma
Looking into your first code snippet: the VDiskController.migrate_from_voldrv call you used wouldn't work in this case.
The documentation of this function can be found here: https://github.com/openvstorage/framework/blob/develop/ovs/lib/vdisk.py#L312
It's the code that we run after a volume was moved. In this code path, we try to put the MDS master on the host where the volume is running for better performance.
The reason why it did not help in your case is that the volume was not running and needed to be restarted from the backend.
The VDiskController.move call will execute the migrate call of the volumedriver (as suggested by @redlicha). This will move the volume to the specified host and start it there. Because you moved the directories aside, it performed a complete backend restart, resolving the problem you were seeing there.
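For future readers, a condensed recap of what ultimately worked in this thread (the guids are the ones used above; the md and tlog directories must be moved aside on the owning node first):
from ovs.dal.lists.vdisklist import VDiskList
from ovs.lib.vdisk import VDiskController

# After moving the md and tlog directories aside, restart the volume from the
# backend by "moving" it onto the StorageRouter that already owns it
vdisk = VDiskList.get_vdisk_by_volume_id('faf6b95e-7ed4-4c9c-ac21-b596c58cb329')
VDiskController.move(vdisk.guid, vdisk.storagerouter_guid)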
Best regards
Jeffrey Devloo
Hello,
The vPool service on 3 nodes (half of the 6-node cluster) had failed when I noticed this. One vdisk became unknown. I tried to restart it, but it failed:
RuntimeError: got fault response restartVolume
Other log:
I'm afraid I will lose this vdisk if it cannot start up. Any ideas, please?
Best regards, Yongsheng