VDisk: Failed to retrieve the DTL configuration from storage driver

yongshengma commented 5 years ago

Hello,

Vpool service on 3 nodes (half) of the cluster (6 nodes) failed when I noticed that. One vdisk becomes unknown. I tried to restart it but failed:

RuntimeError: got fault response restartVolume

other log:

Apr 29 08:15:47 N05 celery: 2019-04-29 08:15:47 98000 +0800 - N05 - 29058/140549621679936 - lib/vdisk.py - dtl_checkup - 13536 - ERROR -     VDisk ftp with guid f665903e-f487-42ff-9ce4-ff734437acb5: Failed to retrieve the DTL configuration from storage driver
Apr 29 08:15:47 N05 celery: Traceback (most recent call last):
Apr 29 08:15:47 N05 celery: File "/opt/OpenvStorage/ovs/lib/vdisk.py", line 1178, in dtl_checkup
Apr 29 08:15:47 N05 celery: current_dtl_config = vdisk.storagedriver_client.get_dtl_config(volume_id, req_timeout_secs=5)
Apr 29 08:15:47 N05 celery: MaxRedirectsExceededException: getFailOverCacheConfig

Perhaps I will lose this vdisk if it cannot start up. Any idea please?

Best regards, Yongsheng

JeffreyDevloo commented 5 years ago

Hi Yongsheng

The MaxRedirectsExceededException indicates that the volumedriver doesn't know where the volume is living. Is this volume up and running? You can check if it is using:

from ovs.dal.hybrids.vdisk import VDisk
vdisk = VDisk('f665903e-f487-42ff-9ce4-ff734437acb5')
vdisk.info['live_status']

If it is not running, you'll have to restart it:

vdisk.storagedriver_client.restart_object(str(vdisk.volume_id), False)

Best regards

yongshengma commented 5 years ago

Hi @JeffreyDevloo

It is not running. It still cannot start.

In [1]: from ovs.dal.hybrids.vdisk import VDisk

In [2]: vdisk = VDisk('f665903e-f487-42ff-9ce4-ff734437acb5')

In [3]: vdisk.info['live_status']
Out[3]: 'NON_RUNNING'

In [4]: vdisk.storagedriver_client.restart_object(str(vdisk.volume_id), False)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-52b4bcc22742> in <module>()
----> 1 vdisk.storagedriver_client.restart_object(str(vdisk.volume_id), False)

RuntimeError: got fault response restartVolume

It cannot start either when I used True in restart_object().

JeffreyDevloo commented 5 years ago

Could you provide us the error shown in the volumedriver logs? You can check where the volume was last using

from ovs.dal.hybrids.storagerouter import StorageRouter
from ovs.dal.hybrids.vdisk import VDisk

vdisk = VDisk(...)
print StorageRouter(vdisk.storagerouter_guid).ip
print vdisk.vpool.name

On this host, you'll have to check for restartVolume using journalctl -u ovs-volumedriver_<vpool_name>

yongshengma commented 5 years ago

Hi @JeffreyDevloo OK. I found related log

Apr 28 21:34:03 N05 volumedriver_fs.sh[2822]: 2019-04-28 21:34:03 665253 +0800 - N05 - 2822/0x00007f08bce62700 - volumedriverfs/XMLRPCTimingWrapper - 0000000000d6d050 - error - execute: restartVolume Caught fungi::IOException: ZCOVetcher: No such namespace

context lines

Apr 28 21:34:03 N05 volumedriver_fs.sh[2822]: 2019-04-28 21:34:03 664882 +0800 - N05 - 2822/0x00007f08bce62700 - volumedriverfs/MDSMetaDataBackend - 0000000000d6d04a - info - ~MDSMetaDataBackend: faf6b95e-7ed4-4c9c-ac21-b596c58cb329: used clusters: 3872479
Apr 28 21:34:03 N05 volumedriver_fs.sh[2822]: 2019-04-28 21:34:03 664916 +0800 - N05 - 2822/0x00007f08bce62700 - volumedriverfs/MetaDataServerClientNG - 0000000000d6d04b - info - ~ClientNG: 0x7ef9e801c080: terminating
Apr 28 21:34:03 N05 volumedriver_fs.sh[2822]: 2019-04-28 21:34:03 664995 +0800 - N05 - 2822/0x00007f08bce62700 - volumedriverfs/LocORemConnection - 0000000000d6d04c - info - ~LocORemConnection: 0x7ef9e801df90: terminating
Apr 28 21:34:03 N05 volumedriver_fs.sh[2822]: 2019-04-28 21:34:03 665120 +0800 - N05 - 2822/0x00007f08bce62700 - volumedriverfs/VolumeFactory - 0000000000d6d04d - warning - local_restart_volume: Cannot restart volume for namespace faf6b95e-7ed4-4c9c-ac21-b596c58cb329 locally, reason should be in the logs: ZCOVetcher: No such namespace
Apr 28 21:34:03 N05 volumedriver_fs.sh[2822]: 2019-04-28 21:34:03 665190 +0800 - N05 - 2822/0x00007f08bce62700 - volumedriverfs/VolManager - 0000000000d6d04e - notice - Local Restart, Namespace faf6b95e-7ed4-4c9c-ac21-b596c58cb329, owner tag 173, EXCEPTION
Apr 28 21:34:03 N05 volumedriver_fs.sh[2822]: 2019-04-28 21:34:03 665206 +0800 - N05 - 2822/0x00007f08bce62700 - volumedriverfs/VFSLocalNode - 0000000000d6d04f - error - local_restart_volume_: faf6b95e-7ed4-4c9c-ac21-b596c58cb329: failed to restart volume: ZCOVetcher: No such namespace
Apr 28 21:34:03 N05 volumedriver_fs.sh[2822]: 2019-04-28 21:34:03 665253 +0800 - N05 - 2822/0x00007f08bce62700 - volumedriverfs/XMLRPCTimingWrapper - 0000000000d6d050 - error - execute: restartVolume Caught fungi::IOException: ZCOVetcher: No such namespace
Apr 28 21:34:03 N05 volumedriver_fs.sh[2822]: 2019-04-28 21:34:03 665316 +0800 - N05 - 2822/0x00007f08bce62700 - volumedriverfs/XMLRPCConnection - 0000000000d6d051 - info - operator(): Closing socket 208

redlicha commented 5 years ago

Hi @yongshengma ,

from the log snippet you provided, it looks as though the instance owning the volume still has metadata cached (which is possibly outdated) but the local write buffer does not know about the volume (that's where the "No such namespace" is coming from). The following steps should allow to recover from this (assuming that the DTL was ok at the time the volumedriver went down):

move the md and tlog directories for the volume aside on that node, e.g. /mnt/ssd1/<vpool-name>-db_md_1/faf6b95e-7ed4-4c9c-ac21-b596c58cb329 -> /mnt/ssd1/<vpool-name>-db_md_1/faf6b95e-7ed4-4c9c-ac21-b596c58cb329.OLD and /mnt/ssd1/<vpool-name>-db_tlog_1/faf6b95e-7ed4-4c9c-ac21-b596c58cb329 -> /mnt/ssd1/<vpool-name>-db_tlog_1/faf6b95e-7ed4-4c9c-ac21-b596c58cb329.OLD
restart the volume via the migrate call instead of the restart_object call (this fetches the relevant metadata and data from the backend and the DTL). You need to specify the ID of the owning instance as node_id (IOW migrate it to itself) .

Hope this helps.

Edit from JeffreyDevloo: The owner instance can be easily fetched using the DAL:

from ovs.dal.lists.vdisklist import VDiskList
vdisk = VDiskList.get_vdisk_by_volume_id('faf6b95e-7ed4-4c9c-ac21-b596c58cb329')
print vdisk.volume_id

yongshengma commented 5 years ago

Hi @redlicha

I moved the md and tlog directories according to your suggestion and then did:

In [7]: from ovs.dal.lists.vdisklist import VDiskList

In [8]: vdisk = VDiskList.get_vdisk_by_volume_id('faf6b95e-7ed4-4c9c-ac21-b596c58cb329')

In [9]: print vdisk.storagedriver_id
poolWqzfQLqWu7qSjW4D

In [10]: from ovs.lib.vdisk import VDiskController

In [11]: volume_id = 'faf6b95e-7ed4-4c9c-ac21-b596c58cb329'

In [12]: new_owner_id = 'poolWqzfQLqWu7qSjW4D'

In [13]: VDiskController.migrate_from_voldrv(volume_id, new_owner_id)

popped up many line of logs and stopped here:

2019-05-05 12:06:51 34800 +0800 - N05 - 5700/140139130042176 - lib/mdsservice.py - _sync_vdisk_to_reality - 1056 - INFO - vDisk f992ee2b-b574-47c1-8b51-cc7d89e5dd1c - hybase-d7839243-e9f7-4c68-8ca8-9099d4e2c242: Syncing to reality
2019-05-05 12:06:51 37000 +0800 - N05 - 5700/140139130042176 - lib/mdsservice.py - _sync_vdisk_to_reality - 1057 - INFO - vDisk f992ee2b-b574-47c1-8b51-cc7d89e5dd1c - hybase-d7839243-e9f7-4c68-8ca8-9099d4e2c242: Synced to reality
2019-05-05 12:06:51 37000 +0800 - N05 - 5700/140139130042176 - lib/decorators.py - log_message - 1058 - INFO - Ensure single CHAINED mode - ID 1557029211_oSzENTowtl - Task ovs.mds.ensure_safety finished successfully
2019-05-05 12:06:51 37500 +0800 - N05 - 5700/140139130042176 - lib/decorators.py - log_message - 1059 - INFO - Ensure single CHAINED mode - ID 1557029211_XZvBTfxt45 - New task ovs.mds.ensure_safety with params {'vdisk_guid': 'fba3c3ed-51a5-4d48-a25d-c18b5bc68def'} scheduled for execution
2019-05-05 12:06:51 37800 +0800 - N05 - 5700/140139130042176 - lib/mdsservice.py - ensure_safety - 1060 - INFO - vDisk fba3c3ed-51a5-4d48-a25d-c18b5bc68def - Start checkup for vDisk prod_IPIAM_node02
2019-05-05 12:06:51 46500 +0800 - N05 - 5700/140139130042176 - lib/mdsservice.py - ensure_safety - 1061 - INFO - vDisk fba3c3ed-51a5-4d48-a25d-c18b5bc68def - Current configuration: [{'ip': '159.226.241.77', 'port': 26300}, {'ip': '159.226.241.47', 'port': 26300}, {'ip': '159.226.241.68', 'port': 26300}]
2019-05-05 12:06:51 46500 +0800 - N05 - 5700/140139130042176 - lib/mdsservice.py - ensure_safety - 1062 - INFO - vDisk fba3c3ed-51a5-4d48-a25d-c18b5bc68def - No reconfiguration required
2019-05-05 12:06:51 46500 +0800 - N05 - 5700/140139130042176 - lib/mdsservice.py - _sync_vdisk_to_reality - 1063 - INFO - vDisk fba3c3ed-51a5-4d48-a25d-c18b5bc68def - prod_IPIAM_node02: Syncing to reality
2019-05-05 12:06:51 48500 +0800 - N05 - 5700/140139130042176 - lib/mdsservice.py - _sync_vdisk_to_reality - 1064 - INFO - vDisk fba3c3ed-51a5-4d48-a25d-c18b5bc68def - prod_IPIAM_node02: Synced to reality
2019-05-05 12:06:51 48500 +0800 - N05 - 5700/140139130042176 - lib/decorators.py - log_message - 1065 - INFO - Ensure single CHAINED mode - ID 1557029211_XZvBTfxt45 - Task ovs.mds.ensure_safety finished successfully
2019-05-05 12:06:51 49000 +0800 - N05 - 5700/140139130042176 - lib/decorators.py - log_message - 1066 - INFO - Ensure single CHAINED mode - ID 1557029211_o0KOrqU9Yc - New task ovs.mds.ensure_safety with params {'vdisk_guid': 'ff1f42da-c75e-468b-956a-c4b613105471'} scheduled for execution
2019-05-05 12:06:51 49300 +0800 - N05 - 5700/140139130042176 - lib/mdsservice.py - ensure_safety - 1067 - INFO - vDisk ff1f42da-c75e-468b-956a-c4b613105471 - Start checkup for vDisk HR-WEB
2019-05-05 12:06:51 50500 +0800 - N05 - 5700/140139130042176 - lib/mdsservice.py - mds_checkup - 1068 - ERROR - Ensure safety for vDisk HR-WEB with guid ff1f42da-c75e-468b-956a-c4b613105471 failed
Traceback (most recent call last):
  File "/opt/OpenvStorage/ovs/lib/mdsservice.py", line 410, in mds_checkup
    MDSServiceController.ensure_safety(vdisk_guid=vdisk.guid)
  File "/usr/lib/python2.7/site-packages/celery/local.py", line 188, in __call__
    return self._get_current_object()(*a, **kw)
  File "/usr/lib/python2.7/site-packages/celery/app/task.py", line 420, in __call__
    return self.run(*args, **kwargs)
  File "/opt/OpenvStorage/ovs/lib/helpers/decorators.py", line 461, in new_function
    output = f(*args, **kwargs)
  File "/opt/OpenvStorage/ovs/lib/mdsservice.py", line 499, in ensure_safety
    raise RuntimeError('vDisk {0} is not {1}, cannot update MDS configuration'.format(vdisk.guid, VDisk.STATUSES.RUNNING))
RuntimeError: vDisk ff1f42da-c75e-468b-956a-c4b613105471 is not RUNNING, cannot update MDS configuration
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-66f458b53c1c> in <module>()
----> 1 VDiskController.migrate_from_voldrv(volume_id, new_owner_id)

/usr/lib/python2.7/site-packages/celery/local.pyc in __call__(self, *a, **kw)
    186 
    187     def __call__(self, *a, **kw):
--> 188         return self._get_current_object()(*a, **kw)
    189 
    190     def __len__(self):

/usr/lib/python2.7/site-packages/celery/app/task.pyc in __call__(self, *args, **kwargs)
    418             if self.__self__ is not None:
    419                 return self.run(self.__self__, *args, **kwargs)
--> 420             return self.run(*args, **kwargs)
    421         finally:
    422             self.pop_request()

/opt/OpenvStorage/ovs/lib/helpers/decorators.pyc in new_function(*args, **kwargs)
     86 
     87             # Call the function
---> 88             return f(*args, **kwargs)
     89 
     90         new_function.__name__ = f.__name__

/opt/OpenvStorage/ovs/lib/vdisk.pyc in migrate_from_voldrv(volume_id, new_owner_id)
    309             if sd is not None:
    310                 VDiskController._logger.info('Migration - Guid {0} - ID {1} - Storage Router {2} is the new owner of vDisk {3}'.format(vdisk.guid, vdisk.volume_id, sd.storagerouter.name, vdisk.name))
--> 311             MDSServiceController.mds_checkup()
    312             VDiskController.dtl_checkup(vdisk_guid=vdisk.guid)
    313 

/usr/lib/python2.7/site-packages/celery/local.pyc in __call__(self, *a, **kw)
    186 
    187     def __call__(self, *a, **kw):
--> 188         return self._get_current_object()(*a, **kw)
    189 
    190     def __len__(self):

/usr/lib/python2.7/site-packages/celery/app/task.pyc in __call__(self, *args, **kwargs)
    418             if self.__self__ is not None:
    419                 return self.run(self.__self__, *args, **kwargs)
--> 420             return self.run(*args, **kwargs)
    421         finally:
    422             self.pop_request()

/opt/OpenvStorage/ovs/lib/helpers/decorators.pyc in new_function(self, *args, **kwargs)
    459                             if unittest_mode is True:
    460                                 Decorators.unittest_thread_info_by_name[thread_name] = ('EXECUTING', None)
--> 461                             output = f(*args, **kwargs)
    462                             if unittest_mode is True:
    463                                 Decorators.unittest_thread_info_by_name[thread_name] = ('FINISHED', None)

/opt/OpenvStorage/ovs/lib/mdsservice.pyc in mds_checkup()
    414                     failures.append(message)
    415         if len(failures) > 0:
--> 416             raise RuntimeError('\n - ' + '\n - '.join(failures))
    417         MDSServiceController._logger.info('Finished')
    418 

RuntimeError: 
 - Ensure safety for vDisk prod_IPESB_hsp_disk2 with guid 58ed93f5-39b7-4c91-b3e0-6575a7c7c8de failed
 - Ensure safety for vDisk test-HR-BaseDB with guid 866d3df4-4cb2-42e2-837c-ddf668ba6d2c failed
 - Ensure safety for vDisk ftp with guid f665903e-f487-42ff-9ce4-ff734437acb5 failed
 - Ensure safety for vDisk HR-WEB with guid ff1f42da-c75e-468b-956a-c4b613105471 failed

There're 4 vdisks indeed that failed to restart. I reported just one of them in previous posts. In VDiskController.migrate_from_voldrv(volume_id, new_owner_id) , the new_owner_idi used is the storage driver that currently owns this volume.

yongshengma commented 5 years ago

Good news! move call works instead. Manipulation on volume's md and tlog is still necessary before that.

In [1]: from ovs.dal.lists.vdisklist import VDiskList

In [2]: vdisk = VDiskList.get_vdisk_by_volume_id('faf6b95e-7ed4-4c9c-ac21-b596c58cb329')

In [3]: print vdisk.guid
58ed93f5-39b7-4c91-b3e0-6575a7c7c8de

In [4]: print vdisk.storagerouter_guid
b0f91fbd-29be-4321-9428-821fc10aa112

In [5]: 

In [5]: from ovs.lib.vdisk import VDiskController

In [6]: vdisk_guid = '58ed93f5-39b7-4c91-b3e0-6575a7c7c8de'

In [7]: target_storagerouter_guid = 'b0f91fbd-29be-4321-9428-821fc10aa112'

In [8]: VDiskController.move(vdisk_guid, target_storagerouter_guid)
2019-05-05 14:27:39 01300 +0800 - N05 - 45082/140071580366656 - lib/decorators.py - log_message - 0 - INFO - Ensure single CHAINED mode - ID 1557037659_znxDWhUvAd - New task ovs.mds.ensure_safety with params {'vdisk_guid': '58ed93f5-39b7-4c91-b3e0-6575a7c7c8de'} scheduled for execution
2019-05-05 14:27:39 01800 +0800 - N05 - 45082/140071580366656 - lib/mdsservice.py - ensure_safety - 1 - INFO - vDisk 58ed93f5-39b7-4c91-b3e0-6575a7c7c8de - Start checkup for vDisk prod_IPESB_hsp_disk2
2019-05-05 14:27:39 14300 +0800 - N05 - 45082/140071580366656 - lib/mdsservice.py - ensure_safety - 2 - INFO - vDisk 58ed93f5-39b7-4c91-b3e0-6575a7c7c8de - Current configuration: [{'ip': '159.226.241.77', 'port': 26300}, {'ip': '159.226.241.49', 'port': 26300}, {'ip': '159.226.241.48', 'port': 26300}]
2019-05-05 14:27:39 14300 +0800 - N05 - 45082/140071580366656 - lib/mdsservice.py - ensure_safety - 3 - INFO - vDisk 58ed93f5-39b7-4c91-b3e0-6575a7c7c8de - Reconfiguration required. Reasons:
2019-05-05 14:27:39 14300 +0800 - N05 - 45082/140071580366656 - lib/mdsservice.py - ensure_safety - 4 - INFO - vDisk 58ed93f5-39b7-4c91-b3e0-6575a7c7c8de -    * Master 159.226.241.77:26300 is not local - Current location: 159.226.241.77 - Expected location: 159.226.241.80
2019-05-05 14:27:39 63500 +0800 - N05 - 45082/140071580366656 - lib/mdsservice.py - _sync_vdisk_to_reality - 5 - INFO - vDisk 58ed93f5-39b7-4c91-b3e0-6575a7c7c8de - prod_IPESB_hsp_disk2: Syncing to reality
2019-05-05 14:27:39 72900 +0800 - N05 - 45082/140071580366656 - lib/mdsservice.py - _sync_vdisk_to_reality - 6 - INFO - vDisk 58ed93f5-39b7-4c91-b3e0-6575a7c7c8de - prod_IPESB_hsp_disk2: Synced to reality
2019-05-05 14:27:39 73500 +0800 - N05 - 45082/140071580366656 - lib/mdsservice.py - ensure_safety - 7 - INFO - vDisk 58ed93f5-39b7-4c91-b3e0-6575a7c7c8de: Completed
2019-05-05 14:27:39 73600 +0800 - N05 - 45082/140071580366656 - lib/decorators.py - log_message - 8 - INFO - Ensure single CHAINED mode - ID 1557037659_znxDWhUvAd - Task ovs.mds.ensure_safety finished successfully

Appreciate! I got all 4 failed vdisks back in this way, but I would still like to know a bit more about this part if possible.

JeffreyDevloo commented 5 years ago

Hi @yongshengma

Looking into your first code snippet where you used VDiskController.migrate_from_voldr wouldn't work in this case. The documentation of this function can be found here: https://github.com/openvstorage/framework/blob/develop/ovs/lib/vdisk.py#L312 It's the code that we run after a volume was moved. In this code path, we try to put the MDS master on the host where the volume is running for better performance The reason why it did not help in your case is because the volume was not running and needed to be backend restarted.

The VDiskController.move will execute the migrate call of the volumedriver (as suggested by @redlicha). This will move the volume to the specified host and start it there. By removing the directories, it will have done a complete backend restart, resolving the problem you were seeing there.

Best regards

Jeffrey Devloo

openvstorage / framework

VDisk: Failed to retrieve the DTL configuration from storage driver #2291