openvstorage / openvstorage-health-check

The health check is classified as a monitoring and detection tool for Open vStorage.

got fault response unlink error when healthcheck volume still exists in framework #244

Closed jeroenmaelbrancke closed 7 years ago

jeroenmaelbrancke commented 7 years ago

Problem description

When the .raw file from the healthcheck has not been cleaned up in the framework, we receive a `got fault response unlink` exception.

[FAILED] Volumedriver of vPool 'data02' seems to have `runtime` problems. Got `got fault response unlink` while executing.

Possible root of the problem

The .raw file is still present in the framework even though the volume has already been deleted from the volumedriver.

[WARNING] Detected volumes that are MISSING in volumedriver but ARE in ovsdb in vPoolvpool name: data02 - vdisk guid(s):39beeac4-e71f-43bb-9ff4-f1687b441980 

Possible solution

Check whether the healthcheck volume is still present in the model and, if needed, delete the volume. I know the healthcheck is not a self-healing product, but this would be a good solution (only for the volumes created by the healthcheck itself).
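
A minimal sketch of what such a clean-up could look like, assuming the framework DAL exposes VDiskList.get_vdisks() and that healthcheck volumes are recognisable by their ovs-healthcheck-test- name prefix (both assumptions, not confirmed in this ticket):

# Sketch only: iterate over all modelled vdisks and clean up leftovers created
# by the healthcheck. VDiskList.get_vdisks() and the name prefix are assumptions.
from ovs.dal.lists.vdisklist import VDiskList
from ovs.lib.vdisk import VDiskController

def clean_leftover_healthcheck_vdisks():
    for vdisk in VDiskList.get_vdisks():
        if vdisk.name and vdisk.name.startswith('ovs-healthcheck-test-'):
            # Only touch volumes created by the healthcheck itself.
            VDiskController.clean_vdisk_from_model(vdisk)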

Temporary solution

Delete the vdisk from the model

In [24]: from ovs.lib.vdisk import VDiskController
In [25]: vdisk.guid
Out[25]: '39beeac4-e71f-43bb-9ff4-f1687b441980'
In [26]: vdisk.name
Out[26]: u'ovs-healthcheck-test-VQYqiMMeyXqCEyrG.raw'
In [27]: VDiskController.clean_vdisk_from_model(vdisk)
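
The transcript above starts from an already-fetched vdisk object. A minimal sketch of the lookup that would precede it, assuming the DAL hybrid ovs.dal.hybrids.vdisk.VDisk can be instantiated with a guid (an assumption, not shown in the original transcript):

In [22]: from ovs.dal.hybrids.vdisk import VDisk
In [23]: vdisk = VDisk('39beeac4-e71f-43bb-9ff4-f1687b441980')  # guid reported in the warning above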

Additional information

Setup

Packages

ii  openvstorage                                          2.7.8-fargo.2-1                 amd64                           openvStorage
ii  openvstorage-backend                                  1.7.8-fargo.3-1                 amd64                           openvStorage Backend plugin
ii  openvstorage-backend-core                             1.7.8-fargo.3-1                 amd64                           openvStorage Backend plugin core
ii  openvstorage-backend-webapps                          1.7.8-fargo.3-1                 amd64                           openvStorage Backend plugin Web Applications
ii  openvstorage-core                                     2.7.8-fargo.2-1                 amd64                           openvStorage core
ii  openvstorage-hc                                       1.7.8-fargo.3-1                 amd64                           openvStorage Backend plugin HyperConverged
ii  openvstorage-health-check                             3.1.3-fargo.1-1                 amd64                           Open vStorage HealthCheck
ii  openvstorage-sdm                                      1.6.8-fargo.1-1                 amd64                           Open vStorage Backend ASD Manager
ii  openvstorage-webapps                                  2.7.8-fargo.2-1                 amd64                           openvStorage Web Applications
wimpers commented 7 years ago

Check if the healthcheck volume is still present in the model and if needed deleted the volume.

What is the root cause for the delete not working as intended? Fixing the root cause would be the best solution. It would also make sense for the health check to verify, after deleting the disk, that everything was actually removed.
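
A hedged sketch of such a post-delete verification, assuming the healthcheck knows the guid of the test volume it created and the path of the .raw file on the vPool mountpoint (VDiskList.get_vdisks() and the path layout are assumptions):

import os
from ovs.dal.lists.vdisklist import VDiskList

def healthcheck_volume_fully_removed(volume_guid, raw_path):
    # True only when the test volume is gone from both the model and the vPool filesystem.
    still_modelled = any(vdisk.guid == volume_guid for vdisk in VDiskList.get_vdisks())
    still_on_disk = os.path.exists(raw_path)  # e.g. /mnt/<vpool>/ovs-healthcheck-test-<id>.raw
    return not still_modelled and not still_on_disk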

jeroenmaelbrancke commented 7 years ago

https://github.com/openvstorage/framework/issues/1247

jeroenmaelbrancke commented 7 years ago

ovs-volumedriver:

Jan 02 15:36:09 stor-01.cl-g8-uk1 volumedriver_fs.sh[12576]: 2017-01-02 15:36:09 068905 +0100 - stor-01.cl-g8-uk1 - 12576/0x00007fe9167f0700 - volumedriverfs/MetaDataServerTable - 00000000001379c0 - info - ~Table: b5009c7d-4517-41b8-950e-bba3d5113205: bye
Jan 02 15:36:09 stor-01.cl-g8-uk1 volumedriver_fs.sh[12576]: 2017-01-02 15:36:09 069558 +0100 - stor-01.cl-g8-uk1 - 12576/0x00007fe9167f0700 - volumedriverfs/RocksLogger - 00000000001379c1 - info - /mnt/ssd2/vmstor_db_mds_1: EVENT_LOG_v1 {"time_micros": 1483367769069541, "job": 0, "event": "table_file_deletion", "file_number": 9204}
Jan 02 15:36:09 stor-01.cl-g8-uk1 volumedriver_fs.sh[12576]: 2017-01-02 15:36:09 069649 +0100 - stor-01.cl-g8-uk1 - 12576/0x00007fe9167f0700 - volumedriverfs/DataStoreNG - 00000000001379c2 - info - destroy: b5009c7d-4517-41b8-950e-bba3d5113205: destroying DataStore, DeleteLocalData::T
Jan 02 15:36:09 stor-01.cl-g8-uk1 volumedriver_fs.sh[12576]: 2017-01-02 15:36:09 069800 +0100 - stor-01.cl-g8-uk1 - 12576/0x00007fe9167f0700 - volumedriverfs/SCOCacheMountPoint - 00000000001379c3 - info - removeNamespace: "/mnt/ssd1/vmstor_write_sco_1": removing namespace b5009c7d-4517-41b8-950e-bba3d5113205 from mountpoint
Jan 02 15:36:09 stor-01.cl-g8-uk1 volumedriver_fs.sh[12576]: 2017-01-02 15:36:09 069981 +0100 - stor-01.cl-g8-uk1 - 12576/0x00007fe9167f0700 - volumedriverfs/Volume - 00000000001379c4 - info - destroy: b5009c7d-4517-41b8-950e-bba3d5113205: Unregistering volume from ClusterCache
Jan 02 15:36:09 stor-01.cl-g8-uk1 volumedriver_fs.sh[12576]: 2017-01-02 15:36:09 070038 +0100 - stor-01.cl-g8-uk1 - 12576/0x00007fe9167f0700 - volumedriverfs/BackendConnectionInterfaceLogger - 00000000001379c5 - info - Logger: Entering deleteNamespace b5009c7d-4517-41b8-950e-bba3d5113205
Jan 02 15:36:09 stor-01.cl-g8-uk1 volumedriver_fs.sh[12576]: 2017-01-02 15:36:09 077976 +0100 - stor-01.cl-g8-uk1 - 12576/0x00007fe9167f0700 - volumedriverfs/BackendConnectionInterfaceLogger - 00000000001379c6 - info - ~Logger: Exiting deleteNamespace for b5009c7d-4517-41b8-950e-bba3d5113205
Jan 02 15:36:09 stor-01.cl-g8-uk1 volumedriver_fs.sh[12576]: 2017-01-02 15:36:09 078017 +0100 - stor-01.cl-g8-uk1 - 12576/0x00007fe9167f0700 - volumedriverfs/VolManager - 00000000001379c7 - notice - Destroy Volume, VolumeId: b5009c7d-4517-41b8-950e-bba3d5113205, delete local data: DeleteLocalData::T, remove volume completely RemoveVolumeCompletely::T, delete namespace DeleteVolumeNamespace::T, force deletion ForceVolumeDeletion::F, FINISHED

ovs-workers:

Jan 02 15:36:09 stor-01.cl-g8-uk1 celery[32628]: 2017-01-02 15:36:09 08100 +0100 - stor-01.cl-g8-uk1 - 28603/140015761745664 - celery/celery.redirected - 187059 - WARNING - 2017-01-02 15:36:09 08100 +0100 - stor-01.cl-g8-uk1 - 28603/140015761745664 - log/volumedriver_task - 187058 - INFO - [ovs.lib.vdisk.delete_from_voldrv] - ["b5009c7d-4517-41b8-950e-bba3d5113205"] - {} - {}

In the lib.log file I see a lot of connection refused errors for healthcheck volumes:

2017-01-02 15:36:04 50900 +0100 - stor-01.cl-g8-uk1 - 29717/140184351565568 - lib/mds - 178 - DEBUG - MDS safety: vDisk 64720c77-4fc1-4f31-b0b6-09dec7b5663b: Start checkup for virtual disk ovs-healthcheck-test-mSq1IJZLFbBzbDCQ.raw
2017-01-02 15:36:04 55200 +0100 - stor-01.cl-g8-uk1 - 29717/140184351565568 - lib/mds - 179 - DEBUG - MDS safety: vDisk 64720c77-4fc1-4f31-b0b6-09dec7b5663b: Reconfiguration required. Reasons:
2017-01-02 15:36:04 55300 +0100 - stor-01.cl-g8-uk1 - 29717/140184351565568 - lib/mds - 180 - DEBUG - MDS safety: vDisk 64720c77-4fc1-4f31-b0b6-09dec7b5663b:    * Not enough safety
2017-01-02 15:36:04 55300 +0100 - stor-01.cl-g8-uk1 - 29717/140184351565568 - lib/mds - 181 - DEBUG - MDS safety: vDisk 64720c77-4fc1-4f31-b0b6-09dec7b5663b:    * Not enough services in use in primary domain
2017-01-02 15:36:05 16600 +0100 - stor-01.cl-g8-uk1 - 29717/140184351565568 - lib/vdisk - 182 - ERROR - Got failure during (re)configuration of vDisk ovs-healthcheck-test-mSq1IJZLFbBzbDCQ.raw
Traceback (most recent call last):
  File "/opt/OpenvStorage/ovs/lib/vdisk.py", line 645, in create_new
    MDSServiceController.ensure_safety(new_vdisk)
  File "/opt/OpenvStorage/ovs/lib/mdsservice.py", line 616, in ensure_safety
    client.create_namespace(str(vdisk.volume_id))
RuntimeError: Connection refused
wimpers commented 7 years ago

Is this still an issue? I believe we removed the check that tests the volumedriver by creating volumes.

kinvaris commented 7 years ago

Linked to https://github.com/openvstorage/openvstorage-health-check/issues/259

JeffreyDevloo commented 7 years ago

We are currently working around the problem by disabling the volumedriver test. However, the core issue should still be investigated, and that's why this ticket should remain open.

JeffreyDevloo commented 7 years ago

https://github.com/openvstorage/framework/issues/1390 - providing a unique id to every disk has proven to no longer produce these issues. However, the core issue is still being worked on (see the linked ticket).
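
For illustration, a collision-free name per test run needs nothing more than the standard library; the exact naming scheme used by the healthcheck is an assumption here:

import uuid

# Hypothetical naming helper: every run gets a volume name that cannot
# collide with a leftover from a previous run.
volume_name = 'ovs-healthcheck-test-{0}.raw'.format(uuid.uuid4().hex)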

wimpers commented 7 years ago

Issue no longer present. Root cause fix in linked ticket.