kinvaris closed this issue 7 years ago
We work around this in our PyrakoonStore wrapper, but not in our configuration client. I'll move some code around so the workaround applies everywhere.
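For reference, assuming the PyrakoonStore workaround boils down to retrying on transient Arakoon socket errors, below is a minimal sketch of how the same guard could be applied in the configuration client. The decorator name and retry parameters are illustrative assumptions, not the actual fix:

import time

from ovs.extensions.db.arakoon.pyrakoon.pyrakoon.compat import ArakoonSockSendError


def retry_on_sock_send_error(retries=5, backoff=1.0):
    # Hypothetical sketch: retry a call when Arakoon raises a transient
    # socket send error instead of letting it bubble up to the task.
    def decorator(func):
        def wrapped(*args, **kwargs):
            for attempt in range(retries):
                try:
                    return func(*args, **kwargs)
                except ArakoonSockSendError:
                    if attempt == retries - 1:
                        raise
                    time.sleep(backoff * (attempt + 1))
        return wrapped
    return decorator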
Also seen during creation of a vpool:
TASK DETAILS:
Received: 2016-09-19 15:13:18.085910
Started: 2016-09-19 15:13:18.088244
Failed: 2016-09-19 15:44:26.453361
This failure was seen during vpool creation:
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 438, in __protected_call__
return self.run(*args, **kwargs)
File "/opt/OpenvStorage/ovs/lib/storagerouter.py", line 1000, in add_vpool
raise RuntimeError('StorageDriver service failed to start (got no event)')
RuntimeError: StorageDriver service failed to start (got no event)
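For context: the failure above is add_vpool giving up after waiting in vain for the volumedriver's up_and_running event. A hedged sketch of the wait-with-timeout pattern behind that error message; the helper name and the 300s timeout are illustrative assumptions, not the actual framework code:

import time


def wait_for_startup_event(event_fired, timeout=300, poll_interval=1):
    # event_fired is a callable returning True once the event arrived.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if event_fired():
            return
        time.sleep(poll_interval)
    raise RuntimeError('StorageDriver service failed to start (got no event)')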
Errors seen in the volumedriver:
2016-09-19 16:51:51 615382 +0200 - perf-grav-05 - 88013/0x00007f101effd700 - volumedriverfs/MDSMetaDataBackendConfig - 00000000000040ff - error - MDSMetaDataBackendConfig: Empty node configs are not permitted
2016-09-19 16:51:51 615462 +0200 - perf-grav-05 - 88013/0x00007f101effd700 - volumedriverfs/FuseInterface - 0000000000004100 - error - convert_exceptions: "/test.raw": caught fungi::IOException: Empty node configs are not permitted , code 0
As seen from the volumedriver's side, the config is indeed missing (compared with another, working vpool):
In [5]: Configuration.get('/ovs/vpools/7647d22b-421a-40e7-a596-42488ebf37e5/hosts/all-hdd-grav9xUeLgO1qkPk9MvP/config')['filesystem']
Out[5]:
{u'fs_dtl_config_mode': u'Automatic',
u'fs_dtl_mode': u'Asynchronous',
u'fs_enable_network_interface': 1,
u'fs_enable_shm_interface': 0,
u'fs_metadata_backend_arakoon_cluster_nodes': [],
u'fs_metadata_backend_mds_nodes': [], <---------------------- not working
u'fs_metadata_backend_type': u'MDS',
u'fs_raw_disk_suffix': u'.raw',
u'fs_virtual_disk_format': u'raw'}
In [6]: Configuration.get('/ovs/vpools/0c5acb0e-9a53-491e-b78a-f81ec7a83698/hosts/all-flash-grav9xUeLgO1qkPk9MvP/config')['filesystem']
Out[6]:
{u'fs_dtl_config_mode': u'Automatic',
u'fs_dtl_mode': u'Asynchronous',
u'fs_enable_network_interface': 1,
u'fs_enable_shm_interface': 0,
u'fs_metadata_backend_arakoon_cluster_nodes': [],
u'fs_metadata_backend_mds_nodes': [{u'host': u'172.20.20.131', <---------------------- working
u'port': 26301}],
u'fs_metadata_backend_type': u'MDS',
u'fs_raw_disk_suffix': u'.raw',
u'fs_virtual_disk_format': u'raw'}
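A quick way to spot this condition on other vpools is to walk the configuration tree and flag hosts with an empty MDS node list. A diagnostic sketch, assuming Configuration.list() can iterate child keys under a prefix (this may differ per OVS version):

from ovs.extensions.generic.configuration import Configuration

for vpool_guid in Configuration.list('/ovs/vpools'):
    for host_id in Configuration.list('/ovs/vpools/{0}/hosts'.format(vpool_guid)):
        config = Configuration.get('/ovs/vpools/{0}/hosts/{1}/config'.format(vpool_guid, host_id))
        if not config['filesystem']['fs_metadata_backend_mds_nodes']:
            print 'Empty MDS node config: vpool {0}, host {1}'.format(vpool_guid, host_id)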
The required event is only 1.5 hours late (MASTER-roub01):
2016-09-19 15:13:47 58100 +0200 - ctrl-roub-01 - 7460/140306481833792 - log/volumedriver_event - 328 - INFO - [ovs.lib.vpool.up_and_running] - {"storagedriver_id": "all-hdd-grav9xUeLgO1qkPk9MvP"} - {"storagedriver": "a35de441-96d3-478d-8c84-079efa4dea91"}
2016-09-19 15:13:47 58200 +0200 - ctrl-roub-01 - 7460/140306481833792 - extensions/processor - 329 - INFO - [volumerouter] up_and_running({"storagedriver_id": "all-hdd-grav9xUeLgO1qkPk9MvP"}) started on sr.9xUeLgO1qkPk9MvP with taskid 235eae93-40c1-4291-be14-d78976f50b71. Delay: 0s
It seems the volumedriver only started at:
Mon Sep 19 15:13:47 CEST 2016 [88013]: No officially blessed alternative allocator found, using libc's allocator
Mon Sep 19 15:13:47 CEST 2016 [88013]: args: -f --config arakoon://config/ovs/vpools/7647d22b-421a-40e7-a596-42488ebf37e5/hosts/all-hdd-grav9xUeLgO1qkPk9MvP/config?ini=%2Fopt%2FOpenvStorage%2Fconfig%2Farakoon_cacc.ini --lock-file /opt/OpenvStorage/run/storagedriver_all-hdd-grav.lock --logrotation --mountpoint /mnt/all-hdd-grav --logsink console: -o big_writes -o sync_read -o allow_other -o use_ino -o default_permissions -o uid=1000 -o gid=1000 -o umask=0002
libust[88013/88013]: Warning: HOME environment variable not set. Disabling LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305)
2016-09-19 15:13:47 207456 +0200 - perf-grav-05 - 88013/0x00007f1127d41940 - volumedriverfs/volumedriver_fs - 0000000000000000 - notice - print_version_data: version: 6.0.4
version revision: cf03dad081dcc113f6121a6c912c8a755d8949d6
branch: dev
md5 support: no
revision: 7a8da553c2d2cfc98d25c33c38306f2911ed72cc
build time: Tue Sep 6 00:33:42 UTC 2016
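As an aside, the --config argument in the startup log above is URL-encoded; decoding the ini parameter shows which arakoon config the volumedriver reads:

import urllib

print urllib.unquote('%2Fopt%2FOpenvStorage%2Fconfig%2Farakoon_cacc.ini')
# -> /opt/OpenvStorage/config/arakoon_cacc.ini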
It seems the vpool up_and_running tasks failed every time due to a socket send error:
2016-09-19 15:13:47 70400 +0200 - perf-grav-05 - 96030/140115547637568 - celery/celery.worker.job - 29209 - ERROR - Task ovs.vpool.up_and_running[235eae93-40c1-4291-be14-d78976f50b71] raised unexpected: ArakoonSockSendError('Error while sending data on socket',)
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 438, in __protected_call__
return self.run(*args, **kwargs)
File "/opt/OpenvStorage/ovs/lib/helpers/decorators.py", line 56, in new_function
metadata = {'storagedriver': StorageDriverList.get_by_storagedriver_id(kwargs['storagedriver_id']).guid}
File "/opt/OpenvStorage/ovs/dal/lists/storagedriverlist.py", line 43, in get_by_storagedriver_id
'items': [('storagedriver_id', DataList.operator.EQUALS, storagedriver_id)]})
File "/opt/OpenvStorage/ovs/dal/datalist.py", line 82, in __init__
self._volatile = VolatileFactory.get_client()
File "/opt/OpenvStorage/ovs/extensions/storage/volatilefactory.py", line 38, in get_client
client_type = Configuration.get('/ovs/framework/stores|volatile')
File "/opt/OpenvStorage/ovs/extensions/generic/configuration.py", line 112, in get
data = Configuration._get(key_entries[0], raw)
File "/opt/OpenvStorage/ovs/extensions/generic/configuration.py", line 362, in _get
key=key)
File "/opt/OpenvStorage/ovs/extensions/generic/configuration.py", line 404, in _passthrough
return getattr(ArakoonConfiguration, method)(*args, **kwargs)
File "/opt/OpenvStorage/ovs/extensions/db/arakoon/configuration.py", line 39, in new_function
return f(*args, **kw)
File "/opt/OpenvStorage/ovs/extensions/db/arakoon/configuration.py", line 142, in get
return client.get(key)
File "<update_argspec>", line 5, in get
File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/compat.py", line 145, in wrapped
return fun(*args, **kwargs)
File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/compat.py", line 128, in wrapped
return fun(*new_args)
File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/compat.py", line 246, in get
return self._client.get(key,consistency = consistency_)
File "<update_argspec>", line 5, in get
File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/client/utils.py", line 99, in wrapped
return self._process(message) #pylint: disable=W0212
File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/compat.py", line 1099, in _process
connection = self._send_to_master(bytes_)
File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/compat.py", line 1160, in _send_to_master
connection = self._send_message(self.master_id, data)
File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/compat.py", line 1138, in _send_message
connection.send(data)
File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/compat.py", line 1310, in send
raise ArakoonSockSendError
ArakoonSockSendError: Error while sending data on socket
Fixed by #930 and openvstorage/alba-asdmanager#113, packaged in openvstorage-2.7.3-rev.3990.68491c4 and openvstorage-sdm-1.6.3-rev.359.418f656
Seeing as the failure on OVH happened on about 4 out of 24 snapshots every day, the failure ratio is about 1/6. This means that taking 36 snapshots would all but guarantee at least one failure.
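To make that concrete: with a per-snapshot failure probability of 1/6, the chance that 36 snapshots all succeed is (5/6)^36, so at least one failure is expected with near-certainty:

p_fail = 1 - (5.0 / 6.0) ** 36
print '{0:.2%}'.format(p_fail)  # ~99.86% chance of at least one failure in 36 snapshots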
All snapshots were taken without a single error.
Test passed.
Hyperconverged setup
Tested with the wrong packages, reopening. This could potentially be a latency issue and must be tested on a larger-scale environment.
This issue was found on our OVH environment, which has since been updated. Checking the status of its snapshots should provide enough information to tell whether the issue has been resolved.
OVH setup
All snapshots were taken; the error trace from the issue description could not be found.
Test passed.
Many tasks are failing because of an error while fetching a config file (the ArakoonSockSendError traceback above).