openvstorage / framework

The Framework is a set of components and tools which brings the user an interface (GUI / API) to setup, extend and manage an Open vStorage platform.
Other
27 stars 23 forks source link

"ArakoonSockSendError: Error while sending data on socket" during fetch config #927

Closed kinvaris closed 7 years ago

kinvaris commented 7 years ago

Many tasks are failing because of the following error during a fetch of a config file:

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/OpenvStorage/ovs/lib/helpers/decorators.py", line 168, in new_function
    persistent_client = PersistentFactory.get_client()
  File "/opt/OpenvStorage/ovs/extensions/storage/persistentfactory.py", line 40, in get_client
    client_type = Configuration.get('/ovs/framework/stores|persistent')
  File "/opt/OpenvStorage/ovs/extensions/generic/configuration.py", line 112, in get
    data = Configuration._get(key_entries[0], raw)
  File "/opt/OpenvStorage/ovs/extensions/generic/configuration.py", line 362, in _get
    key=key)
  File "/opt/OpenvStorage/ovs/extensions/generic/configuration.py", line 404, in _passthrough
    return getattr(ArakoonConfiguration, method)(*args, **kwargs)
  File "/opt/OpenvStorage/ovs/extensions/db/arakoon/configuration.py", line 39, in new_function
    return f(*args, **kw)
  File "/opt/OpenvStorage/ovs/extensions/db/arakoon/configuration.py", line 142, in get
    return client.get(key)
  File "<update_argspec>", line 5, in get
  File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/compat.py", line 145, in wrapped
    return fun(*args, **kwargs)
  File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/compat.py", line 128, in wrapped
    return fun(*new_args)
  File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/compat.py", line 246, in get
    return self._client.get(key,consistency = consistency_)
  File "<update_argspec>", line 5, in get
  File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/client/utils.py", line 99, in wrapped
    return self._process(message) #pylint: disable=W0212
  File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/compat.py", line 1099, in _process
    connection = self._send_to_master(bytes_)
  File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/compat.py", line 1160, in _send_to_master
    connection = self._send_message(self.master_id, data)
  File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/compat.py", line 1138, in _send_message
    connection.send(data)
  File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/compat.py", line 1310, in send
    raise ArakoonSockSendError
ArakoonSockSendError: Error while sending data on socket
khenderick commented 7 years ago

We work around this in our PyrakoonStore wrapper, but not in our configuration client. I'll move some code around so we work around this everywhere.

kinvaris commented 7 years ago

Also seen during creation of a vpool: TASK DETAILS:

Received 2016-09-19 15:13:18.085910 Started 2016-09-19 15:13:18.088244 Failed 2016-09-19 15:44:26.453361

Seen this failure during create vpool:

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/OpenvStorage/ovs/lib/storagerouter.py", line 1000, in add_vpool
    raise RuntimeError('StorageDriver service failed to start (got no event)')
RuntimeError: StorageDriver service failed to start (got no event)

Errors seen in the volumedriver:

016-09-19 16:51:51 615382 +0200 - perf-grav-05 - 88013/0x00007f101effd700 - volumedriverfs/MDSMetaDataBackendConfig - 00000000000040ff - error - MDSMetaDataBackendConfig: Empty node configs are not permitted
2016-09-19 16:51:51 615462 +0200 - perf-grav-05 - 88013/0x00007f101effd700 - volumedriverfs/FuseInterface - 0000000000004100 - error - convert_exceptions: "/test.raw": caught fungi::IOException: Empty node configs are not permitted , code 0

As seen in the volumedriver, config it is indeed missing: (compared with other, working vpool)

In [5]: Configuration.get('/ovs/vpools/7647d22b-421a-40e7-a596-42488ebf37e5/hosts/all-hdd-grav9xUeLgO1qkPk9MvP/config')['filesystem']
Out[5]: 
{u'fs_dtl_config_mode': u'Automatic',
 u'fs_dtl_mode': u'Asynchronous',
 u'fs_enable_network_interface': 1,
 u'fs_enable_shm_interface': 0,
 u'fs_metadata_backend_arakoon_cluster_nodes': [],
 u'fs_metadata_backend_mds_nodes': [], <---------------------- not working
 u'fs_metadata_backend_type': u'MDS',
 u'fs_raw_disk_suffix': u'.raw',
 u'fs_virtual_disk_format': u'raw'}

In [6]: Configuration.get('/ovs/vpools/0c5acb0e-9a53-491e-b78a-f81ec7a83698/hosts/all-flash-grav9xUeLgO1qkPk9MvP/config')['filesystem']
Out[6]: 
{u'fs_dtl_config_mode': u'Automatic',
 u'fs_dtl_mode': u'Asynchronous',
 u'fs_enable_network_interface': 1,
 u'fs_enable_shm_interface': 0,
 u'fs_metadata_backend_arakoon_cluster_nodes': [],
 u'fs_metadata_backend_mds_nodes': [{u'host': u'172.20.20.131', <---------------------- working
   u'port': 26301}],
 u'fs_metadata_backend_type': u'MDS',
 u'fs_raw_disk_suffix': u'.raw',
 u'fs_virtual_disk_format': u'raw'}

The required event is only 1.5 hours late: (MASTER-roub01)

2016-09-19 15:13:47 58100 +0200 - ctrl-roub-01 - 7460/140306481833792 - log/volumedriver_event - 328 - INFO - [ovs.lib.vpool.up_and_running] - {"storagedriver_id": "all-hdd-grav9xUeLgO1qkPk9MvP"} - {"storagedriver": "a35de441-96d3-478d-8c84-079efa4dea91"}
2016-09-19 15:13:47 58200 +0200 - ctrl-roub-01 - 7460/140306481833792 - extensions/processor - 329 - INFO - [volumerouter] up_and_running({"storagedriver_id": "all-hdd-grav9xUeLgO1qkPk9MvP"}) started on sr.9xUeLgO1qkPk9MvP with taskid 235eae93-40c1-4291-be14-d78976f50b71. Delay: 0s

It seems like the volumedriver only started @:

Mon Sep 19 15:13:47 CEST 2016 [88013]: No officially blessed alternative allocator found, using libc's allocator
Mon Sep 19 15:13:47 CEST 2016 [88013]: args: -f --config arakoon://config/ovs/vpools/7647d22b-421a-40e7-a596-42488ebf37e5/hosts/all-hdd-grav9xUeLgO1qkPk9MvP/config?ini=%2Fopt%2FOpenvStorage%2Fconfig%2Farakoon_cacc.ini --lock-file /opt/OpenvStorage/run/storagedriver_all-hdd-grav.lock --logrotation --mountpoint /mnt/all-hdd-grav --logsink console: -o big_writes -o sync_read -o allow_other -o use_ino -o default_permissions -o uid=1000 -o gid=1000 -o umask=0002
libust[88013/88013]: Warning: HOME environment variable not set. Disabling LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305)
2016-09-19 15:13:47 207456 +0200 - perf-grav-05 - 88013/0x00007f1127d41940 - volumedriverfs/volumedriver_fs - 0000000000000000 - notice - print_version_data: version: 6.0.4
version revision: cf03dad081dcc113f6121a6c912c8a755d8949d6
branch: dev
md5 support: no
revision: 7a8da553c2d2cfc98d25c33c38306f2911ed72cc
build time: Tue Sep  6 00:33:42 UTC 2016

Seems like the vpool_up_and_running tasks failed every time due to socket send error:

2016-09-19 15:13:47 70400 +0200 - perf-grav-05 - 96030/140115547637568 - celery/celery.worker.job - 29209 - ERROR - Task ovs.vpool.up_and_running[235eae93-40c1-4291-be14-d78976f50b71] raised unexpected: ArakoonSockS
endError('Error while sending data on socket',)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/OpenvStorage/ovs/lib/helpers/decorators.py", line 56, in new_function
    metadata = {'storagedriver': StorageDriverList.get_by_storagedriver_id(kwargs['storagedriver_id']).guid}
  File "/opt/OpenvStorage/ovs/dal/lists/storagedriverlist.py", line 43, in get_by_storagedriver_id
    'items': [('storagedriver_id', DataList.operator.EQUALS, storagedriver_id)]})
  File "/opt/OpenvStorage/ovs/dal/datalist.py", line 82, in __init__
    self._volatile = VolatileFactory.get_client()
  File "/opt/OpenvStorage/ovs/extensions/storage/volatilefactory.py", line 38, in get_client
    client_type = Configuration.get('/ovs/framework/stores|volatile')
  File "/opt/OpenvStorage/ovs/extensions/generic/configuration.py", line 112, in get
    data = Configuration._get(key_entries[0], raw)
  File "/opt/OpenvStorage/ovs/extensions/generic/configuration.py", line 362, in _get
    key=key)
  File "/opt/OpenvStorage/ovs/extensions/generic/configuration.py", line 404, in _passthrough
    return getattr(ArakoonConfiguration, method)(*args, **kwargs)
  File "/opt/OpenvStorage/ovs/extensions/db/arakoon/configuration.py", line 39, in new_function
    return f(*args, **kw)
  File "/opt/OpenvStorage/ovs/extensions/db/arakoon/configuration.py", line 142, in get
    return client.get(key)
  File "<update_argspec>", line 5, in get
  File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/compat.py", line 145, in wrapped
    return fun(*args, **kwargs)
  File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/compat.py", line 128, in wrapped
    return fun(*new_args)
  File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/compat.py", line 246, in get
    return self._client.get(key,consistency = consistency_)
  File "<update_argspec>", line 5, in get
  File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/client/utils.py", line 99, in wrapped
    return self._process(message) #pylint: disable=W0212
  File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/compat.py", line 1099, in _process
    connection = self._send_to_master(bytes_)
  File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/compat.py", line 1160, in _send_to_master
    connection = self._send_message(self.master_id, data)
  File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/compat.py", line 1138, in _send_message
    connection.send(data)
  File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/compat.py", line 1310, in send
    raise ArakoonSockSendError
ArakoonSockSendError: Error while sending data on socket
khenderick commented 7 years ago

Fixed by #930 and openvstorage/alba-asdmanager#113, packaged in openvstorage-2.7.3-rev.3990.68491c4 and openvstorage-sdm-1.6.3-rev.359.418f656

JeffreyDevloo commented 7 years ago

Steps

Seeing as the faliure on ovh happend on about 4/24 snapshots every day, the ratio of failure is 1/6. This means that taking 36 snapshots would guarantee (statistically) on a failure.

Output

All snapshots were taken without a single error.

Test result

Test passed.

Setup

Hyperconverged setup

Package information

JeffreyDevloo commented 7 years ago

Tested with wrong packages, reopening. Could potentially be a latency issue and must be tested on a bigger scaled environment.

JeffreyDevloo commented 7 years ago

Information

This issue was found on our OVH environment. As of now this environment has been updated. Checking the status of its snapshots would provide enough information whether this issue has been resolved or not

Steps:

OVH setup

Output

All snapshots are taken: selection_104 Could not find the error trace as in the issue description

Test result

Test passed.