scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

publish_event_guaranteed raising timeout that fails tests #1803

Closed fruch closed 4 years ago

fruch commented 4 years ago

Logs

[2020-02-11T16:04:34.093Z] ======================================================================
[2020-02-11T16:04:34.093Z] ERROR: test_write (performance_regression_alternator_test.PerformanceRegressionAlternatorTest)
[2020-02-11T16:04:34.093Z] ----------------------------------------------------------------------
[2020-02-11T16:04:34.093Z] Traceback (most recent call last):
[2020-02-11T16:04:34.093Z]   File "/sct/performance_regression_alternator_test.py", line 33, in test_write
[2020-02-11T16:04:34.093Z]     results = self.get_stress_results(queue=stress_queue)
[2020-02-11T16:04:34.093Z]   File "/sct/sdcm/tester.py", line 837, in get_stress_results
[2020-02-11T16:04:34.093Z]     results = queue.get_results()
[2020-02-11T16:04:34.093Z]   File "/sct/sdcm/utils/thread.py", line 76, in get_results
[2020-02-11T16:04:34.093Z]     results.append(future.result())
[2020-02-11T16:04:34.093Z]   File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 425, in result
[2020-02-11T16:04:34.093Z]     return self.__get_result()
[2020-02-11T16:04:34.093Z]   File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
[2020-02-11T16:04:34.093Z]     raise self._exception
[2020-02-11T16:04:34.094Z]   File "/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
[2020-02-11T16:04:34.094Z]     result = self.fn(*self.args, **self.kwargs)
[2020-02-11T16:04:34.094Z]   File "/sct/sdcm/ycsb_thread.py", line 188, in _run_stress
[2020-02-11T16:04:34.094Z]     YcsbStressEvent('start', node=loader, stress_cmd=stress_cmd)
[2020-02-11T16:04:34.094Z]   File "/sct/sdcm/sct_events.py", line 475, in __init__
[2020-02-11T16:04:34.094Z]     self.publish()
[2020-02-11T16:04:34.094Z]   File "/sct/sdcm/sct_events.py", line 188, in publish
[2020-02-11T16:04:34.094Z]     return EVENTS_PROCESSES['MainDevice'].publish_event_guaranteed(self)
[2020-02-11T16:04:34.094Z]   File "/sct/sdcm/utils/common.py", line 110, in inner
[2020-02-11T16:04:34.094Z]     return func(*args, **kwargs)
[2020-02-11T16:04:34.094Z]   File "/sct/sdcm/sct_events.py", line 149, in publish_event_guaranteed
[2020-02-11T16:04:34.094Z]     raise TimeoutError()
[2020-02-11T16:04:34.094Z] TimeoutError
[2020-02-11T16:04:34.094Z]
[2020-02-11T16:04:34.094Z] ----------------------------------------------------------------------

Description

Recently I've been seeing this TimeoutError from publish_event_guaranteed in a lot of the tests I'm running for Alternator.

Maybe we should change it so that it only prints an error to the log?
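
A minimal sketch of that suggestion, under stated assumptions: `publish_or_log` and the logger name are hypothetical and are not part of sct_events.py, and `publish` stands in for a call such as publish_event_guaranteed, whose real signature is not shown in this issue. The wrapper logs an error when publishing times out instead of letting the exception fail the stress thread.

```python
import logging

# Hypothetical logger name, used only for this sketch.
LOGGER = logging.getLogger("sct_events_sketch")


def publish_or_log(publish, event):
    """Call publish(event); on TimeoutError, log an error instead of raising.

    `publish` is any callable that takes the event. It is an assumed
    stand-in for the real publishing function.
    """
    try:
        return publish(event)
    except TimeoutError:
        # The event is lost, but the calling test thread keeps running.
        LOGGER.error("Timed out publishing event: %r", event)
        return None
```

The trade-off is that a lost event is then only visible in the log, so this hides the symptom rather than removing the cause of the timeout.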

bentsi commented 4 years ago

Dmitry addressed this in #1757. @dkropachev, let's separate your fix from that PR and create a PR for the fix only.

fruch commented 4 years ago

@bentsi the fix there is only for StartupTestEvent; my issues were seen long after startup. I think we shouldn't be reusing the socket, since we are raising events from multiple threads.

http://zguide.zeromq.org/page:all#Multithreading-with-ZeroMQ has this nice warning:

Don't share ZeroMQ sockets between threads. ZeroMQ sockets are not threadsafe. Technically it's possible to migrate a socket from one thread to another but it demands skill. The only place where it's remotely sane to share sockets between threads are in language bindings that need to do magic like garbage collection on sockets.
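
One way to follow that advice while still reusing sockets is to give each thread its own. The sketch below shows the pattern with `threading.local`; `FakeSocket`, `get_socket`, `publish`, and `run_publishers` are hypothetical names, and `FakeSocket` is only a stand-in so the example runs without ZeroMQ installed. It is not the code from sct_events.py.

```python
import threading


class FakeSocket:
    """Stand-in for a real socket; remembers the thread that created it."""

    def __init__(self):
        self.owner = threading.current_thread()
        self.sent = []

    def send(self, message):
        # Fail loudly if the socket is ever used from another thread.
        assert threading.current_thread() is self.owner, "socket crossed threads"
        self.sent.append(message)


_LOCAL = threading.local()  # each thread sees its own attributes on this object


def get_socket(factory=FakeSocket):
    """Return the calling thread's socket, creating it on first use."""
    sock = getattr(_LOCAL, "sock", None)
    if sock is None:
        sock = _LOCAL.sock = factory()
    return sock


def publish(message):
    get_socket().send(message)


def run_publishers(n_threads=4, n_messages=10):
    """Publish from several threads and return the socket each thread used."""
    sockets = []
    lock = threading.Lock()

    def worker(idx):
        for i in range(n_messages):
            publish(f"thread-{idx}-event-{i}")
        with lock:
            sockets.append(get_socket())

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sockets
```

With a real messaging library, `factory` would create and connect the actual socket; sockets are still reused across events, but never across threads.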

dkropachev commented 4 years ago

That is correct. I will investigate how we can address this issue: either we make the socket reuse smarter, or we stop reusing the socket completely.
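
The two options can be sketched roughly as follows. Everything here is hypothetical: `Connection` is a stand-in object rather than a ZeroMQ API, and whether a plain lock is enough for the real socket type is an assumption that would need checking. `LockedPublisher` is one reading of "smarter reuse"; `publish_once` is the no-reuse variant.

```python
import threading
from contextlib import closing


class Connection:
    """Stand-in for a socket connection (hypothetical)."""

    def __init__(self):
        self.closed = False
        self.sent = []

    def send(self, message):
        if self.closed:
            raise RuntimeError("send on closed connection")
        self.sent.append(message)

    def close(self):
        self.closed = True


class LockedPublisher:
    """Option 1: keep one connection, but serialize all access with a lock."""

    def __init__(self):
        self._conn = Connection()
        self._lock = threading.Lock()

    def publish(self, message):
        with self._lock:  # only one thread touches the connection at a time
            self._conn.send(message)


def publish_once(message):
    """Option 2: no reuse; open, send, and close for every event."""
    with closing(Connection()) as conn:
        conn.send(message)
    return conn  # returned only so callers can inspect it in this sketch
```

Option 2 pays the connection setup cost on every event, which is the speed cost that reuse was meant to avoid, but it has no shared state between threads.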

fruch commented 4 years ago

Was speed the reason for the reuse? If so, my vote is: let's stop reusing it.

dkropachev commented 4 years ago

Israel, yep, speed is the reason. Here is the PR: https://github.com/scylladb/scylla-cluster-tests/pull/1804