scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0
57 stars 94 forks source link

Process Network exception during node upgrade for nodetool drain #6869

Open aleksbykov opened 11 months ago

aleksbykov commented 11 months ago

Prerequisites

Versions

Logs

Description

during upgrade node, nodetool command nodetool drain executed. But the command raise exception and terminate job if network exception happened:

< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2023-11-25 05:59:48.927: (NodetoolEvent Severity.ERROR) period_type=end event_id=1a075b52-c01b-497f-9a00-71bf99ae1e55 duration=10s: nodetool_command=drain node=rolling-upgrade--centos-db-node-927b91e5-0-1 errors=['', 'Traceback (most recent call last):\n  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 605, in _run\n    return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)\n  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 538, in _run_execute\n    result = connection.run(**command_kwargs)\n  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run\n    return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)\n  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 650, in _complete_run\n    raise exception\n  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 77, in run\n    self._read_output(self._session, self._channel, self._timeout,\n  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 105, in _read_output\n    stdout_size, stdout_chunk = channel.read()\n  File "ssh2/channel.pyx", line 116, in ssh2.channel.Channel.read\n  File "ssh2/channel.pyx", line 144, in ssh2.channel.Channel.read_ex\n  File "ssh2/utils.pyx", line 214, in ssh2.utils.handle_error_codes\nssh2.exceptions.SocketRecvError\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2507, in run_nodetool\n    self.remoter.run(cmd, timeout=timeout, ignore_status=ignore_status, verbose=verbose, retry=retry)\n  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 614, in run\n    result = _run()\n  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 65, in inner\n    return func(*args, **kwargs)\n  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 607, in _run\n    if self._run_on_retryable_exception(exc, new_session):\n  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_libssh_cmd_runner.py", line 78, in _run_on_retryable_exception\n    raise RetryableNetworkException(str(exc), original=exc)\nsdcm.remote.base.RetryableNetworkException\n']
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Traceback (most recent call last):
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 605, in _run
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 538, in _run_execute
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     result = connection.run(**command_kwargs)
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 650, in _complete_run
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     raise exception
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 77, in run
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     self._read_output(self._session, self._channel, self._timeout,
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 105, in _read_output
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     stdout_size, stdout_chunk = channel.read()
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "ssh2/channel.pyx", line 116, in ssh2.channel.Channel.read
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "ssh2/channel.pyx", line 144, in ssh2.channel.Channel.read_ex
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "ssh2/utils.pyx", line 214, in ssh2.utils.handle_error_codes
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > ssh2.exceptions.SocketRecvError
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > During handling of the above exception, another exception occurred:
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Traceback (most recent call last):
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2507, in run_nodetool
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     self.remoter.run(cmd, timeout=timeout, ignore_status=ignore_status, verbose=verbose, retry=retry)
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 614, in run
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     result = _run()
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 65, in inner
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     return func(*args, **kwargs)
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 607, in _run
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     if self._run_on_retryable_exception(exc, new_session):
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_libssh_cmd_runner.py", line 78, in _run_on_retryable_exception
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     raise RetryableNetworkException(str(exc), original=exc)
< t:2023-11-25 05:59:48,930 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > sdcm.remote.base.RetryableNetworkException

Steps to Reproduce

issue reproduced rarely. Latest was happened: https://argus.scylladb.com/test/3f60c813-ab37-4cea-b400-21e8d4fc6dc9/runs?additionalRuns[]=927b91e5-b96c-46e1-ae7e-a68f2510e9ed

as work around we need catch such exception/ or don't raise exception and use warning

fruch commented 11 months ago

Is this similar to https://github.com/scylladb/scylladb/issues/14004

I.e. can you see if we are getting SSH bad packet length ?

aleksbykov commented 11 months ago

It looks so, rerun job again failed. i will check the logs for issue #14004

aleksbykov commented 11 months ago

Issue reproduced again during: https://argus.scylladb.com/test/97a03f7d-fd85-468c-8c23-a465f7693243/runs?additionalRuns[]=8c1c057d-6956-4151-80c5-2fc102687cee

But on this time "sshd complains on Bad packet length" error was not found in logs.

Is it possible, that drain process could restart network interface or somehow close all connections to host?

roydahan commented 11 months ago

@aleksbykov if it's a scylla issue, please move your report to scylla and let's close this one.