redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

CI Failure (Error reading SSH protocol banner) in `NodesDecommissioningTest.test_decommissioning_cancel_ongoing_movements` #19915

Closed vbotbuildovich closed 1 week ago

vbotbuildovich commented 3 months ago

https://buildkite.com/redpanda/vtools/builds/14747

Module: rptest.tests.nodes_decommissioning_test
Class: NodesDecommissioningTest
Method: test_decommissioning_cancel_ongoing_movements
test_id:    NodesDecommissioningTest.test_decommissioning_cancel_ongoing_movements
status:     FAIL
run time:   47.585 seconds

SSHException('Error reading SSH protocol banner')
Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/paramiko/transport.py", line 2327, in _check_banner
    buf = self.packetizer.readline(timeout)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/paramiko/packet.py", line 381, in readline
    buf += self._read_timeout(timeout)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/paramiko/packet.py", line 618, in _read_timeout
    raise EOFError()
EOFError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 276, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 105, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/tests/nodes_decommissioning_test.py", line 380, in test_decommissioning_cancel_ongoing_movements
    self.start_producer()
  File "/home/ubuntu/redpanda/tests/rptest/tests/nodes_decommissioning_test.py", line 272, in start_producer
    self.producer.start(clean=False)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/services/service.py", line 265, in start
    self.start_node(node, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/kgo_verifier_services.py", line 672, in start_node
    self.spawn(cmd, node)
  File "/home/ubuntu/redpanda/tests/rptest/services/kgo_verifier_services.py", line 155, in spawn
    pid_str = node.account.ssh_output(wrapped_cmd, timeout_sec=10)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/cluster/remoteaccount.py", line 41, in wrapper
    return method(self, *args, **kwargs)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/cluster/remoteaccount.py", line 381, in ssh_output
    client = self.ssh_client
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/cluster/remoteaccount.py", line 217, in ssh_client
    self._set_ssh_client()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/cluster/remoteaccount.py", line 41, in wrapper
    return method(self, *args, **kwargs)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/cluster/remoteaccount.py", line 192, in _set_ssh_client
    client.connect(
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/paramiko/client.py", line 450, in connect
    t.start_client(timeout=timeout)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/paramiko/transport.py", line 738, in start_client
    raise e
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/paramiko/transport.py", line 2143, in run
    self._check_banner()
paramiko.ssh_exception.SSHException: Error reading SSH protocol banner

JIRA Link: CORE-4241

travisdowns commented 3 months ago

This error has been happening across a few tests with no real pattern.

This "Error reading SSH protocol banner" error generally means that we can't connect to the host (in this case a CDT host) over ssh. This has happened in the past for various reasons; see, e.g., https://github.com/redpanda-data/redpanda/issues/6967 and https://github.com/redpanda-data/redpanda/issues/7086.
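For anyone who wants to check a host by hand, here is a minimal sketch of a banner probe using only the standard library. It assumes the usual SSH behaviour where the server sends an identification line starting with `SSH-` right after the TCP connection is accepted. The function name `read_ssh_banner` and its defaults are made up for illustration and are not part of ducktape or paramiko. As a simplification, it only looks at the first line the server sends.

```python
import socket


def read_ssh_banner(host, port=22, timeout=5.0):
    """Return the server's SSH identification line, or None.

    None means the connection failed, timed out, or was closed before a
    line starting with "SSH-" arrived (hypothetical helper, simplified:
    only the first line sent by the server is inspected).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            buf = b""
            # Read until the first newline, capped at 255 bytes.
            while b"\n" not in buf and len(buf) < 255:
                chunk = sock.recv(255 - len(buf))
                if not chunk:
                    # Peer closed the connection before sending a full line,
                    # the same situation as the EOFError in the traceback.
                    return None
                buf += chunk
    except OSError:
        # Covers refused connections, resets, and timeouts.
        return None
    line = buf.split(b"\n", 1)[0].rstrip(b"\r").decode("ascii", errors="replace")
    return line if line.startswith("SSH-") else None
```

A `None` result from the machine running the tests would suggest the connection is being dropped before sshd sends its banner, rather than failing later during authentication.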

Ducktape has an "ssh checker" thing we can enable, which does some diagnostics in cases like this. We can also retry, or try to extract the sshd log from the target node (assuming ssh does come back) to see what it says.
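The retry option could look roughly like the sketch below. This is not how ducktape does it; `connect_with_retry` and all of its parameters are hypothetical. The `connect` argument stands in for whatever opens the SSH connection, and `retry_on` would be the exception type seen in the traceback (`paramiko.ssh_exception.SSHException`) in real use.

```python
import time


def connect_with_retry(connect, retry_on=(Exception,), attempts=3,
                       backoff_sec=1.0, sleep=time.sleep):
    """Call connect() until it succeeds or attempts run out.

    Hypothetical helper: retries only on the exception types in retry_on,
    waits backoff_sec * attempt_number between tries, and re-raises the
    last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_err = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except retry_on as err:
            last_err = err
            if attempt < attempts:
                # Linear backoff: 1x, 2x, ... the base delay.
                sleep(backoff_sec * attempt)
    raise last_err
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays. Retrying would only mask the problem, so it is probably best combined with logging each failed attempt.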

One past cause of this was random public traffic hitting port 22, which made sshd reject connections with some probability. Currently we limit port 22 access to the IP of the machine running DT (though we should double-check this), so we should not see random attempts on this port.

piyushredpanda commented 3 months ago

@twmb this seems like an infra thing we'd help on.

github-actions[bot] commented 1 week ago

This issue hasn't seen activity in 3 months. If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in two weeks.

piyushredpanda commented 1 week ago

Closing older bot-filed CI issues as we transition to a more reliable system.