redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

CI Failure (key symptom) in `RedpandaUpgradeTest.test_workloads_through_releases` #18452

Closed vbotbuildovich closed 2 days ago

vbotbuildovich commented 2 months ago

https://buildkite.com/redpanda/vtools/builds/13676

Module: rptest.tests.workload_upgrade_runner_test
Class: RedpandaUpgradeTest
Method: test_workloads_through_releases
Arguments: {
    "cloud_storage_type": 1
}
test_id:    RedpandaUpgradeTest.test_workloads_through_releases
status:     FAIL
run time:   537.213 seconds

RemoteCommandError({'ssh_config': {'host': 'ip-172-31-11-46', 'hostname': '172.31.11.46', 'user': 'root', 'port': 22, 'password': None, 'identityfile': '/home/ubuntu/.ssh/id_rsa'}, 'hostname': 'ip-172-31-11-46', 'ssh_hostname': '172.31.11.46', 'user': 'root', 'externally_routable_ip': '35.162.166.49', '_logger': <Logger rptest.tests.workload_upgrade_runner_test.RedpandaUpgradeTest.test_workloads_through_releases.cloud_storage_type=CloudStorageType.S3-820 (DEBUG)>, 'os': 'linux', '_ssh_client': <paramiko.client.SSHClient object at 0x7fd8a5a22710>, '_sftp_client': <paramiko.sftp_client.SFTPClient object at 0x7fd8a5a4f100>, '_custom_ssh_exception_checks': None}, 'curl -fsSL https://vectorized-public.s3.us-west-2.amazonaws.com/releases/redpanda/23.3.15/redpanda-23.3.15-amd64.tar.gz --retry 3 --retry-connrefused --retry-delay 2 --create-dir -o /opt/redpanda_installs/v23.3.15/redpanda.tar.gz && gunzip -c /opt/redpanda_installs/v23.3.15/redpanda.tar.gz | tar -xf - -C /opt/redpanda_installs/v23.3.15 && rm /opt/redpanda_installs/v23.3.15/redpanda.tar.gz', 35, b'')
Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 276, in run_test
    return self.test_context.function(self.test)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 535, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 103, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/tests/workload_upgrade_runner_test.py", line 278, in test_workloads_through_releases
    for current_version in self.upgrade_through_versions(
  File "/home/ubuntu/redpanda/tests/rptest/tests/redpanda_test.py", line 247, in upgrade_through_versions
    current_version = install_next()
  File "/home/ubuntu/redpanda/tests/rptest/tests/redpanda_test.py", line 174, in install_next
    self.redpanda._installer.install(self.redpanda.nodes, v)
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda_installer.py", line 609, in install
    self._install_unlocked(nodes, install_target)
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda_installer.py", line 658, in _install_unlocked
    raise e
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda_installer.py", line 638, in _install_unlocked
    self.wait_for_async_ssh(self._redpanda.logger,
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda_installer.py", line 165, in wait_for_async_ssh
    for l in ssh_out_per_node[node]:
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/cluster/remoteaccount.py", line 687, in next
    return next(self.iter_obj)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/cluster/remoteaccount.py", line 354, in output_generator
    raise RemoteCommandError(self, cmd, exit_status, stderr.read())
ducktape.cluster.remoteaccount.RemoteCommandError: root@ip-172-31-11-46: Command 'curl -fsSL https://vectorized-public.s3.us-west-2.amazonaws.com/releases/redpanda/23.3.15/redpanda-23.3.15-amd64.tar.gz --retry 3 --retry-connrefused --retry-delay 2 --create-dir -o /opt/redpanda_installs/v23.3.15/redpanda.tar.gz && gunzip -c /opt/redpanda_installs/v23.3.15/redpanda.tar.gz | tar -xf - -C /opt/redpanda_installs/v23.3.15 && rm /opt/redpanda_installs/v23.3.15/redpanda.tar.gz' returned non-zero exit status 35.

JIRA Link: CORE-2940

rpdevmp commented 2 months ago

This one doesn't look like an infra issue. Originally that is what I thought, since it shows an SSH issue and a timeout of 20 seconds.

But I looked at the test code, and the logic has not changed for almost two years.

It is related to the upgrade: we wait for the service to come up.

Error message is: ducktape.errors.TimeoutError: Redpanda service ip-172-31-6-14 failed to start within 20 sec

We could increase the timeout and see what happens, but since this code is not new, it looks like a possible degradation in the product. I will assign it to myself and investigate more. Let's see if we can reproduce it.

rpdevmp commented 2 months ago

Duplicate of https://github.com/redpanda-data/redpanda/issues/13306

vbotbuildovich commented 1 month ago

https://buildkite.com/redpanda/vtools/builds/13866 https://buildkite.com/redpanda/vtools/builds/14212

vbotbuildovich commented 1 month ago

https://buildkite.com/redpanda/vtools/builds/14267 https://buildkite.com/redpanda/vtools/builds/14280

vbotbuildovich commented 1 month ago

https://buildkite.com/redpanda/vtools/builds/14463

travisdowns commented 1 month ago

@rpdevmp wrote:

Error message is: ducktape.errors.TimeoutError: Redpanda service ip-172-31-6-14 failed to start within 20 sec

... but the error message from the stack in the top comment is:

ducktape.cluster.remoteaccount.RemoteCommandError: root@ip-172-31-11-46: Command 'curl -fsSL https://vectorized-public.s3.us-west-2.amazonaws.com/releases/redpanda/23.3.15/redpanda-23.3.15-amd64.tar.gz --retry 3 --retry-connrefused --retry-delay 2 --create-dir -o /opt/redpanda_installs/v23.3.15/redpanda.tar.gz && gunzip -c /opt/redpanda_installs/v23.3.15/redpanda.tar.gz | tar -xf - -C /opt/redpanda_installs/v23.3.15 && rm /opt/redpanda_installs/v23.3.15/redpanda.tar.gz' returned non-zero exit status 35.

This clearly looks like curl failing to connect to the S3 bucket (it's hard to tell for sure because of the long pipeline, but based on errors we see elsewhere, the 35 is almost certainly from curl).

So I don't understand the comment about Redpanda failing to start. Where do you see that?

I suspect these are duplicates of https://github.com/redpanda-data/redpanda/issues/18607. We are using curl here rather than requests, but the endpoint is the same and curl error 35 is an SSL-related error just like we got in Python.
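
To make the attribution less of a guess, the install command could be run one stage at a time so that a non-zero exit status is tied to a specific tool. The following is only a rough sketch of that idea, not the code in `redpanda_installer.py`: `first_failing_stage`, `describe`, and the `CURL_EXIT_CODES` table are hypothetical names, the stages run sequentially rather than piped, and the table lists only a few of the exit codes documented in the curl man page.

```python
import subprocess
import sys

# A few exit codes from the curl man page (not the full list).
CURL_EXIT_CODES = {
    6: "could not resolve host",
    7: "failed to connect to host",
    28: "operation timed out",
    35: "SSL/TLS connect error (handshake failed)",
}


def first_failing_stage(stages):
    """Run (name, argv) stages in order.

    Returns (name, returncode) for the first stage that exits non-zero,
    or None if every stage succeeds. Later stages are not run after a
    failure, which mirrors the `&&` chaining in the original command.
    """
    for name, argv in stages:
        result = subprocess.run(argv, capture_output=True)
        if result.returncode != 0:
            return name, result.returncode
    return None


def describe(name, code):
    """Turn a failing stage and exit status into a readable message."""
    if name == "curl":
        meaning = CURL_EXIT_CODES.get(code, "see the curl man page")
        return f"{name} exited {code}: {meaning}"
    return f"{name} exited {code}"


if __name__ == "__main__":
    # Stand-in stages so the sketch runs without network access:
    # the first one imitates curl failing with exit status 35.
    fake_stages = [
        ("curl", [sys.executable, "-c", "import sys; sys.exit(35)"]),
        ("tar", [sys.executable, "-c", "pass"]),
    ]
    failure = first_failing_stage(fake_stages)
    if failure is not None:
        print(describe(*failure))
```

In a real run the stages would be the actual `curl`, `tar`, and `rm` invocations. As far as I know, curl's `--retry` only retries errors it treats as transient (timeouts and certain HTTP response codes), so a handshake failure reported as exit 35 would probably not be retried by the flags used here.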

vbotbuildovich commented 3 weeks ago

https://buildkite.com/redpanda/vtools/builds/15156

michael-redpanda commented 2 days ago

Automatically closing issue to match current state of CORE-2940