redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

CI Failure (key symptom) in `UpgradeBackToBackTest.test_upgrade_with_all_workloads` #21624

Open vbotbuildovich opened 1 month ago

vbotbuildovich commented 1 month ago

https://buildkite.com/redpanda/vtools/builds/15926

Module: rptest.tests.upgrade_test
Class: UpgradeBackToBackTest
Method: test_upgrade_with_all_workloads
Arguments: {
    "single_upgrade": false
}
test_id:    UpgradeBackToBackTest.test_upgrade_with_all_workloads
status:     FAIL
run time:   573.837 seconds

RemoteCommandError({'ssh_config': {'host': 'ducktape-node-10-amazingly-saving-quetzal', 'hostname': '10.168.0.124', 'user': 'root', 'port': 22, 'password': None, 'identityfile': '/home/ubuntu/.ssh/id_rsa'}, 'hostname': 'ducktape-node-10-amazingly-saving-quetzal', 'ssh_hostname': '10.168.0.124', 'user': 'root', 'externally_routable_ip': '34.102.33.140', '_logger': <Logger rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=False-490 (DEBUG)>, 'os': 'linux', '_ssh_client': <paramiko.client.SSHClient object at 0x79b870b9b6a0>, '_sftp_client': <paramiko.sftp_client.SFTPClient object at 0x79b870bbd3c0>, '_custom_ssh_exception_checks': None}, 'python3 /opt/scripts/offline_log_viewer/viewer.py --path /var/lib/redpanda/data --type controller_snapshot', 1, b'INFO:viewer:starting metadata viewer with options: Namespace(path=\'/var/lib/redpanda/data\', type=\'controller_snapshot\', topic=None, verbose=False, dump=False, force=False)\nTraceback (most recent call last):\n  File "/opt/scripts/offline_log_viewer/viewer.py", line 235, in <module>\n    main()\n  File "/opt/scripts/offline_log_viewer/viewer.py", line 215, in main\n    print_controller_snapshot(store, options.dump)\n  File "/opt/scripts/offline_log_viewer/viewer.py", line 70, in print_controller_snapshot\n    SerializableGenerator(snap.to_dict().items()))\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1346, in to_dict\n    return self.parse_snapshot(sf)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1336, in parse_snapshot\n    data = reader.read_checksum_envelope(\n  File "/opt/scripts/offline_log_viewer/reader.py", line 139, in read_checksum_envelope\n    return self.read_envelope_inner(envelope, type_read, max_version)\n  File "/opt/scripts/offline_log_viewer/reader.py", line 144, in read_envelope_inner\n    v = type_read(self, envelope.version)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1337, in <lambda>\n    
type_read=lambda r, _: self.read_snapshot(r), max_version=2)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1305, in read_snapshot\n    data[\'topics\'] = rdr.read_envelope(\n  File "/opt/scripts/offline_log_viewer/reader.py", line 133, in read_envelope\n    return self.read_envelope_inner(envelope, type_read, max_version)\n  File "/opt/scripts/offline_log_viewer/reader.py", line 144, in read_envelope_inner\n    v = type_read(self, envelope.version)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1306, in <lambda>\n    type_read=lambda r, v: self.read_topics(r, v), max_version=1)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1186, in read_topics\n    rdr.read_serde_map(\n  File "/opt/scripts/offline_log_viewer/reader.py", line 204, in read_serde_map\n    key = k_reader(self)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1134, in read_tp_ns_to_str\n    return f"{v[\'namespace\']}/{v[\'topic\']}"\nTypeError: string indices must be integers\n')
Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 276, in run_test
    return self.test_context.function(self.test)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 535, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 105, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/tests/upgrade_test.py", line 262, in test_upgrade_with_all_workloads
    controller_snapshot = log_viewer.read_controller_snapshot(
  File "/home/ubuntu/redpanda/tests/rptest/clients/offline_log_viewer.py", line 53, in read_controller_snapshot
    return self._json_cmd(node, "--type controller_snapshot")
  File "/home/ubuntu/redpanda/tests/rptest/clients/offline_log_viewer.py", line 34, in _json_cmd
    json_out = node.account.ssh_output(cmd, combine_stderr=False)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/cluster/remoteaccount.py", line 41, in wrapper
    return method(self, *args, **kwargs)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/cluster/remoteaccount.py", line 397, in ssh_output
    raise RemoteCommandError(self, cmd, exit_status, stderr.read())
ducktape.cluster.remoteaccount.RemoteCommandError: root@ducktape-node-10-amazingly-saving-quetzal: Command 'python3 /opt/scripts/offline_log_viewer/viewer.py --path /var/lib/redpanda/data --type controller_snapshot' returned non-zero exit status 1. Remote error message: b'INFO:viewer:starting metadata viewer with options: Namespace(path=\'/var/lib/redpanda/data\', type=\'controller_snapshot\', topic=None, verbose=False, dump=False, force=False)\nTraceback (most recent call last):\n  File "/opt/scripts/offline_log_viewer/viewer.py", line 235, in <module>\n    main()\n  File "/opt/scripts/offline_log_viewer/viewer.py", line 215, in main\n    print_controller_snapshot(store, options.dump)\n  File "/opt/scripts/offline_log_viewer/viewer.py", line 70, in print_controller_snapshot\n    SerializableGenerator(snap.to_dict().items()))\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1346, in to_dict\n    return self.parse_snapshot(sf)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1336, in parse_snapshot\n    data = reader.read_checksum_envelope(\n  File "/opt/scripts/offline_log_viewer/reader.py", line 139, in read_checksum_envelope\n    return self.read_envelope_inner(envelope, type_read, max_version)\n  File "/opt/scripts/offline_log_viewer/reader.py", line 144, in read_envelope_inner\n    v = type_read(self, envelope.version)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1337, in <lambda>\n    type_read=lambda r, _: self.read_snapshot(r), max_version=2)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1305, in read_snapshot\n    data[\'topics\'] = rdr.read_envelope(\n  File "/opt/scripts/offline_log_viewer/reader.py", line 133, in read_envelope\n    return self.read_envelope_inner(envelope, type_read, max_version)\n  File "/opt/scripts/offline_log_viewer/reader.py", line 144, in read_envelope_inner\n    v = type_read(self, envelope.version)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1306, in 
<lambda>\n    type_read=lambda r, v: self.read_topics(r, v), max_version=1)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1186, in read_topics\n    rdr.read_serde_map(\n  File "/opt/scripts/offline_log_viewer/reader.py", line 204, in read_serde_map\n    key = k_reader(self)\n  File "/opt/scripts/offline_log_viewer/controller.py", line 1134, in read_tp_ns_to_str\n    return f"{v[\'namespace\']}/{v[\'topic\']}"\nTypeError: string indices must be integers\n'

JIRA Link: CORE-5780
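
The last frame of the remote traceback fails on `v['namespace']` with `TypeError: string indices must be integers`, which is what Python raises when a `str` is subscripted with a string key. Below is a minimal sketch of that failure mode, assuming the key reader received a plain string where it expected a dict with `namespace` and `topic` fields. The names `tp_ns_to_str` and `reproduce_type_error` are hypothetical and are not the viewer's actual code.

```python
def tp_ns_to_str(v):
    """Format a topic key as 'namespace/topic'.

    Hypothetical helper: accepts either a dict with 'namespace' and 'topic'
    keys or an already-formatted string, instead of assuming a dict.
    """
    if isinstance(v, str):
        # Assumption: a plain string is already a usable topic name.
        return v
    return f"{v['namespace']}/{v['topic']}"


def reproduce_type_error():
    """Return the message raised when a str is indexed like a dict."""
    v = "kafka/my-topic"  # a str where a dict was expected
    try:
        return f"{v['namespace']}/{v['topic']}"
    except TypeError as exc:
        return str(exc)
```

Whether the real fix belongs in how the viewer decodes the key, or somewhere else, is not established in this thread.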

rpdevmp commented 1 month ago

Many tests failed due to the same infra issue: remote commands kept failing.

RemoteCommandError({'ssh_config': {'host': 'ip-172-31-3-159', 'hostname': '172.31.3.159', 'user': 'root', 'port': 22, 'password': None, 'identityfile': '/home/ubuntu/.ssh/id_rsa'}, 'hostname': 'ip-172-31-3-159', 'ssh_hostname': '172.31.3.159', 'user': 'root', 'externally_routable_ip': '54.184.188.225',

Example Buildkite Job: https://buildkite.com/redpanda/vtools/builds/15928

Going to close the others as duplicates of this issue.

  1. The "insert results into analytics DB" error that can be seen in CI runs is already fixed.
  2. Work is in progress to improve the PandaTriage logic so that it can group issues by root cause and avoid opening a GH issue for each test when the cause is a common infra issue, for example.
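
A rough sketch of what grouping by root cause could look like, assuming each failure is available as a `(test_id, error_text)` pair. The function names and the masking rules are hypothetical and are not taken from PandaTriage. Using the last line of the error as the signature is a simplification.

```python
import re
from collections import defaultdict

# Hypothetical normalisation rules: mask details that vary per node or run.
_VOLATILE = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"0x[0-9a-fA-F]+"), "<addr>"),
    (re.compile(r"ducktape-node-\d+-[a-z-]+"), "<node>"),
    (re.compile(r"ip-\d+(?:-\d+){3}"), "<node>"),
]


def signature(error_text: str) -> str:
    """Reduce an error to its last non-empty line with volatile parts masked."""
    lines = [ln.strip() for ln in error_text.strip().splitlines() if ln.strip()]
    last = lines[-1] if lines else ""
    for pattern, token in _VOLATILE:
        last = pattern.sub(token, last)
    return last


def group_failures(failures):
    """Map each signature to the list of test ids that produced it."""
    groups = defaultdict(list)
    for test_id, error_text in failures:
        groups[signature(error_text)].append(test_id)
    return dict(groups)
```

With a grouping like this, one GH issue could be opened per signature rather than per test, and the affected test ids listed inside it.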

rpdevmp commented 1 month ago

Also, in some tests an additional error is present:

Example Buildkite Job: https://buildkite.com/redpanda/vtools/builds/15928

ClientError('An error occurred (AuthenticationRequired) when calling the ListBuckets operation: Authentication required.')
Traceback (most recent call last):
  File "/home/ubuntu/redpanda/tests/rptest/archival/s3_client.py", line 530, in list_objects
    res = self._list_objects(bucket=bucket,
  File "/home/ubuntu/redpanda/tests/rptest/archival/s3_client.py", line 47, in do_retry
    return fn(*args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/archival/s3_client.py", line 503, in _list_objects
    return client.list_objects_v2(Bucket=bucket,
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/botocore/client.py", line 964, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AuthenticationRequired) when calling the ListObjectsV2 operation: Authentication required.

We need to investigate and fix both of these, and close all other GH issues that were opened due to builds 15926 or 15928.

P.S. Also, going to open a Jira task to improve PandaTriage logging, to make it easier to map a test error to its GH issue and the related Buildkite job.

Example of how this should be logged:

opening issue for test failure StreamVerifierTest.test_simple_produce_consume_txn_with_add_node created issue: https://github.com/redpanda-data/redpanda/issues/21625 Based on CI job: https://buildkite.com/redpanda/vtools/builds/15926
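
The example above could be produced by a small formatter along these lines. This is only a sketch: `format_issue_log` and its parameters are hypothetical names, not part of PandaTriage.

```python
def format_issue_log(test_id: str, issue_url: str, build_url: str, sep: str = " ") -> str:
    """Build one log entry linking a failed test, its GH issue and the CI build."""
    parts = [
        f"opening issue for test failure {test_id}",
        f"created issue: {issue_url}",
        f"Based on CI job: {build_url}",
    ]
    return sep.join(parts)


entry = format_issue_log(
    "StreamVerifierTest.test_simple_produce_consume_txn_with_add_node",
    "https://github.com/redpanda-data/redpanda/issues/21625",
    "https://buildkite.com/redpanda/vtools/builds/15926",
)
```

Passing `sep="\n"` would put each part on its own line if that reads better in the CI output.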

michael-redpanda commented 1 month ago

Moved Jira issue to https://redpandadata.atlassian.net/browse/PESDLC-1717

vbotbuildovich commented 1 month ago

https://buildkite.com/redpanda/vtools/builds/15980

vbotbuildovich commented 1 month ago

https://buildkite.com/redpanda/vtools/builds/16015

vbotbuildovich commented 1 month ago

https://buildkite.com/redpanda/vtools/builds/16016

vbotbuildovich commented 1 month ago

https://buildkite.com/redpanda/vtools/builds/16030

vbotbuildovich commented 1 month ago

https://buildkite.com/redpanda/vtools/builds/16155