Closed andijcr closed 1 month ago
not sure if this should be reported as error
This was already fixed with: https://github.com/redpanda-data/redpanda/pull/15192
at least the last failure mode is different:
<BadLogLines nodes=ip-172-31-13-176(2),ip-172-31-5-190(7),ip-172-31-2-19(29) example="ERROR 2023-12-22 10:09:03,417 [shard 1:au ] archival - [fiber11 kafka/fuzzy-operator-2047-ygyvas/0] - ntp_archiver_service.cc:2832 - Failed to make reupload candidate to match the run, candidate: {source segment offsets: {term:3, base_offset:15, committed_offset:483, dirty_offset:483}, exposed_name: {15-3-v1.log}, starting_offset: 15, file_offset: 0, content_length: 5108, final_offset: 483, final_file_offset: 5108, term: 3, source names: {/var/lib/redpanda/data/kafka/fuzzy-operator-2047-ygyvas/0_6696/15-3-v1.log}}, run: {is_compacted: true, size_bytes: 5224, base_offset: 15, committed_offset: 483, base_timestamp: {timestamp: 1703239361564}, max_timestamp: {timestamp: 1703239430338}, delta_offset: 15, ntp_revision: 64, archiver_term: 4, segment_term: 3, delta_offset_end: 25, sname_format: {v3}, metadata_size_hint: 0}">
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
return self.test_context.function(self.test)
File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 142, in wrapped
redpanda.raise_on_bad_logs(
File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1325, in raise_on_bad_logs
raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=ip-172-31-13-176(2),ip-172-31-5-190(7),ip-172-31-2-19(29) example="ERROR 2023-12-22 10:09:03,417 [shard 1:au ] archival - [fiber11 kafka/fuzzy-operator-2047-ygyvas/0] - ntp_archiver_service.cc:2832 - Failed to make reupload candidate to match the run, candidate: {source segment offsets: {term:3, base_offset:15, committed_offset:483, dirty_offset:483}, exposed_name: {15-3-v1.log}, starting_offset: 15, file_offset: 0, content_length: 5108, final_offset: 483, final_file_offset: 5108, term: 3, source names: {/var/lib/redpanda/data/kafka/fuzzy-operator-2047-ygyvas/0_6696/15-3-v1.log}}, run: {is_compacted: true, size_bytes: 5224, base_offset: 15, committed_offset: 483, base_timestamp: {timestamp: 1703239361564}, max_timestamp: {timestamp: 1703239430338}, delta_offset: 15, ntp_revision: 64, archiver_term: 4, segment_term: 3, delta_offset_end: 25, sname_format: {v3}, metadata_size_hint: 0}">
https://buildkite.com/redpanda/vtools/builds/11651
<BadLogLines nodes=rp-node-9-needlessly-choice-egret(1) example="ERROR 2024-01-30 09:22:56,880 [shard 2:au ] cloud_storage - [fiber76~122|0|29632ms] - remote.cc:1243 - Failed to delete keys: Semaphore broken: self_config_barrier">
this specific instance of failure will be cleaned up here https://github.com/redpanda-data/redpanda/pull/16385
*https://buildkite.com/redpanda/vtools/builds/11668
TimeoutError('Redpanda service rp-node-14-early-subtle-jennet failed to start within 360 sec')
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
return self.test_context.function(self.test)
File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 99, in wrapped
r = f(self, *args, **kwargs)
File "/home/ubuntu/redpanda/tests/rptest/tests/random_node_operations_test.py", line 428, in test_node_operations
executor.execute_operation(op)
File "/home/ubuntu/redpanda/tests/rptest/utils/node_operations.py", line 430, in execute_operation
self.add(operation.node)
File "/home/ubuntu/redpanda/tests/rptest/utils/node_operations.py", line 388, in add
self.redpanda.start_node(node,
File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 2487, in start_node
self.start_service(node, start_rp)
File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 2563, in start_service
start()
File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 2478, in start_rp
wait_until(
File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Redpanda service rp-node-14-early-subtle-jennet failed to start within 360 sec
<BadLogLines nodes=rp-node-11-perfectly-handy-oryx(1) example="ERROR 2024-02-03 09:04:19,088 [shard 1:au ] archival - [fiber1~8|0|19915ms] - purger.cc:168 - Permanent failures encountered while purging partition {kafka/fuzzy-operator-6695-lqmlrr/7}. Giving up ...">
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
return self.test_context.function(self.test)
File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 168, in wrapped
redpanda.raise_on_bad_logs(
File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1413, in raise_on_bad_logs
raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=rp-node-11-perfectly-handy-oryx(1) example="ERROR 2024-02-03 09:04:19,088 [shard 1:au ] archival - [fiber1~8|0|19915ms] - purger.cc:168 - Permanent failures encountered while purging partition {kafka/fuzzy-operator-6695-lqmlrr/7}. Giving up ...">
a look at the log points out to the manifest for the partition not being in cloud storage, so maybe an edge case of a manifest not yet uploaded or a race in deleting it https://github.com/redpanda-data/redpanda/blob/72b7011faf7a1cb29765db2a2b4343667a03cc0a/src/v/archival/purger.cc#L239-L264
https://buildkite.com/redpanda/vtools/builds/16399 https://buildkite.com/redpanda/vtools/builds/16399 https://buildkite.com/redpanda/vtools/builds/16399 https://buildkite.com/redpanda/vtools/builds/16408 https://buildkite.com/redpanda/vtools/builds/16408 https://buildkite.com/redpanda/vtools/builds/16408 https://buildkite.com/redpanda/vtools/builds/16407 https://buildkite.com/redpanda/vtools/builds/16407 https://buildkite.com/redpanda/vtools/builds/16407 https://buildkite.com/redpanda/vtools/builds/16407
https://buildkite.com/redpanda/vtools/builds/10805
This issue is an umbrella for 3 instances of bad log lines:
Can't apply override to kafka start offset
-> Fixed in #15192Failed to make reupload candidate to match the run, candidate: {source segment offsets: {term:3, base_offset:15, committed_offset:483, dirty_offset:483}, exposed_name: {15-3-v1.log}, starting_offset: 15, file_offset: 0, content_length: 5108, final_offset: 483, final_file_offset: 5108, term: 3, source names: {/var/lib/redpanda/data/kafka/fuzzy-operator-2047-ygyvas/0_6696/15-3-v1.log}}, run: {is_compacted: true, size_bytes: 5224, base_offset: 15, committed_offset: 483, base_timestamp: {timestamp: 1703239361564}, max_timestamp: {timestamp: 1703239430338}, delta_offset: 15, ntp_revision: 64, archiver_term: 4, segment_term: 3, delta_offset_end: 25, sname_format: {v3}, metadata_size_hint: 0}
Discussed in https://redpandadata.slack.com/archives/C05S5SBT4CU/p1704303384062129 Builds:ERROR 2023-12-14 11:38:06,999 [shard 0:main] cluster - topic_table.cc:184 - Unexpected record at offset 5399 to drop non-existent lifecycle marker {kafka/fuzzy-operator-5709-agcwue} 3685
Builds:ERROR 2023-12-18 11:13:10,469 [shard 0:au ] archival - [fiber1~3|0|19974ms] - purger.cc:168 - Permanent failures encountered while purging partition {kafka/fuzzy-operator-532-qebtjv/2}. Giving up ...
Builds:JIRA Link: CORE-1601