Open vbotbuildovich opened 7 months ago
I suspect this is caused due to an unexpected interaction between the reset_scrubbing_metadata
we do at the end of each test in redpanda.py
, and the test_scrubber
specific expectation that there should be some scrubbed anomalies after the test run. Perhaps the reset wipes the expected anomalies.
To complete the scrub condition, the admin API response should have the last_completed_at...
field.
This does not appear for partition 0
For this partition, last anomalies processed (IE added to the manifest and will appear in admin API response):
DEBUG 2024-05-22 07:26:44,618 [shard 0:main] cloud_storage - partition_manifest.cc:2825 - [{kafka/topic-nvfrgyvzdk/0}] Anomalies processed: { detected: {missing_partition_manifest: false, missing_spillover_manifests: 0, missing_segments: 0, segment_metadata_anomalies: 1}, last_partition_scrub: {timestamp: 1716362804425}, last_scrubbed_offset: {nullopt} }
reset is done (the admin API response will now be reset, so the test condition will not succeed):
DEBUG 2024-05-22 07:27:13,068 [shard 0:admi] admin_api_server - server.cc:620 - [admin] POST http://docker-rp-7:9644/v1/cloud_storage/reset_scrubbing_metadata/kafka/topic-nvfrgyvzdk/0
last scrub ends. the stm command is not yet applied, IE the admin API command will still not contain the required field:
INFO 2024-05-22 07:27:13,273 [shard 1:au ] archival - [fiber11 kafka/topic-nvfrgyvzdk/0] - scrubber.cc:140 - Scrub which started at {nullopt} finished at {nullopt} with status {full} and detected {missing_partition_manifest: false, missing_spillover_manifests: 0, missing_segments: 0, segment_metadata_anomalies: 1} and used 57 quota
Shortly afterwards we start failing to apply spillover command. This is expected because the anomaly introduced by the test fails a validation:
TRACE 2024-05-22 07:27:13,279 [shard 1:main] raft - [group_id:1, {kafka/topic-nvfrgyvzdk/0}] state_machine_manager.cc:129 - [default][archival_metadata_stm] applying batch with base 3689 and last 3689 offsets
WARN 2024-05-22 07:27:13,279 [shard 1:main] cloud_storage - partition_manifest.cc:1287 - {kafka/topic-nvfrgyvzdk/0} Can't apply spillover manifest because the start offsets are not aligned: 2092 vs 2165, {o=2165-2846 t={timestamp: 1716362751724}-{timestamp: 1716362753073}}
ERROR 2024-05-22 07:27:13,279 [shard 1:main] cluster - ntp: {kafka/topic-nvfrgyvzdk/0} - archival_metadata_stm.cc:1536 - Can't apply spillover_cmd: {is_compacted: false, size_bytes: 10931066, base_offset: 2165, committed_offset: 2846, base_timestamp: {timestamp: 1716362751724}, max_timestamp: {timestamp: 1716362753073}, delta_offset: 34, ntp_revision: 24, archiver_term: 1, segment_term: 1, delta_offset_end: 53, sname_format: {v3}, metadata_size_hint: 3232}
It may be that because applying the STM command at this offset is failing, we cannot proceed up ahead to the command which will apply last scrubbed at field into the partition manifest and the admin API response never gets set up correctly?
Reset to sev/low as this seems highly likely to be a test issue. We might want to not do the scrubber related post processing step for test_scrubber
.
However after checking the command order (still to be done) if the above hypothesis turns out not to be true, then this can be reclassified.
https://buildkite.com/redpanda/redpanda/builds/48135 https://buildkite.com/redpanda/redpanda/builds/48202 https://buildkite.com/redpanda/redpanda/builds/48668 https://buildkite.com/redpanda/redpanda/builds/48736 https://buildkite.com/redpanda/redpanda/builds/48738 https://buildkite.com/redpanda/redpanda/builds/48958 https://buildkite.com/redpanda/redpanda/builds/49403 https://buildkite.com/redpanda/redpanda/builds/49716 https://buildkite.com/redpanda/redpanda/builds/49820 https://buildkite.com/redpanda/redpanda/builds/49832 *https://buildkite.com/redpanda/redpanda/builds/50048
https://buildkite.com/redpanda/redpanda/builds/48135 https://buildkite.com/redpanda/redpanda/builds/48202 https://buildkite.com/redpanda/redpanda/builds/48668 https://buildkite.com/redpanda/redpanda/builds/48736 https://buildkite.com/redpanda/redpanda/builds/48738 https://buildkite.com/redpanda/redpanda/builds/48958 https://buildkite.com/redpanda/redpanda/builds/49403 https://buildkite.com/redpanda/redpanda/builds/49716 https://buildkite.com/redpanda/redpanda/builds/49820 https://buildkite.com/redpanda/redpanda/builds/49832 *https://buildkite.com/redpanda/redpanda/builds/50048
https://buildkite.com/redpanda/redpanda/builds/48135 https://buildkite.com/redpanda/redpanda/builds/48202 https://buildkite.com/redpanda/redpanda/builds/48668 https://buildkite.com/redpanda/redpanda/builds/48736 https://buildkite.com/redpanda/redpanda/builds/48738 https://buildkite.com/redpanda/redpanda/builds/48958 https://buildkite.com/redpanda/redpanda/builds/49403 https://buildkite.com/redpanda/redpanda/builds/49716 https://buildkite.com/redpanda/redpanda/builds/49820 https://buildkite.com/redpanda/redpanda/builds/49832 *https://buildkite.com/redpanda/redpanda/builds/50048
https://buildkite.com/redpanda/redpanda/builds/48135 https://buildkite.com/redpanda/redpanda/builds/48202 https://buildkite.com/redpanda/redpanda/builds/48668 https://buildkite.com/redpanda/redpanda/builds/48736 https://buildkite.com/redpanda/redpanda/builds/48738 https://buildkite.com/redpanda/redpanda/builds/48958 https://buildkite.com/redpanda/redpanda/builds/49403 https://buildkite.com/redpanda/redpanda/builds/49716 https://buildkite.com/redpanda/redpanda/builds/49820 https://buildkite.com/redpanda/redpanda/builds/49832 *https://buildkite.com/redpanda/redpanda/builds/50048
Did a check of the broker logs and the order does seem to match up:
We should skip the post-test scrub for this test. It already exercises the scrubber code base
https://buildkite.com/redpanda/redpanda/builds/44182
JIRA Link: CORE-1739