redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com
9.38k stars 577 forks source link

CI Failure (TimeoutError) in `CloudStorageScrubberTest.test_scrubber` #16332

Open vbotbuildovich opened 7 months ago

vbotbuildovich commented 7 months ago

https://buildkite.com/redpanda/redpanda/builds/44182

Module: rptest.tests.cloud_storage_scrubber_test
Class: CloudStorageScrubberTest
Method: test_scrubber
Arguments: {
    "cloud_storage_type": 2
}
test_id:    CloudStorageScrubberTest.test_scrubber
status:     FAIL
run time:   266.330 seconds

TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 159, in wrapped
    self.redpanda.maybe_do_internal_scrub()
  File "/root/tests/rptest/services/redpanda.py", line 4016, in maybe_do_internal_scrub
    results = self.wait_for_internal_scrub(cloud_partitions)
  File "/root/tests/rptest/services/redpanda.py", line 4203, in wait_for_internal_scrub
    wait_until(all_partitions_scrubbed, timeout_sec=timeout, backoff_sec=5)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

JIRA Link: CORE-1739

vbotbuildovich commented 7 months ago

*https://buildkite.com/redpanda/redpanda/builds/44415

vbotbuildovich commented 5 months ago

*https://buildkite.com/redpanda/redpanda/builds/46221

vbotbuildovich commented 5 months ago

*https://buildkite.com/redpanda/redpanda/builds/47394

vbotbuildovich commented 4 months ago

*https://buildkite.com/redpanda/redpanda/builds/48135

vbotbuildovich commented 4 months ago

*https://buildkite.com/redpanda/redpanda/builds/48202

vbotbuildovich commented 4 months ago

*https://buildkite.com/redpanda/redpanda/builds/48668

vbotbuildovich commented 4 months ago

https://buildkite.com/redpanda/redpanda/builds/48736 https://buildkite.com/redpanda/redpanda/builds/48738

vbotbuildovich commented 3 months ago

*https://buildkite.com/redpanda/redpanda/builds/48958

vbotbuildovich commented 3 months ago

*https://buildkite.com/redpanda/redpanda/builds/49403

abhijat commented 3 months ago

I suspect this is caused due to an unexpected interaction between the reset_scrubbing_metadata we do at the end of each test in redpanda.py, and the test_scrubber specific expectation that there should be some scrubbed anomalies after the test run. Perhaps the reset wipes the expected anomalies.

To complete the scrub condition, the admin API response should have the last_completed_at... field.

This does not appear for partition 0

For this partition, last anomalies processed (IE added to the manifest and will appear in admin API response):

DEBUG 2024-05-22 07:26:44,618 [shard 0:main] cloud_storage - partition_manifest.cc:2825 - [{kafka/topic-nvfrgyvzdk/0}] Anomalies processed: { detected: {missing_partition_manifest: false, missing_spillover_manifests: 0, missing_segments: 0, segment_metadata_anomalies: 1}, last_partition_scrub: {timestamp: 1716362804425}, last_scrubbed_offset: {nullopt} }

reset is done (the admin API response will now be reset, so the test condition will not succeed):

DEBUG 2024-05-22 07:27:13,068 [shard 0:admi] admin_api_server - server.cc:620 - [admin] POST http://docker-rp-7:9644/v1/cloud_storage/reset_scrubbing_metadata/kafka/topic-nvfrgyvzdk/0

last scrub ends. the stm command is not yet applied, IE the admin API command will still not contain the required field:

INFO  2024-05-22 07:27:13,273 [shard 1:au  ] archival - [fiber11 kafka/topic-nvfrgyvzdk/0] - scrubber.cc:140 - Scrub which started at {nullopt} finished at {nullopt} with status {full} and detected {missing_partition_manifest: false, missing_spillover_manifests: 0, missing_segments: 0, segment_metadata_anomalies: 1} and used 57 quota

Shortly afterwards we start failing to apply spillover command. This is expected because the anomaly introduced by the test fails a validation:

TRACE 2024-05-22 07:27:13,279 [shard 1:main] raft - [group_id:1, {kafka/topic-nvfrgyvzdk/0}] state_machine_manager.cc:129 - [default][archival_metadata_stm] applying batch with base 3689 and last 3689 offsets
WARN  2024-05-22 07:27:13,279 [shard 1:main] cloud_storage - partition_manifest.cc:1287 - {kafka/topic-nvfrgyvzdk/0} Can't apply spillover manifest because the start offsets are not aligned: 2092 vs 2165, {o=2165-2846 t={timestamp: 1716362751724}-{timestamp: 1716362753073}}
ERROR 2024-05-22 07:27:13,279 [shard 1:main] cluster - ntp: {kafka/topic-nvfrgyvzdk/0} - archival_metadata_stm.cc:1536 - Can't apply spillover_cmd: {is_compacted: false, size_bytes: 10931066, base_offset: 2165, committed_offset: 2846, base_timestamp: {timestamp: 1716362751724}, max_timestamp: {timestamp: 1716362753073}, delta_offset: 34, ntp_revision: 24, archiver_term: 1, segment_term: 1, delta_offset_end: 53, sname_format: {v3}, metadata_size_hint: 3232}

It may be that because applying the STM command at this offset is failing, we cannot proceed up ahead to the command which will apply last scrubbed at field into the partition manifest and the admin API response never gets set up correctly?

abhijat commented 3 months ago

Reset to sev/low as this seems highly likely to be a test issue. We might want to not do the scrubber related post processing step for test_scrubber.

However after checking the command order (still to be done) if the above hypothesis turns out not to be true, then this can be reclassified.

vbotbuildovich commented 3 months ago

*https://buildkite.com/redpanda/redpanda/builds/49716

vbotbuildovich commented 3 months ago

https://buildkite.com/redpanda/redpanda/builds/49820 https://buildkite.com/redpanda/redpanda/builds/49832

vbotbuildovich commented 2 months ago

*https://buildkite.com/redpanda/redpanda/builds/50048

vbotbuildovich commented 2 months ago

https://buildkite.com/redpanda/redpanda/builds/48135 https://buildkite.com/redpanda/redpanda/builds/48202 https://buildkite.com/redpanda/redpanda/builds/48668 https://buildkite.com/redpanda/redpanda/builds/48736 https://buildkite.com/redpanda/redpanda/builds/48738 https://buildkite.com/redpanda/redpanda/builds/48958 https://buildkite.com/redpanda/redpanda/builds/49403 https://buildkite.com/redpanda/redpanda/builds/49716 https://buildkite.com/redpanda/redpanda/builds/49820 https://buildkite.com/redpanda/redpanda/builds/49832 *https://buildkite.com/redpanda/redpanda/builds/50048

vbotbuildovich commented 2 months ago

https://buildkite.com/redpanda/redpanda/builds/48135 https://buildkite.com/redpanda/redpanda/builds/48202 https://buildkite.com/redpanda/redpanda/builds/48668 https://buildkite.com/redpanda/redpanda/builds/48736 https://buildkite.com/redpanda/redpanda/builds/48738 https://buildkite.com/redpanda/redpanda/builds/48958 https://buildkite.com/redpanda/redpanda/builds/49403 https://buildkite.com/redpanda/redpanda/builds/49716 https://buildkite.com/redpanda/redpanda/builds/49820 https://buildkite.com/redpanda/redpanda/builds/49832 *https://buildkite.com/redpanda/redpanda/builds/50048

vbotbuildovich commented 2 months ago

https://buildkite.com/redpanda/redpanda/builds/48135 https://buildkite.com/redpanda/redpanda/builds/48202 https://buildkite.com/redpanda/redpanda/builds/48668 https://buildkite.com/redpanda/redpanda/builds/48736 https://buildkite.com/redpanda/redpanda/builds/48738 https://buildkite.com/redpanda/redpanda/builds/48958 https://buildkite.com/redpanda/redpanda/builds/49403 https://buildkite.com/redpanda/redpanda/builds/49716 https://buildkite.com/redpanda/redpanda/builds/49820 https://buildkite.com/redpanda/redpanda/builds/49832 *https://buildkite.com/redpanda/redpanda/builds/50048

vbotbuildovich commented 2 months ago

https://buildkite.com/redpanda/redpanda/builds/48135 https://buildkite.com/redpanda/redpanda/builds/48202 https://buildkite.com/redpanda/redpanda/builds/48668 https://buildkite.com/redpanda/redpanda/builds/48736 https://buildkite.com/redpanda/redpanda/builds/48738 https://buildkite.com/redpanda/redpanda/builds/48958 https://buildkite.com/redpanda/redpanda/builds/49403 https://buildkite.com/redpanda/redpanda/builds/49716 https://buildkite.com/redpanda/redpanda/builds/49820 https://buildkite.com/redpanda/redpanda/builds/49832 *https://buildkite.com/redpanda/redpanda/builds/50048

vbotbuildovich commented 2 months ago

*https://buildkite.com/redpanda/redpanda/builds/50417

vbotbuildovich commented 2 months ago

https://buildkite.com/redpanda/redpanda/builds/50934 https://buildkite.com/redpanda/redpanda/builds/51036

abhijat commented 1 month ago

Did a check of the broker logs and the order does seem to match up:

We should skip the post-test scrub for this test. It already exercises the scrubber code base

vbotbuildovich commented 1 month ago

*https://buildkite.com/redpanda/redpanda/builds/52013