redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com
9.57k stars 582 forks source link

CI Failure (TimeoutError: Workload did not complete within 120s: Consumer consumed only 1591 out of 15360 messages) in `CloudStorageTimingStressTest.test_cloud_storage_with_partition_moves` #15312

Closed bharathv closed 3 weeks ago

bharathv commented 10 months ago

https://buildkite.com/redpanda/redpanda/builds/42019 https://buildkite.com/redpanda/redpanda/builds/42187

Module: rptest.tests.cloud_storage_timing_stress_test
Class: CloudStorageTimingStressTest
Method: test_cloud_storage_with_partition_moves
Arguments: {
    "cleanup_policy": "compact,delete"
}
test_id:    CloudStorageTimingStressTest.test_cloud_storage_with_partition_moves
status:     FAIL
run time:   143.147 seconds

TimeoutError('Workload did not complete within 120s: Consumer consumed only 1591 out of 15360 messages')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 82, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/cloud_storage_timing_stress_test.py", line 539, in test_cloud_storage_with_partition_moves
    while not self.is_complete():
  File "/root/tests/rptest/tests/cloud_storage_timing_stress_test.py", line 370, in is_complete
    raise TimeoutError(
TimeoutError: Workload did not complete within 120s: Consumer consumed only 1591 out of 15360 messages

JIRA Link: CORE-1614

bharathv commented 10 months ago

This was originally triaged as a consumer crash in https://github.com/redpanda-data/redpanda/issues/12232. The crash was then fixed but the issue surfaced again. I don't see a crash in kgo this time.

andijcr commented 10 months ago

on the kgo consumer side i see reads going well up until 07:09:03, then this starts to happen

time="2023-11-30T07:09:03Z" level=debug msg="Read OK (000000.000000000000001184) on p=0 at o=1184"
time="2023-11-30T07:09:03Z" level=debug msg="Calling PollFetches (lwm=[1185] status {\"name\":\"\",\"valid_reads\":1591,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0,\"max_offsets_consumed\":{\"0\":1302}})"
time="2023-11-30T07:09:04Z" name=KgoVerifierSeqConsumer-0-281472112737776[DEBUG] opening connection to broker; addr: docker-rp-6:9092, broker: 1
time="2023-11-30T07:09:04Z" name=KgoVerifierSeqConsumer-0-281472112737776[DEBUG] connection opened to broker; addr: docker-rp-6:9092, broker: 1
time="2023-11-30T07:09:04Z" name=KgoVerifierSeqConsumer-0-281472112737776[DEBUG] connection initialized successfully; addr: docker-rp-6:9092, broker: 1
time="2023-11-30T07:09:04Z" name=KgoVerifierSeqConsumer-0-281472112737776[DEBUG] wrote Fetch v11; broker: 1, bytes_written: 133, write_wait: 606.416µs, time_to_write: 34.132µs, err: <nil>
time="2023-11-30T07:09:04Z" name=KgoVerifierSeqConsumer-0-281472112737776[DEBUG] read Fetch v11; broker: 1, bytes_read: 0, read_wait: 73.188µs, time_to_read: 3.758924ms, err: EOF
time="2023-11-30T07:09:04Z" name=KgoVerifierSeqConsumer-0-281472112737776[WARN] read from broker errored, killing connection after 0 successful responses (is SASL missing?); addr: docker-rp-6:9092, broker: 1, err: EOF

at the same time, on docker-rp-6 this starts to appear

TRACE 2023-11-30 07:09:03,830 [shard 1:fetc] http - /92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1 - client.cc:296 - chunk received, chunk length 850
DEBUG 2023-11-30 07:09:03,830 [shard 1:fetc] http - iobuf_body.cc:82 - reader - finish called
WARN  2023-11-30 07:09:03,830 [shard 1:fetc] s3 - s3_client.cc:562 - S3 GET request failed: Not Found [Accept-Ranges: bytes];[Content-Length: 445];[Content-Security-Policy: block-all-mixed-content];[Content-Type: application/xml];[Server: MinIO];[Strict-Transport-Security: max-age=31536000; includeSubDomains];[Vary: Origin];[Vary: Accept-Encoding];[X-Amz-Request-Id: 179C54EECB5DDF32];[X-Content-Type-Options: nosniff];[X-Xss-Protection: 1; mode=block];[Date: Thu, 30 Nov 2023 07:09:03 GMT];
DEBUG 2023-11-30 07:09:03,830 [shard 1:fetc] s3 - s3_client.cc:427 - NoSuchKey response received {"92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1"}
WARN  2023-11-30 07:09:03,830 [shard 1:fetc] cloud_storage - [fiber3~8~5~0|1|59998ms] - remote.cc:801 - Downloading segment from {panda-bucket-4e5927ee-8f4f-11ee-a158-0242ac101024}, {key_not_found}, segment at {"92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1"} not available
DEBUG 2023-11-30 07:09:03,830 [shard 1:fetc] client_pool - client_pool.cc:463 - releasing a client, pool size: 19, capacity: 20
WARN  2023-11-30 07:09:03,830 [shard 1:fetc] cloud_storage - [fiber49 92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1] - segment_chunk_data_source.cc:88 - failed to hydrate chunk starting at 1052608, error: cloud_storage::download_exception (NotFound)
WARN  2023-11-30 07:09:03,830 [shard 1:fetc] cloud_storage - [fiber86 kafka/test-topic/0] - remote_partition.cc:464 - exception thrown while reading from remote_partition: NotFound
DEBUG 2023-11-30 07:09:03,830 [shard 1:fetc] cloud_storage - [fiber3~8~0 kafka/test-topic/0] - remote_segment.cc:1412 - [{{172.16.16.32:53242}}] remote_segment_batch_reader::stop
DEBUG 2023-11-30 07:09:03,830 [shard 1:fetc] cloud_storage - [fiber3~8~0 kafka/test-topic/0] - remote_segment.cc:1418 - [{{172.16.16.32:53242}}] remote_segment_batch_reader::stop - parser-close
DEBUG 2023-11-30 07:09:03,830 [shard 1:fetc] cloud_storage - [fiber49 92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1] - segment_chunk_data_source.cc:48 - chunk data source destroyed
TRACE 2023-11-30 07:09:03,830 [shard 1:fetc] cloud_storage - [fiber86 kafka/test-topic/0] - remote_partition.cc:257 - Destructing reader {kafka/test-topic/0}
WARN  2023-11-30 07:09:03,831 [shard 0:main] kafka - connection_context.cc:498 - Error processing request: cloud_storage::download_exception (NotFound)
DEBUG 2023-11-30 07:09:03,831 [shard 0:main] kafka - connection_context.h:148 - Connection input_shutdown; aborting operations for 172.16.16.32:53242
TRACE 2023-11-30 07:09:03,831 [shard 0:main] kafka - connection_context.h:160 - stopping connection context for 172.16.16.32:53242
TRACE 2023-11-30 07:09:03,831 [shard 0:main] kafka - connection_context.h:168 - stopped connection context for 172.16.16.32:53242
TRACE 2023-11-30 07:09:03,831 [shard 0:main] kafka - server.cc:178 - shutting down connection 172.16.16.32:53242

so redpanda can't find this segment 92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1

cloud_storage_diagnostic.zip lists the segment under 70000000_meta_kafka_test-topic_0_27_manifest.bin.0.1476.0.1442.1701328135741.1701328140375

    "1181-1-v1.log": {
      "base_offset": 1181,
      "committed_offset": 1314,
      "is_compacted": true,
      "size_bytes": 329529,
      "archiver_term": 1,
      "delta_offset": 22,
      "base_timestamp": 1701328139521,
      "max_timestamp": 1701328139769,
      "ntp_revision": 27,
      "sname_format": 3,
      "segment_term": 1,
      "delta_offset_end": 28
    },

in fact we see this

49394:INFO  2023-11-30 07:09:01,938 [shard 1:au  ] archival - [fiber4 kafka/test-topic/0] - ntp_archiver_service.cc:2695 - Deleting segment from cloud storage: {"92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1"}

and the manifest being uploaded just after

DEBUG 2023-11-30 07:09:01,941 [shard 1:au  ] cloud_storage - [fiber4~51~0|1|10000ms] - remote.cc:374 - Uploading manifest {"70000000/meta/kafka/test-topic/0_27/manifest.bin"} to the panda-bucket-4e5927ee-8f4f-11ee-a158-0242ac101024

it seems like there is a spillover going out of sync.

andijcr commented 10 months ago

some thoughts:

the consumer tries to read kafka offset 1185 at 07:09:03 and it's contained in the remote segment "92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1" (the kafka offset is between the deltas of this segment)

TRACE 2023-11-30 07:09:03,828 [shard 0:fetc] kafka - handler.h:69 - [client_id: {KgoVerifierSeqConsumer-0-281472112737776}] handling fetch v11 request {replica_id=-1 max_wait_ms=5000 min_bytes=1 max_bytes=838860 isolation_level=read_uncommitted session_id=2 session_epoch=27 topics={{name={test-topic} fetch_partitions={{partition_index=0 current_leader_epoch=1 fetch_offset=1185 log_start_offset=-1 max_bytes=838860}}}} forgotten={} rack_id={}}

but that segment was deleted from cloud at 07:09:01

49394:INFO  2023-11-30 07:09:01,938 [shard 1:au  ] archival - [fiber4 kafka/test-topic/0] - ntp_archiver_service.cc:2695 - Deleting segment from cloud storage: {"92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1"}

at 07:09:00 the cloud manifest start offset is 0 and there are no spillover manifests at 07:09:10 the cloud manifest start offset is 1477 and there is a spillover manifest that covers the offset 1181.

the log line "Deleting segment..." is printed from ntp_archiver_stm::garbage_collect(). that in turn can be called only when there are no spillover manifests (and there is a mutex to protect this). the segments that end up in the to_remove list are

  1. replaced segments (no visibility here)
  2. segments with committed offset < than _start_offset

but

  1. the offset range should be retained in cloud storage by some segment, to be able to delete this
  2. _start_offset was 0 few seconds before and few seconds after, so garbage_collects should not remove the select the segment for removal

the garbage collection code is protected by a mutex

the alternative is that replica movement is racing with this mechanism, but the test is not moving the replica that is serving the request

the linked pr has some precondition checks around garbage_collect, the idea being that is should assert in case a race condition

andijcr commented 10 months ago

issue is https://github.com/redpanda-data/redpanda/issues/15422 with fix in https://github.com/redpanda-data/redpanda/pull/15423. The smoking gun is that the remote segment is trying to consume 1181-1314-4211021-1-v1.log.1 but the spillover manifest has a different object for that range

 "1181-1-v1.log": {
      "base_offset": 1181,
      "committed_offset": 1314,
      "is_compacted": true,
      "size_bytes": 329529,
      "archiver_term": 1,
      "delta_offset": 22,
      "base_timestamp": 1701328139521,
      "max_timestamp": 1701328139769,
      "ntp_revision": 27,
      "sname_format": 3,
      "segment_term": 1,
      "delta_offset_end": 28
    },

, size_bytes is different (4211021 vs 329529) but remote_segment does not account for that and keeps a stale state

vbotbuildovich commented 9 months ago

https://buildkite.com/redpanda/redpanda/builds/43425 https://buildkite.com/redpanda/redpanda/builds/43426 https://buildkite.com/redpanda/redpanda/builds/43534 https://buildkite.com/redpanda/redpanda/builds/43584

vbotbuildovich commented 9 months ago

https://buildkite.com/redpanda/redpanda/builds/43627 https://buildkite.com/redpanda/redpanda/builds/43632 *https://buildkite.com/redpanda/vtools/builds/11400

vbotbuildovich commented 9 months ago

https://buildkite.com/redpanda/redpanda/builds/43613 https://buildkite.com/redpanda/redpanda/builds/43633

vbotbuildovich commented 9 months ago

https://buildkite.com/redpanda/redpanda/builds/43662 https://buildkite.com/redpanda/redpanda/builds/43664 *https://buildkite.com/redpanda/redpanda/builds/43683

vbotbuildovich commented 9 months ago

https://buildkite.com/redpanda/redpanda/builds/43873 https://buildkite.com/redpanda/redpanda/builds/43874 https://buildkite.com/redpanda/redpanda/builds/43932 https://buildkite.com/redpanda/redpanda/builds/43932 *https://buildkite.com/redpanda/redpanda/builds/43973

vbotbuildovich commented 9 months ago

*https://buildkite.com/redpanda/redpanda/builds/44032

vbotbuildovich commented 9 months ago

*https://buildkite.com/redpanda/redpanda/builds/44033

vbotbuildovich commented 9 months ago

*https://buildkite.com/redpanda/redpanda/builds/44052

vbotbuildovich commented 8 months ago

https://buildkite.com/redpanda/redpanda/builds/44066 https://buildkite.com/redpanda/redpanda/builds/44069 https://buildkite.com/redpanda/redpanda/builds/44092 https://buildkite.com/redpanda/redpanda/builds/44148 https://buildkite.com/redpanda/redpanda/builds/44163 https://buildkite.com/redpanda/redpanda/builds/44173 https://buildkite.com/redpanda/redpanda/builds/44182 https://buildkite.com/redpanda/redpanda/builds/44195 https://buildkite.com/redpanda/redpanda/builds/44203 https://buildkite.com/redpanda/redpanda/builds/44224 https://buildkite.com/redpanda/redpanda/builds/44274 https://buildkite.com/redpanda/redpanda/builds/44275 https://buildkite.com/redpanda/redpanda/builds/44276 https://buildkite.com/redpanda/redpanda/builds/44278 https://buildkite.com/redpanda/redpanda/builds/44278 https://buildkite.com/redpanda/redpanda/builds/44276 https://buildkite.com/redpanda/redpanda/builds/44292 https://buildkite.com/redpanda/redpanda/builds/44338

vbotbuildovich commented 8 months ago

*https://buildkite.com/redpanda/vtools/builds/11625

vbotbuildovich commented 8 months ago

https://buildkite.com/redpanda/redpanda/builds/44458 https://buildkite.com/redpanda/redpanda/builds/44465

vbotbuildovich commented 8 months ago

*https://buildkite.com/redpanda/redpanda/builds/44508

vbotbuildovich commented 8 months ago

https://buildkite.com/redpanda/redpanda/builds/44630 https://buildkite.com/redpanda/redpanda/builds/44632

vbotbuildovich commented 8 months ago

*https://buildkite.com/redpanda/redpanda/builds/44666

vbotbuildovich commented 8 months ago

*https://buildkite.com/redpanda/redpanda/builds/44688

vbotbuildovich commented 8 months ago

https://buildkite.com/redpanda/redpanda/builds/44725 https://buildkite.com/redpanda/redpanda/builds/44736 *https://buildkite.com/redpanda/redpanda/builds/44740

vbotbuildovich commented 8 months ago

https://buildkite.com/redpanda/redpanda/builds/44793 https://buildkite.com/redpanda/redpanda/builds/44793 https://buildkite.com/redpanda/redpanda/builds/44797 https://buildkite.com/redpanda/redpanda/builds/44824

vbotbuildovich commented 8 months ago

https://buildkite.com/redpanda/redpanda/builds/44910 https://buildkite.com/redpanda/redpanda/builds/44933 https://buildkite.com/redpanda/vtools/builds/11782 https://buildkite.com/redpanda/vtools/builds/11781 https://buildkite.com/redpanda/vtools/builds/11803 https://buildkite.com/redpanda/vtools/builds/11807 https://buildkite.com/redpanda/redpanda/builds/44993 https://buildkite.com/redpanda/vtools/builds/11824 https://buildkite.com/redpanda/redpanda/builds/45012 https://buildkite.com/redpanda/vtools/builds/11840

vbotbuildovich commented 8 months ago

*https://buildkite.com/redpanda/vtools/builds/11858

vbotbuildovich commented 8 months ago

*https://buildkite.com/redpanda/redpanda/builds/45058

vbotbuildovich commented 8 months ago

*https://buildkite.com/redpanda/redpanda/builds/45069

vbotbuildovich commented 8 months ago

https://buildkite.com/redpanda/redpanda/builds/45097 https://buildkite.com/redpanda/vtools/builds/11901

vbotbuildovich commented 8 months ago

*https://buildkite.com/redpanda/vtools/builds/11911

vbotbuildovich commented 8 months ago

*https://buildkite.com/redpanda/redpanda/builds/45145

vbotbuildovich commented 7 months ago

https://buildkite.com/redpanda/redpanda/builds/45182 https://buildkite.com/redpanda/vtools/builds/11952

vbotbuildovich commented 7 months ago

*https://buildkite.com/redpanda/redpanda/builds/45285

vbotbuildovich commented 7 months ago

https://buildkite.com/redpanda/vtools/builds/11982 https://buildkite.com/redpanda/vtools/builds/11999

vbotbuildovich commented 7 months ago

*https://buildkite.com/redpanda/redpanda/builds/45394

vbotbuildovich commented 7 months ago

https://buildkite.com/redpanda/redpanda/builds/45483 https://buildkite.com/redpanda/vtools/builds/12035

vbotbuildovich commented 7 months ago

https://buildkite.com/redpanda/redpanda/builds/45587 https://buildkite.com/redpanda/vtools/builds/12069

vbotbuildovich commented 7 months ago

*https://buildkite.com/redpanda/vtools/builds/12077

vbotbuildovich commented 7 months ago

*https://buildkite.com/redpanda/redpanda/builds/45776#018e179c-06aa-4e0c-a82d-11df995df780

vbotbuildovich commented 6 months ago

*https://buildkite.com/redpanda/redpanda/builds/47256

vbotbuildovich commented 6 months ago

*https://buildkite.com/redpanda/redpanda/builds/47256

vbotbuildovich commented 6 months ago

https://buildkite.com/redpanda/redpanda/builds/47269 https://buildkite.com/redpanda/redpanda/builds/47268 *https://buildkite.com/redpanda/redpanda/builds/47291

Lazin commented 6 months ago

The failure is different now

test_id:    rptest.tests.cloud_storage_timing_stress_test.CloudStorageTimingStressTest.test_cloud_storage_with_partition_moves.cleanup_policy=compact.delete
status:     FAIL
run time:   1 minute 15.111 seconds
    RuntimeError('Failed checks: 0; Incomplete checks: 2')
Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 276, in run_test
    return self.test_context.function(self.test)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 535, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 104, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/cloud_storage_timing_stress_test.py", line 550, in test_cloud_storage_with_partition_moves
    self.do_checks()
  File "/root/tests/rptest/tests/cloud_storage_timing_stress_test.py", line 486, in do_checks
    raise RuntimeError(
RuntimeError: Failed checks: 0; Incomplete checks: 2
mmaslankaprv commented 6 months ago

I've looked into it and it seems that there is some kind of manifest inconsistency:

[ERROR - 2024-04-03 14:57:00,074 - rp_storage_tool - _decode - lineno:29]: b'thread \'main\' panicked at \'called `Result::unwrap()` on an `Err` value: SyntaxError("base_timestamp not present in PartitionManifestSegment { base_offset: 8256, committed_offset: 9978, is_compacted: false, size_bytes: 22373918, archiver_term: 1, delta_offset: Some(265), base_timestamp: Some(1712156190221), max_timestamp: None, ntp_revision: Some(28), sname_format: Some(3), segment_term: Some(1), delta_offset_end: Some(324) }")\', src/main.rs:1065:75\nnote: run with `RUST_BACKTRACE=1` environment variable to display a backtrace\n'
Lazin commented 6 months ago

I'm working on it at the moment. Looks like the rp-storage-tool fails in some cases. Redpanda state looks OK though.

vbotbuildovich commented 6 months ago

https://buildkite.com/redpanda/redpanda/builds/47556 https://buildkite.com/redpanda/redpanda/builds/47558 https://buildkite.com/redpanda/redpanda/builds/47569 https://buildkite.com/redpanda/redpanda/builds/47592 *https://buildkite.com/redpanda/redpanda/builds/47599

vbotbuildovich commented 6 months ago

*https://buildkite.com/redpanda/vtools/builds/12843

vbotbuildovich commented 6 months ago

*https://buildkite.com/redpanda/vtools/builds/12873

vbotbuildovich commented 6 months ago

*https://buildkite.com/redpanda/redpanda/builds/47789

vbotbuildovich commented 6 months ago

https://buildkite.com/redpanda/redpanda/builds/47832 https://buildkite.com/redpanda/vtools/builds/12890 https://buildkite.com/redpanda/redpanda/builds/47868 https://buildkite.com/redpanda/vtools/builds/12891

vbotbuildovich commented 6 months ago

https://buildkite.com/redpanda/redpanda/builds/47823 https://buildkite.com/redpanda/vtools/builds/12890

vbotbuildovich commented 5 months ago

https://buildkite.com/redpanda/redpanda/builds/47916 https://buildkite.com/redpanda/redpanda/builds/47919 https://buildkite.com/redpanda/vtools/builds/12943 https://buildkite.com/redpanda/vtools/builds/12994 https://buildkite.com/redpanda/vtools/builds/12994 https://buildkite.com/redpanda/redpanda/builds/48032 https://buildkite.com/redpanda/redpanda/builds/48033 https://buildkite.com/redpanda/vtools/builds/13054 https://buildkite.com/redpanda/redpanda/builds/48074 https://buildkite.com/redpanda/redpanda/builds/48087 https://buildkite.com/redpanda/vtools/builds/13072 https://buildkite.com/redpanda/vtools/builds/13079 *https://buildkite.com/redpanda/redpanda/builds/48109

vbotbuildovich commented 5 months ago

*https://buildkite.com/redpanda/vtools/builds/13100