Closed by bharathv 3 weeks ago
This was originally triaged as a consumer crash in https://github.com/redpanda-data/redpanda/issues/12232. The crash was then fixed but the issue surfaced again. I don't see a crash in kgo this time.
On the kgo consumer side I see reads going well up until 07:09:03, then this starts to happen:
time="2023-11-30T07:09:03Z" level=debug msg="Read OK (000000.000000000000001184) on p=0 at o=1184"
time="2023-11-30T07:09:03Z" level=debug msg="Calling PollFetches (lwm=[1185] status {\"name\":\"\",\"valid_reads\":1591,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0,\"max_offsets_consumed\":{\"0\":1302}})"
time="2023-11-30T07:09:04Z" name=KgoVerifierSeqConsumer-0-281472112737776[DEBUG] opening connection to broker; addr: docker-rp-6:9092, broker: 1
time="2023-11-30T07:09:04Z" name=KgoVerifierSeqConsumer-0-281472112737776[DEBUG] connection opened to broker; addr: docker-rp-6:9092, broker: 1
time="2023-11-30T07:09:04Z" name=KgoVerifierSeqConsumer-0-281472112737776[DEBUG] connection initialized successfully; addr: docker-rp-6:9092, broker: 1
time="2023-11-30T07:09:04Z" name=KgoVerifierSeqConsumer-0-281472112737776[DEBUG] wrote Fetch v11; broker: 1, bytes_written: 133, write_wait: 606.416µs, time_to_write: 34.132µs, err: <nil>
time="2023-11-30T07:09:04Z" name=KgoVerifierSeqConsumer-0-281472112737776[DEBUG] read Fetch v11; broker: 1, bytes_read: 0, read_wait: 73.188µs, time_to_read: 3.758924ms, err: EOF
time="2023-11-30T07:09:04Z" name=KgoVerifierSeqConsumer-0-281472112737776[WARN] read from broker errored, killing connection after 0 successful responses (is SASL missing?); addr: docker-rp-6:9092, broker: 1, err: EOF
At the same time, this starts to appear on docker-rp-6:
TRACE 2023-11-30 07:09:03,830 [shard 1:fetc] http - /92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1 - client.cc:296 - chunk received, chunk length 850
DEBUG 2023-11-30 07:09:03,830 [shard 1:fetc] http - iobuf_body.cc:82 - reader - finish called
WARN 2023-11-30 07:09:03,830 [shard 1:fetc] s3 - s3_client.cc:562 - S3 GET request failed: Not Found [Accept-Ranges: bytes];[Content-Length: 445];[Content-Security-Policy: block-all-mixed-content];[Content-Type: application/xml];[Server: MinIO];[Strict-Transport-Security: max-age=31536000; includeSubDomains];[Vary: Origin];[Vary: Accept-Encoding];[X-Amz-Request-Id: 179C54EECB5DDF32];[X-Content-Type-Options: nosniff];[X-Xss-Protection: 1; mode=block];[Date: Thu, 30 Nov 2023 07:09:03 GMT];
DEBUG 2023-11-30 07:09:03,830 [shard 1:fetc] s3 - s3_client.cc:427 - NoSuchKey response received {"92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1"}
WARN 2023-11-30 07:09:03,830 [shard 1:fetc] cloud_storage - [fiber3~8~5~0|1|59998ms] - remote.cc:801 - Downloading segment from {panda-bucket-4e5927ee-8f4f-11ee-a158-0242ac101024}, {key_not_found}, segment at {"92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1"} not available
DEBUG 2023-11-30 07:09:03,830 [shard 1:fetc] client_pool - client_pool.cc:463 - releasing a client, pool size: 19, capacity: 20
WARN 2023-11-30 07:09:03,830 [shard 1:fetc] cloud_storage - [fiber49 92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1] - segment_chunk_data_source.cc:88 - failed to hydrate chunk starting at 1052608, error: cloud_storage::download_exception (NotFound)
WARN 2023-11-30 07:09:03,830 [shard 1:fetc] cloud_storage - [fiber86 kafka/test-topic/0] - remote_partition.cc:464 - exception thrown while reading from remote_partition: NotFound
DEBUG 2023-11-30 07:09:03,830 [shard 1:fetc] cloud_storage - [fiber3~8~0 kafka/test-topic/0] - remote_segment.cc:1412 - [{{172.16.16.32:53242}}] remote_segment_batch_reader::stop
DEBUG 2023-11-30 07:09:03,830 [shard 1:fetc] cloud_storage - [fiber3~8~0 kafka/test-topic/0] - remote_segment.cc:1418 - [{{172.16.16.32:53242}}] remote_segment_batch_reader::stop - parser-close
DEBUG 2023-11-30 07:09:03,830 [shard 1:fetc] cloud_storage - [fiber49 92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1] - segment_chunk_data_source.cc:48 - chunk data source destroyed
TRACE 2023-11-30 07:09:03,830 [shard 1:fetc] cloud_storage - [fiber86 kafka/test-topic/0] - remote_partition.cc:257 - Destructing reader {kafka/test-topic/0}
WARN 2023-11-30 07:09:03,831 [shard 0:main] kafka - connection_context.cc:498 - Error processing request: cloud_storage::download_exception (NotFound)
DEBUG 2023-11-30 07:09:03,831 [shard 0:main] kafka - connection_context.h:148 - Connection input_shutdown; aborting operations for 172.16.16.32:53242
TRACE 2023-11-30 07:09:03,831 [shard 0:main] kafka - connection_context.h:160 - stopping connection context for 172.16.16.32:53242
TRACE 2023-11-30 07:09:03,831 [shard 0:main] kafka - connection_context.h:168 - stopped connection context for 172.16.16.32:53242
TRACE 2023-11-30 07:09:03,831 [shard 0:main] kafka - server.cc:178 - shutting down connection 172.16.16.32:53242
So Redpanda can't find this segment: 92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1.
Yet cloud_storage_diagnostic.zip lists the segment under 70000000_meta_kafka_test-topic_0_27_manifest.bin.0.1476.0.1442.1701328135741.1701328140375:
"1181-1-v1.log": {
"base_offset": 1181,
"committed_offset": 1314,
"is_compacted": true,
"size_bytes": 329529,
"archiver_term": 1,
"delta_offset": 22,
"base_timestamp": 1701328139521,
"max_timestamp": 1701328139769,
"ntp_revision": 27,
"sname_format": 3,
"segment_term": 1,
"delta_offset_end": 28
},
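For reference, a minimal sketch of how the delta fields in this entry map the segment's raw (redpanda) offsets to kafka offsets, assuming kafka_offset = raw_offset − delta:

```python
# Minimal sketch of the delta-offset arithmetic, assuming
# kafka_offset = raw_offset - delta (values from the manifest entry above).
base_offset, committed_offset = 1181, 1314
delta_offset, delta_offset_end = 22, 28

kafka_first = base_offset - delta_offset            # 1181 - 22 = 1159
kafka_last = committed_offset - delta_offset_end    # 1314 - 28 = 1286
print(f"segment serves kafka offsets [{kafka_first}, {kafka_last}]")
```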
In fact, we see the segment being deleted:
49394:INFO 2023-11-30 07:09:01,938 [shard 1:au ] archival - [fiber4 kafka/test-topic/0] - ntp_archiver_service.cc:2695 - Deleting segment from cloud storage: {"92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1"}
and the manifest being uploaded just after:
DEBUG 2023-11-30 07:09:01,941 [shard 1:au ] cloud_storage - [fiber4~51~0|1|10000ms] - remote.cc:374 - Uploading manifest {"70000000/meta/kafka/test-topic/0_27/manifest.bin"} to the panda-bucket-4e5927ee-8f4f-11ee-a158-0242ac101024
It seems like there is a spillover going out of sync.
Some thoughts:
The consumer tries to read kafka offset 1185 at 07:09:03, and it's contained in the remote segment "92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1" (the kafka offset is between the deltas of this segment: per the delta arithmetic sketched above, the segment covers kafka offsets [1159, 1286], which includes 1185):
TRACE 2023-11-30 07:09:03,828 [shard 0:fetc] kafka - handler.h:69 - [client_id: {KgoVerifierSeqConsumer-0-281472112737776}] handling fetch v11 request {replica_id=-1 max_wait_ms=5000 min_bytes=1 max_bytes=838860 isolation_level=read_uncommitted session_id=2 session_epoch=27 topics={{name={test-topic} fetch_partitions={{partition_index=0 current_leader_epoch=1 fetch_offset=1185 log_start_offset=-1 max_bytes=838860}}}} forgotten={} rack_id={}}
But that segment was deleted from cloud storage at 07:09:01:
49394:INFO 2023-11-30 07:09:01,938 [shard 1:au ] archival - [fiber4 kafka/test-topic/0] - ntp_archiver_service.cc:2695 - Deleting segment from cloud storage: {"92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1"}
At 07:09:00 the cloud manifest start offset is 0 and there are no spillover manifests. At 07:09:10 the cloud manifest start offset is 1477 and there is a spillover manifest that covers offset 1181.
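A minimal sketch of the routing implied here (a hypothetical simplification, not the actual remote_partition code): reads at or above the STM manifest's start offset are served from it, and older reads fall through to whichever spillover manifest covers the offset. A fetch landing in the window between the two states above finds neither:

```python
# Hypothetical simplification of offset routing between the STM manifest
# and spillover manifests; not the actual remote_partition code.
def route(offset: int, stm_start: int, spillovers: list[tuple[int, int]]) -> str:
    if offset >= stm_start:
        return "STM manifest"
    for base, last in spillovers:
        if base <= offset <= last:
            return f"spillover manifest [{base}, {last}]"
    return "not found"

print(route(1185, 0, []))              # 07:09:00 state -> STM manifest
print(route(1185, 1477, [(0, 1476)]))  # 07:09:10 state -> spillover [0, 1476]
print(route(1185, 1477, []))           # transitional state -> not found
```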
The log line "Deleting segment..." is printed from ntp_archiver_stm::garbage_collect(), which in turn can be called only when there are no spillover manifests (and there is a mutex to protect this); the segments that end up in the to_remove list are…
But the garbage collection code is protected by a mutex.
The alternative is that replica movement is racing with this mechanism, but the test is not moving the replica that is serving the request.
The linked PR adds some precondition checks around garbage_collect, the idea being that it should assert in case of a race condition.
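A minimal sketch of that precondition idea (hypothetical; the names and structure are illustrative, not the actual ntp_archiver code):

```python
# Hypothetical sketch of a precondition check around garbage collection:
# refuse to delete anything while spillover manifests exist, since their
# segments may still be referenced by readers.
import threading

class Archiver:
    def __init__(self) -> None:
        self._gc_mutex = threading.Lock()   # stands in for the archiver mutex
        self.spillover_manifests: list = []
        self.to_remove: list[str] = []

    def garbage_collect(self) -> None:
        with self._gc_mutex:
            # The kind of precondition the linked PR is about: GC should
            # assert if reached while spillover manifests are present.
            assert not self.spillover_manifests, \
                "garbage_collect ran with spillover manifests present"
            for key in self.to_remove:
                self._delete_from_cloud(key)

    def _delete_from_cloud(self, key: str) -> None:
        print(f"Deleting segment from cloud storage: {key!r}")  # stub
```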
The issue is https://github.com/redpanda-data/redpanda/issues/15422, with the fix in https://github.com/redpanda-data/redpanda/pull/15423.
The smoking gun is that the remote segment is trying to consume 1181-1314-4211021-1-v1.log.1,
but the spillover manifest has a different object for that range:
"1181-1-v1.log": {
"base_offset": 1181,
"committed_offset": 1314,
"is_compacted": true,
"size_bytes": 329529,
"archiver_term": 1,
"delta_offset": 22,
"base_timestamp": 1701328139521,
"max_timestamp": 1701328139769,
"ntp_revision": 27,
"sname_format": 3,
"segment_term": 1,
"delta_offset_end": 28
},
The size_bytes is different (4211021 vs 329529), but remote_segment does not account for that and keeps stale state.
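A minimal sketch of the mismatch, assuming the object key embeds size_bytes as in the v3-style name format seen in the logs (names and the key-derivation helper here are illustrative):

```python
# Sketch of the stale-state problem: the reader holds the object key it
# resolved earlier, while the current manifest entry (same offset range,
# different size_bytes after compaction) resolves to a different key.
def key_for(prefix: str, ntp: str, e: dict) -> str:
    # assumes the v3-style name format seen in the logs:
    # {base}-{committed}-{size}-{term}-v1.log.{archiver_term}
    return (f"{prefix}/{ntp}/{e['base_offset']}-{e['committed_offset']}-"
            f"{e['size_bytes']}-{e['segment_term']}-v1.log.{e['archiver_term']}")

cached_key = "92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1"
current = {"base_offset": 1181, "committed_offset": 1314,
           "size_bytes": 329529, "segment_term": 1, "archiver_term": 1}

fresh_key = key_for("92f31512", "kafka/test-topic/0_27", current)
assert fresh_key != cached_key  # 329529 vs 4211021 in the object name
print("reader's cached key is stale; it should re-resolve from the manifest")
```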
https://buildkite.com/redpanda/redpanda/builds/44066
https://buildkite.com/redpanda/redpanda/builds/44069
https://buildkite.com/redpanda/redpanda/builds/44092
https://buildkite.com/redpanda/redpanda/builds/44148
https://buildkite.com/redpanda/redpanda/builds/44163
https://buildkite.com/redpanda/redpanda/builds/44173
https://buildkite.com/redpanda/redpanda/builds/44182
https://buildkite.com/redpanda/redpanda/builds/44195
https://buildkite.com/redpanda/redpanda/builds/44203
https://buildkite.com/redpanda/redpanda/builds/44224
https://buildkite.com/redpanda/redpanda/builds/44274
https://buildkite.com/redpanda/redpanda/builds/44275
https://buildkite.com/redpanda/redpanda/builds/44276
https://buildkite.com/redpanda/redpanda/builds/44278
https://buildkite.com/redpanda/redpanda/builds/44292
https://buildkite.com/redpanda/redpanda/builds/44338
https://buildkite.com/redpanda/redpanda/builds/44910
https://buildkite.com/redpanda/redpanda/builds/44933
https://buildkite.com/redpanda/vtools/builds/11782
https://buildkite.com/redpanda/vtools/builds/11781
https://buildkite.com/redpanda/vtools/builds/11803
https://buildkite.com/redpanda/vtools/builds/11807
https://buildkite.com/redpanda/redpanda/builds/44993
https://buildkite.com/redpanda/vtools/builds/11824
https://buildkite.com/redpanda/redpanda/builds/45012
https://buildkite.com/redpanda/vtools/builds/11840
The failure is different now:
test_id: rptest.tests.cloud_storage_timing_stress_test.CloudStorageTimingStressTest.test_cloud_storage_with_partition_moves.cleanup_policy=compact.delete
status: FAIL
run time: 1 minute 15.111 seconds
RuntimeError('Failed checks: 0; Incomplete checks: 2')
Traceback (most recent call last):
File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 184, in _do_run
data = self.run_test()
File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 276, in run_test
return self.test_context.function(self.test)
File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 535, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/root/tests/rptest/services/cluster.py", line 104, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/cloud_storage_timing_stress_test.py", line 550, in test_cloud_storage_with_partition_moves
self.do_checks()
File "/root/tests/rptest/tests/cloud_storage_timing_stress_test.py", line 486, in do_checks
raise RuntimeError(
RuntimeError: Failed checks: 0; Incomplete checks: 2
I've looked into it and it seems that there is some kind of manifest inconsistency:
[ERROR - 2024-04-03 14:57:00,074 - rp_storage_tool - _decode - lineno:29]: b'thread \'main\' panicked at \'called `Result::unwrap()` on an `Err` value: SyntaxError("base_timestamp not present in PartitionManifestSegment { base_offset: 8256, committed_offset: 9978, is_compacted: false, size_bytes: 22373918, archiver_term: 1, delta_offset: Some(265), base_timestamp: Some(1712156190221), max_timestamp: None, ntp_revision: Some(28), sname_format: Some(3), segment_term: Some(1), delta_offset_end: Some(324) }")\', src/main.rs:1065:75\nnote: run with `RUST_BACKTRACE=1` environment variable to display a backtrace\n'
I'm working on it at the moment. It looks like the rp-storage-tool fails in some cases; the Redpanda state looks OK though.
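Interestingly, the quoted panic says base_timestamp is "not present" even though the dumped struct shows base_timestamp: Some(1712156190221); it is max_timestamp that is None, so the tool appears to unwrap the wrong (or a genuinely optional) field. A minimal sketch of the tolerant decoding this would need, in Python for illustration (the tool itself is Rust; field names mirror the struct in the panic):

```python
# Sketch (not the tool's actual code) of tolerant decoding: optional fields
# default to None instead of triggering an unwrap panic when absent.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PartitionManifestSegment:
    base_offset: int
    committed_offset: int
    is_compacted: bool
    size_bytes: int
    archiver_term: int
    delta_offset: Optional[int] = None
    base_timestamp: Optional[int] = None
    max_timestamp: Optional[int] = None   # None here should not be fatal
    ntp_revision: Optional[int] = None
    sname_format: Optional[int] = None
    segment_term: Optional[int] = None
    delta_offset_end: Optional[int] = None

def decode(raw: dict) -> PartitionManifestSegment:
    required = ("base_offset", "committed_offset", "is_compacted",
                "size_bytes", "archiver_term")
    missing = [f for f in required if f not in raw]
    if missing:
        raise ValueError(f"required fields missing: {missing}")
    return PartitionManifestSegment(**raw)  # absent optionals stay None
```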
https://buildkite.com/redpanda/redpanda/builds/47916
https://buildkite.com/redpanda/redpanda/builds/47919
https://buildkite.com/redpanda/vtools/builds/12943
https://buildkite.com/redpanda/vtools/builds/12994
https://buildkite.com/redpanda/redpanda/builds/48032
https://buildkite.com/redpanda/redpanda/builds/48033
https://buildkite.com/redpanda/vtools/builds/13054
https://buildkite.com/redpanda/redpanda/builds/48074
https://buildkite.com/redpanda/redpanda/builds/48087
https://buildkite.com/redpanda/vtools/builds/13072
https://buildkite.com/redpanda/vtools/builds/13079
https://buildkite.com/redpanda/redpanda/builds/48109
https://buildkite.com/redpanda/redpanda/builds/42019
https://buildkite.com/redpanda/redpanda/builds/42187
JIRA Link: CORE-1614