Closed by bharathv 3 weeks ago
This was originally triaged as a consumer crash in https://github.com/redpanda-data/redpanda/issues/12232. The crash was then fixed but the issue surfaced again. I don't see a crash in kgo this time.
On the kgo consumer side I see reads going well up until 07:09:03, then this starts to happen:
time="2023-11-30T07:09:03Z" level=debug msg="Read OK (000000.000000000000001184) on p=0 at o=1184"
time="2023-11-30T07:09:03Z" level=debug msg="Calling PollFetches (lwm=[1185] status {\"name\":\"\",\"valid_reads\":1591,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0,\"max_offsets_consumed\":{\"0\":1302}})"
time="2023-11-30T07:09:04Z" name=KgoVerifierSeqConsumer-0-281472112737776[DEBUG] opening connection to broker; addr: docker-rp-6:9092, broker: 1
time="2023-11-30T07:09:04Z" name=KgoVerifierSeqConsumer-0-281472112737776[DEBUG] connection opened to broker; addr: docker-rp-6:9092, broker: 1
time="2023-11-30T07:09:04Z" name=KgoVerifierSeqConsumer-0-281472112737776[DEBUG] connection initialized successfully; addr: docker-rp-6:9092, broker: 1
time="2023-11-30T07:09:04Z" name=KgoVerifierSeqConsumer-0-281472112737776[DEBUG] wrote Fetch v11; broker: 1, bytes_written: 133, write_wait: 606.416µs, time_to_write: 34.132µs, err: <nil>
time="2023-11-30T07:09:04Z" name=KgoVerifierSeqConsumer-0-281472112737776[DEBUG] read Fetch v11; broker: 1, bytes_read: 0, read_wait: 73.188µs, time_to_read: 3.758924ms, err: EOF
time="2023-11-30T07:09:04Z" name=KgoVerifierSeqConsumer-0-281472112737776[WARN] read from broker errored, killing connection after 0 successful responses (is SASL missing?); addr: docker-rp-6:9092, broker: 1, err: EOF
At the same time, this starts to appear on docker-rp-6:
TRACE 2023-11-30 07:09:03,830 [shard 1:fetc] http - /92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1 - client.cc:296 - chunk received, chunk length 850
DEBUG 2023-11-30 07:09:03,830 [shard 1:fetc] http - iobuf_body.cc:82 - reader - finish called
WARN 2023-11-30 07:09:03,830 [shard 1:fetc] s3 - s3_client.cc:562 - S3 GET request failed: Not Found [Accept-Ranges: bytes];[Content-Length: 445];[Content-Security-Policy: block-all-mixed-content];[Content-Type: application/xml];[Server: MinIO];[Strict-Transport-Security: max-age=31536000; includeSubDomains];[Vary: Origin];[Vary: Accept-Encoding];[X-Amz-Request-Id: 179C54EECB5DDF32];[X-Content-Type-Options: nosniff];[X-Xss-Protection: 1; mode=block];[Date: Thu, 30 Nov 2023 07:09:03 GMT];
DEBUG 2023-11-30 07:09:03,830 [shard 1:fetc] s3 - s3_client.cc:427 - NoSuchKey response received {"92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1"}
WARN 2023-11-30 07:09:03,830 [shard 1:fetc] cloud_storage - [fiber3~8~5~0|1|59998ms] - remote.cc:801 - Downloading segment from {panda-bucket-4e5927ee-8f4f-11ee-a158-0242ac101024}, {key_not_found}, segment at {"92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1"} not available
DEBUG 2023-11-30 07:09:03,830 [shard 1:fetc] client_pool - client_pool.cc:463 - releasing a client, pool size: 19, capacity: 20
WARN 2023-11-30 07:09:03,830 [shard 1:fetc] cloud_storage - [fiber49 92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1] - segment_chunk_data_source.cc:88 - failed to hydrate chunk starting at 1052608, error: cloud_storage::download_exception (NotFound)
WARN 2023-11-30 07:09:03,830 [shard 1:fetc] cloud_storage - [fiber86 kafka/test-topic/0] - remote_partition.cc:464 - exception thrown while reading from remote_partition: NotFound
DEBUG 2023-11-30 07:09:03,830 [shard 1:fetc] cloud_storage - [fiber3~8~0 kafka/test-topic/0] - remote_segment.cc:1412 - [{{172.16.16.32:53242}}] remote_segment_batch_reader::stop
DEBUG 2023-11-30 07:09:03,830 [shard 1:fetc] cloud_storage - [fiber3~8~0 kafka/test-topic/0] - remote_segment.cc:1418 - [{{172.16.16.32:53242}}] remote_segment_batch_reader::stop - parser-close
DEBUG 2023-11-30 07:09:03,830 [shard 1:fetc] cloud_storage - [fiber49 92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1] - segment_chunk_data_source.cc:48 - chunk data source destroyed
TRACE 2023-11-30 07:09:03,830 [shard 1:fetc] cloud_storage - [fiber86 kafka/test-topic/0] - remote_partition.cc:257 - Destructing reader {kafka/test-topic/0}
WARN 2023-11-30 07:09:03,831 [shard 0:main] kafka - connection_context.cc:498 - Error processing request: cloud_storage::download_exception (NotFound)
DEBUG 2023-11-30 07:09:03,831 [shard 0:main] kafka - connection_context.h:148 - Connection input_shutdown; aborting operations for 172.16.16.32:53242
TRACE 2023-11-30 07:09:03,831 [shard 0:main] kafka - connection_context.h:160 - stopping connection context for 172.16.16.32:53242
TRACE 2023-11-30 07:09:03,831 [shard 0:main] kafka - connection_context.h:168 - stopped connection context for 172.16.16.32:53242
TRACE 2023-11-30 07:09:03,831 [shard 0:main] kafka - server.cc:178 - shutting down connection 172.16.16.32:53242
So Redpanda can't find this segment: 92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1.
Yet cloud_storage_diagnostic.zip lists the segment under 70000000_meta_kafka_test-topic_0_27_manifest.bin.0.1476.0.1442.1701328135741.1701328140375:
"1181-1-v1.log": {
"base_offset": 1181,
"committed_offset": 1314,
"is_compacted": true,
"size_bytes": 329529,
"archiver_term": 1,
"delta_offset": 22,
"base_timestamp": 1701328139521,
"max_timestamp": 1701328139769,
"ntp_revision": 27,
"sname_format": 3,
"segment_term": 1,
"delta_offset_end": 28
},
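For reference, a minimal sketch of how the delta fields in this entry map the segment's raw (redpanda) offsets to kafka offsets, assuming kafka_offset = raw_offset − delta:

```python
# Minimal sketch of the delta-offset arithmetic, assuming
# kafka_offset = raw_offset - delta (values from the manifest entry above).
base_offset, committed_offset = 1181, 1314
delta_offset, delta_offset_end = 22, 28

kafka_first = base_offset - delta_offset            # 1181 - 22 = 1159
kafka_last = committed_offset - delta_offset_end    # 1314 - 28 = 1286
print(f"segment serves kafka offsets [{kafka_first}, {kafka_last}]")
```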
In fact, we see the segment being deleted:
49394:INFO 2023-11-30 07:09:01,938 [shard 1:au ] archival - [fiber4 kafka/test-topic/0] - ntp_archiver_service.cc:2695 - Deleting segment from cloud storage: {"92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1"}
and the manifest being uploaded just after:
DEBUG 2023-11-30 07:09:01,941 [shard 1:au ] cloud_storage - [fiber4~51~0|1|10000ms] - remote.cc:374 - Uploading manifest {"70000000/meta/kafka/test-topic/0_27/manifest.bin"} to the panda-bucket-4e5927ee-8f4f-11ee-a158-0242ac101024
It seems like there is a spillover going out of sync.
Some thoughts:
The consumer tries to read kafka offset 1185 at 07:09:03, and it's contained in the remote segment "92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1" (the kafka offset is between the deltas of this segment: per the delta arithmetic sketched above, the segment covers kafka offsets [1159, 1286], which includes 1185):
TRACE 2023-11-30 07:09:03,828 [shard 0:fetc] kafka - handler.h:69 - [client_id: {KgoVerifierSeqConsumer-0-281472112737776}] handling fetch v11 request {replica_id=-1 max_wait_ms=5000 min_bytes=1 max_bytes=838860 isolation_level=read_uncommitted session_id=2 session_epoch=27 topics={{name={test-topic} fetch_partitions={{partition_index=0 current_leader_epoch=1 fetch_offset=1185 log_start_offset=-1 max_bytes=838860}}}} forgotten={} rack_id={}}
But that segment was deleted from cloud storage at 07:09:01:
49394:INFO 2023-11-30 07:09:01,938 [shard 1:au ] archival - [fiber4 kafka/test-topic/0] - ntp_archiver_service.cc:2695 - Deleting segment from cloud storage: {"92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1"}
At 07:09:00 the cloud manifest start offset is 0 and there are no spillover manifests. At 07:09:10 the cloud manifest start offset is 1477 and there is a spillover manifest that covers offset 1181.
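A minimal sketch of the routing implied here (a hypothetical simplification, not the actual remote_partition code): reads at or above the STM manifest's start offset are served from it, and older reads fall through to whichever spillover manifest covers the offset. A fetch landing in the window between the two states above finds neither:

```python
# Hypothetical simplification of offset routing between the STM manifest
# and spillover manifests; not the actual remote_partition code.
def route(offset: int, stm_start: int, spillovers: list[tuple[int, int]]) -> str:
    if offset >= stm_start:
        return "STM manifest"
    for base, last in spillovers:
        if base <= offset <= last:
            return f"spillover manifest [{base}, {last}]"
    return "not found"

print(route(1185, 0, []))              # 07:09:00 state -> STM manifest
print(route(1185, 1477, [(0, 1476)]))  # 07:09:10 state -> spillover [0, 1476]
print(route(1185, 1477, []))           # transitional state -> not found
```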
The log line "Deleting segment..." is printed from ntp_archiver_stm::garbage_collect(), which in turn can be called only when there are no spillover manifests (and there is a mutex to protect this); the segments that end up in the to_remove list are…
But the garbage collection code is protected by a mutex.
The alternative is that replica movement is racing with this mechanism, but the test is not moving the replica that is serving the request.
The linked PR adds some precondition checks around garbage_collect, the idea being that it should assert in case of a race condition.
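A minimal sketch of that precondition idea (hypothetical; the names and structure are illustrative, not the actual ntp_archiver code):

```python
# Hypothetical sketch of a precondition check around garbage collection:
# refuse to delete anything while spillover manifests exist, since their
# segments may still be referenced by readers.
import threading

class Archiver:
    def __init__(self) -> None:
        self._gc_mutex = threading.Lock()   # stands in for the archiver mutex
        self.spillover_manifests: list = []
        self.to_remove: list[str] = []

    def garbage_collect(self) -> None:
        with self._gc_mutex:
            # The kind of precondition the linked PR is about: GC should
            # assert if reached while spillover manifests are present.
            assert not self.spillover_manifests, \
                "garbage_collect ran with spillover manifests present"
            for key in self.to_remove:
                self._delete_from_cloud(key)

    def _delete_from_cloud(self, key: str) -> None:
        print(f"Deleting segment from cloud storage: {key!r}")  # stub
```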
The issue is https://github.com/redpanda-data/redpanda/issues/15422, with the fix in https://github.com/redpanda-data/redpanda/pull/15423.
The smoking gun is that the remote segment is trying to consume 1181-1314-4211021-1-v1.log.1,
but the spillover manifest has a different object for that range:
"1181-1-v1.log": {
"base_offset": 1181,
"committed_offset": 1314,
"is_compacted": true,
"size_bytes": 329529,
"archiver_term": 1,
"delta_offset": 22,
"base_timestamp": 1701328139521,
"max_timestamp": 1701328139769,
"ntp_revision": 27,
"sname_format": 3,
"segment_term": 1,
"delta_offset_end": 28
},
The size_bytes is different (4211021 vs 329529), but remote_segment does not account for that and keeps stale state.
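A minimal sketch of the mismatch, assuming the object key embeds size_bytes as in the v3-style name format seen in the logs (names and the key-derivation helper here are illustrative):

```python
# Sketch of the stale-state problem: the reader holds the object key it
# resolved earlier, while the current manifest entry (same offset range,
# different size_bytes after compaction) resolves to a different key.
def key_for(prefix: str, ntp: str, e: dict) -> str:
    # assumes the v3-style name format seen in the logs:
    # {base}-{committed}-{size}-{term}-v1.log.{archiver_term}
    return (f"{prefix}/{ntp}/{e['base_offset']}-{e['committed_offset']}-"
            f"{e['size_bytes']}-{e['segment_term']}-v1.log.{e['archiver_term']}")

cached_key = "92f31512/kafka/test-topic/0_27/1181-1314-4211021-1-v1.log.1"
current = {"base_offset": 1181, "committed_offset": 1314,
           "size_bytes": 329529, "segment_term": 1, "archiver_term": 1}

fresh_key = key_for("92f31512", "kafka/test-topic/0_27", current)
assert fresh_key != cached_key  # 329529 vs 4211021 in the object name
print("reader's cached key is stale; it should re-resolve from the manifest")
```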
https://buildkite.com/redpanda/redpanda/builds/44066
https://buildkite.com/redpanda/redpanda/builds/44069
https://buildkite.com/redpanda/redpanda/builds/44092
https://buildkite.com/redpanda/redpanda/builds/44148
https://buildkite.com/redpanda/redpanda/builds/44163
https://buildkite.com/redpanda/redpanda/builds/44173
https://buildkite.com/redpanda/redpanda/builds/44182
https://buildkite.com/redpanda/redpanda/builds/44195
https://buildkite.com/redpanda/redpanda/builds/44203
https://buildkite.com/redpanda/redpanda/builds/44224
https://buildkite.com/redpanda/redpanda/builds/44274
https://buildkite.com/redpanda/redpanda/builds/44275
https://buildkite.com/redpanda/redpanda/builds/44276
https://buildkite.com/redpanda/redpanda/builds/44278
https://buildkite.com/redpanda/redpanda/builds/44292
https://buildkite.com/redpanda/redpanda/builds/44338
https://buildkite.com/redpanda/redpanda/builds/44910
https://buildkite.com/redpanda/redpanda/builds/44933
https://buildkite.com/redpanda/vtools/builds/11782
https://buildkite.com/redpanda/vtools/builds/11781
https://buildkite.com/redpanda/vtools/builds/11803
https://buildkite.com/redpanda/vtools/builds/11807
https://buildkite.com/redpanda/redpanda/builds/44993
https://buildkite.com/redpanda/vtools/builds/11824
https://buildkite.com/redpanda/redpanda/builds/45012
https://buildkite.com/redpanda/vtools/builds/11840
The failure is different now:
test_id: rptest.tests.cloud_storage_timing_stress_test.CloudStorageTimingStressTest.test_cloud_storage_with_partition_moves.cleanup_policy=compact.delete
status: FAIL
run time: 1 minute 15.111 seconds
RuntimeError('Failed checks: 0; Incomplete checks: 2')
Traceback (most recent call last):
File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 184, in _do_run
data = self.run_test()
File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 276, in run_test
return self.test_context.function(self.test)
File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 535, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/root/tests/rptest/services/cluster.py", line 104, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/cloud_storage_timing_stress_test.py", line 550, in test_cloud_storage_with_partition_moves
self.do_checks()
File "/root/tests/rptest/tests/cloud_storage_timing_stress_test.py", line 486, in do_checks
raise RuntimeError(
RuntimeError: Failed checks: 0; Incomplete checks: 2
I've looked into it and it seems that there is some kind of manifest inconsistency:
[ERROR - 2024-04-03 14:57:00,074 - rp_storage_tool - _decode - lineno:29]: b'thread \'main\' panicked at \'called `Result::unwrap()` on an `Err` value: SyntaxError("base_timestamp not present in PartitionManifestSegment { base_offset: 8256, committed_offset: 9978, is_compacted: false, size_bytes: 22373918, archiver_term: 1, delta_offset: Some(265), base_timestamp: Some(1712156190221), max_timestamp: None, ntp_revision: Some(28), sname_format: Some(3), segment_term: Some(1), delta_offset_end: Some(324) }")\', src/main.rs:1065:75\nnote: run with `RUST_BACKTRACE=1` environment variable to display a backtrace\n'
I'm working on it at the moment. It looks like the rp-storage-tool fails in some cases; the Redpanda state looks OK though.
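Interestingly, the quoted panic says base_timestamp is "not present" even though the dumped struct shows base_timestamp: Some(1712156190221); it is max_timestamp that is None, so the tool appears to unwrap the wrong (or a genuinely optional) field. A minimal sketch of the tolerant decoding this would need, in Python for illustration (the tool itself is Rust; field names mirror the struct in the panic):

```python
# Sketch (not the tool's actual code) of tolerant decoding: optional fields
# default to None instead of triggering an unwrap panic when absent.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PartitionManifestSegment:
    base_offset: int
    committed_offset: int
    is_compacted: bool
    size_bytes: int
    archiver_term: int
    delta_offset: Optional[int] = None
    base_timestamp: Optional[int] = None
    max_timestamp: Optional[int] = None   # None here should not be fatal
    ntp_revision: Optional[int] = None
    sname_format: Optional[int] = None
    segment_term: Optional[int] = None
    delta_offset_end: Optional[int] = None

def decode(raw: dict) -> PartitionManifestSegment:
    required = ("base_offset", "committed_offset", "is_compacted",
                "size_bytes", "archiver_term")
    missing = [f for f in required if f not in raw]
    if missing:
        raise ValueError(f"required fields missing: {missing}")
    return PartitionManifestSegment(**raw)  # absent optionals stay None
```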
https://buildkite.com/redpanda/redpanda/builds/47916
https://buildkite.com/redpanda/redpanda/builds/47919
https://buildkite.com/redpanda/vtools/builds/12943
https://buildkite.com/redpanda/vtools/builds/12994
https://buildkite.com/redpanda/redpanda/builds/48032
https://buildkite.com/redpanda/redpanda/builds/48033
https://buildkite.com/redpanda/vtools/builds/13054
https://buildkite.com/redpanda/redpanda/builds/48074
https://buildkite.com/redpanda/redpanda/builds/48087
https://buildkite.com/redpanda/vtools/builds/13072
https://buildkite.com/redpanda/vtools/builds/13079
https://buildkite.com/redpanda/redpanda/builds/48109
https://buildkite.com/redpanda/redpanda/builds/42019
https://buildkite.com/redpanda/redpanda/builds/42187
JIRA Link: CORE-1614