Missing manifest here was 40000000/meta/kafka/topic-unubmkojgx/57_37/manifest.json
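To double-check the key really is absent from the bucket, something like the following works (a sketch: the bucket name is taken from the log lines below, and this assumes the test ran against AWS S3 rather than another backend):

$ aws s3api head-object --bucket panda-bucket-acc26322-1822-11ed-acdc-5f5f720889de --key 40000000/meta/kafka/topic-unubmkojgx/57_37/manifest.json

A 404 / "Not Found" here would confirm the manifest is indeed missing.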
RedpandaService-0-140401461113088/ip-172-31-47-80/redpanda.log
8621:INFO 2022-08-09 20:35:07,197 [shard 2] storage - segment.cc:614 - Creating new segment /var/lib/redpanda/data/kafka/topic-unubmkojgx/57_37/0-1-v1.log
9023:DEBUG 2022-08-09 20:35:07,244 [shard 2] cloud_storage - [fiber175~0~0|1|10000ms] - remote.cc:203 - Download manifest "e0000000/meta/kafka/topic-unubmkojgx/57_37/manifest.json"
9376:INFO 2022-08-09 20:35:07,280 [shard 2] cloud_storage - [fiber175~0~0|1|9964ms] - remote.cc:74 - NoSuchKey response received {"e0000000/meta/kafka/topic-unubmkojgx/57_37/manifest.json"}
9377:DEBUG 2022-08-09 20:35:07,280 [shard 2] cloud_storage - [fiber175~0~0|1|9964ms] - remote.cc:259 - Manifest from {panda-bucket-acc26322-1822-11ed-acdc-5f5f720889de} {key_not_found}, manifest at {"e0000000/meta/kafka/topic-unubmkojgx/57_37/manifest.json"} not found
...
INFO 2022-08-09 20:35:07,280 [shard 2] cloud_storage - [fiber175~0~0|1|9964ms] - remote.cc:74 - NoSuchKey response received {"e0000000/meta/kafka/topic-unubmkojgx/57_37/manifest.json"}
DEBUG 2022-08-09 20:35:07,280 [shard 2] cloud_storage - [fiber175~0~0|1|9964ms] - remote.cc:259 - Manifest from {panda-bucket-acc26322-1822-11ed-acdc-5f5f720889de} {key_not_found}, manifest at {"e0000000/meta/kafka/topic-unubmkojgx/57_37/manifest.json"} not found
INFO 2022-08-09 20:35:07,280 [shard 2] archival - [fiber1] - service.cc:267 - Start archiving new partition {kafka/topic-unubmkojgx/57}
DEBUG 2022-08-09 20:35:07,280 [shard 2] archival - [fiber175 kafka/topic-unubmkojgx/57] - ntp_archiver_service.cc:689 - Uploading next candidates called for {kafka/topic-unubmkojgx/57}
DEBUG 2022-08-09 20:35:07,280 [shard 2] archival - [fiber175 kafka/topic-unubmkojgx/57] - ntp_archiver_service.cc:549 - scheduling uploads, start_upload_offset: 0, last_stable_offset: 1
DEBUG 2022-08-09 20:35:07,280 [shard 2] archival - archival_policy.cc:87 - Upload policy for {kafka/topic-unubmkojgx/57} invoked, start offset: 0
DEBUG 2022-08-09 20:35:07,280 [shard 2] archival - archival_policy.cc:141 - Upload policy for {kafka/topic-unubmkojgx/57}: can't find candidate, upload deadline not reached
DEBUG 2022-08-09 20:35:07,280 [shard 2] archival - [fiber175 kafka/topic-unubmkojgx/57] - ntp_archiver_service.cc:425 - upload candidate not found, start_upload_offset: 0, last_stable_offset: 1
DEBUG 2022-08-09 20:35:07,280 [shard 2] archival - [fiber175 kafka/topic-unubmkojgx/57] - ntp_archiver_service.cc:607 - no uploads started, returning
Searching for this particular partition in the logs, it looks like this may just be that the affected partition didn't have any data batches (only non-data batches) to upload? Edit: this is very unlikely, given kgo-verifier randomly chooses partitions on produce, and these tests produce ~10000 messages per partition.
$ ag 'Upload policy.*\/57\}' -c
RedpandaService-0-140401461113088/ip-172-31-40-168/redpanda.log:1428
RedpandaService-0-140401461113088/ip-172-31-47-80/redpanda.log:14
$ ag 'Upload policy.*\/57\}' RedpandaService-0-140401461113088/ip-172-31-40-168/ | tail | cut -d "]" -f 2
archival - archival_policy.cc:87 - Upload policy for {kafka/topic-isjhxhlnkv/57} invoked, start offset: 10168
archival - archival_policy.cc:162 - Upload policy for {kafka/topic-isjhxhlnkv/57}: can't find candidate, only non-data batches to upload (kafka start_offset: 10156, kafka last_stable_offset: 10156)
archival - archival_policy.cc:87 - Upload policy for {kafka/topic-isjhxhlnkv/57} invoked, start offset: 10168
archival - archival_policy.cc:162 - Upload policy for {kafka/topic-isjhxhlnkv/57}: can't find candidate, only non-data batches to upload (kafka start_offset: 10156, kafka last_stable_offset: 10156)
archival - archival_policy.cc:87 - Upload policy for {kafka/topic-isjhxhlnkv/57} invoked, start offset: 10168
archival - archival_policy.cc:162 - Upload policy for {kafka/topic-isjhxhlnkv/57}: can't find candidate, only non-data batches to upload (kafka start_offset: 10156, kafka last_stable_offset: 10156)
archival - archival_policy.cc:87 - Upload policy for {kafka/topic-isjhxhlnkv/57} invoked, start offset: 10168
archival - archival_policy.cc:162 - Upload policy for {kafka/topic-isjhxhlnkv/57}: can't find candidate, only non-data batches to upload (kafka start_offset: 10156, kafka last_stable_offset: 10156)
archival - archival_policy.cc:87 - Upload policy for {kafka/topic-isjhxhlnkv/57} invoked, start offset: 10168
archival - archival_policy.cc:162 - Upload policy for {kafka/topic-isjhxhlnkv/57}: can't find candidate, only non-data batches to upload (kafka start_offset: 10156, kafka last_stable_offset: 10156)
fabbri@bougietop:~/Devnotes/Results/extreme-recovery/tests/results/2022-08-09--002/ExtremeRecoveryTest/test_recovery_scale/11$
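Note that the pattern above matches partition 57 of any topic, and the tail output is actually from topic-isjhxhlnkv. A narrower pattern scoped to the affected topic (illustrative, output not shown) isolates the upload-policy decisions for the partition in question:

$ ag 'Upload policy.*topic-unubmkojgx/57\}' RedpandaService-0-140401461113088/ip-172-31-47-80/redpanda.log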
It looks like this is the point where things went wrong:
INFO 2022-08-09 20:35:19,279 [shard 2] raft - [group_id:58, {kafka/topic-unubmkojgx/57}] consensus.cc:2933 - Starting leadership transfer from {id: {2}, revision: {37}} to {id: {4}, revision: {37}} in term 1
INFO 2022-08-09 20:35:19,279 [shard 2] raft - [group_id:58, {kafka/topic-unubmkojgx/57}] consensus.cc:2836 - transfer leadership: waiting for node {id: {4}, revision: {37}} to catch up
INFO 2022-08-09 20:35:19,282 [shard 2] raft - [group_id:58, {kafka/topic-unubmkojgx/57}] consensus.cc:187 - [external_stepdown] Stepping down as leader in term 1, dirty offset 709
INFO 2022-08-09 20:35:19,286 [shard 2] raft - [group_id:58, {kafka/topic-unubmkojgx/57}] consensus.cc:2852 - transfer leadership: finished waiting on node {id: {4}, revision: {37}} recovery
WARN 2022-08-09 20:35:19,286 [shard 2] raft - [group_id:58, {kafka/topic-unubmkojgx/57}] consensus.cc:2984 - Cannot transfer leadership from non-leader
WARN 2022-08-09 20:35:19,287 [shard 2] raft - [group_id:58, {kafka/topic-unubmkojgx/57}] consensus.cc:2716 - Unable to transfer leadership to {4}: raft::errc::not_leader
DEBUG 2022-08-09 20:35:20,102 [shard 2] archival - [fiber175 kafka/topic-unubmkojgx/57] - ntp_archiver_service.cc:129 - upload loop stopped
INFO 2022-08-09 20:35:21,736 [shard 2] raft - [group_id:58, {kafka/topic-unubmkojgx/57}] vote_stm.cc:77 - vote reply from {id: {4}, revision: {37}} - {term:{2}, target_node_id{id: {2}, revision: {37}}, vote_granted: 1, log_ok:1}
INFO 2022-08-09 20:35:21,737 [shard 2] raft - [group_id:58, {kafka/topic-unubmkojgx/57}] vote_stm.cc:256 - becoming the leader term:2
INFO 2022-08-09 20:35:21,738 [shard 2] raft - [group_id:58, {kafka/topic-unubmkojgx/57}] vote_stm.cc:77 - vote reply from {id: {3}, revision: {37}} - {term:{2}, target_node_id{id: {2}, revision: {37}}, vote_granted: 1, log_ok:1}
INFO 2022-08-09 20:35:21,754 [shard 2] cluster - ntp: {kafka/topic-unubmkojgx/57} - archival_metadata_stm.cc:408 - creating snapshot at offset: 709, remote start_offset: -9223372036854775808, last_offset: -9223372036854775808
INFO 2022-08-09 20:35:21,816 [shard 2] raft - [group_id:58, {kafka/topic-unubmkojgx/57}] vote_stm.cc:271 - became the leader term:2
DEBUG 2022-08-09 20:35:31,437 [shard 2] cloud_storage - [fiber39 kafka/topic-unubmkojgx/57] - remote_partition.cc:395 - collecting stale materialized segments, 0 segments materialized, 0 segments total
DEBUG 2022-08-09 20:35:31,437 [shard 2] cloud_storage - [fiber39 kafka/topic-unubmkojgx/57] - remote_partition.cc:433 - found 0 eviction candidates
The node stepped down and the archiver upload loop stopped; the node became leader again a couple of seconds later, but the archiver was never restarted.
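A quick way to confirm this from the same log (a sketch built from the message texts visible in the excerpts above, not from the source) is to compare archiver starts against upload-loop stops for this partition:

$ ag 'Start archiving new partition \{kafka/topic-unubmkojgx/57\}' RedpandaService-0-140401461113088/ip-172-31-47-80/redpanda.log
$ ag 'upload loop stopped' RedpandaService-0-140401461113088/ip-172-31-47-80/redpanda.log | grep 'topic-unubmkojgx/57'

If the last "upload loop stopped" has no later "Start archiving" for the same partition, the archiver never came back after the leadership bounce.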
Later on the partition did get some data, but it was never uploaded. The segment names logged by storage later on show a large base offset, i.e. the partition did receive data:
INFO 2022-08-09 20:37:11,466 [shard 2] storage - segment.cc:614 - Creating new segment /var/lib/redpanda/data/kafka/topic-unubmkojgx/57_37/9355-2-v1.log
So it should have been uploaded, but wasn't.
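For reference, segment file names encode base offset and term as <base_offset>-<term>-v1.log, so the base offsets storage created for this partition can be pulled straight from the "Creating new segment" lines (a sketch against the same log file):

$ ag 'Creating new segment.*topic-unubmkojgx/57_37' RedpandaService-0-140401461113088/ip-172-31-47-80/redpanda.log | sed 's#.*/57_37/##' | cut -d- -f1

From the excerpts above that gives at least 0 and 9355, i.e. the partition reached offset ~9355 but nothing was ever uploaded.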
I've reproduced this a couple of times when working on #5818.
This "scale test" for recovery from cloud storage bascially does this:
redpanda.remote.recovery
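For context, recovery is requested at topic creation time via that property; an illustrative invocation (not copied from the test, topic name taken from the logs above):

$ rpk topic create topic-unubmkojgx -c redpanda.remote.recovery=true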
The test occasionally fails in the recovery step (step 3): it waits a long time (roughly two hours for 12 GiB of data) but then fails with one missing segment/manifest file, like the one above. Both times I reproduced this, there was exactly one segment missing when the test timed out and failed: