Missing manifest here was 40000000/meta/kafka/topic-unubmkojgx/57_37/manifest.json
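To double-check the key really is absent from the bucket, something like the following works (a sketch: the bucket name is taken from the log lines below, and this assumes the test ran against AWS S3 rather than another backend):

$ aws s3api head-object --bucket panda-bucket-acc26322-1822-11ed-acdc-5f5f720889de --key 40000000/meta/kafka/topic-unubmkojgx/57_37/manifest.json

A 404 / "Not Found" here would confirm the manifest is indeed missing.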
RedpandaService-0-140401461113088/ip-172-31-47-80/redpanda.log
8621:INFO 2022-08-09 20:35:07,197 [shard 2] storage - segment.cc:614 - Creating new segment /var/lib/redpanda/data/kafka/topic-unubmkojgx/57_37/0-1-v1.log
9023:DEBUG 2022-08-09 20:35:07,244 [shard 2] cloud_storage - [fiber175~0~0|1|10000ms] - remote.cc:203 - Download manifest "e0000000/meta/kafka/topic-unubmkojgx/57_37/manifest.json"
9376:INFO 2022-08-09 20:35:07,280 [shard 2] cloud_storage - [fiber175~0~0|1|9964ms] - remote.cc:74 - NoSuchKey response received {"e0000000/meta/kafka/topic-unubmkojgx/57_37/manifest.json"}
9377:DEBUG 2022-08-09 20:35:07,280 [shard 2] cloud_storage - [fiber175~0~0|1|9964ms] - remote.cc:259 - Manifest from {panda-bucket-acc26322-1822-11ed-acdc-5f5f720889de} {key_not_found}, manifest at {"e0000000/meta/kafka/topic-unubmkojgx/57_37/manifest.json"} not found
...
INFO 2022-08-09 20:35:07,280 [shard 2] cloud_storage - [fiber175~0~0|1|9964ms] - remote.cc:74 - NoSuchKey response received {"e0000000/meta/kafka/topic-unubmkojgx/57_37/manifest.json"}
DEBUG 2022-08-09 20:35:07,280 [shard 2] cloud_storage - [fiber175~0~0|1|9964ms] - remote.cc:259 - Manifest from {panda-bucket-acc26322-1822-11ed-acdc-5f5f720889de} {key_not_found}, manifest at {"e0000000/meta/kafka/topic-unubmkojgx/57_37/manifest.json"} not found
INFO 2022-08-09 20:35:07,280 [shard 2] archival - [fiber1] - service.cc:267 - Start archiving new partition {kafka/topic-unubmkojgx/57}
DEBUG 2022-08-09 20:35:07,280 [shard 2] archival - [fiber175 kafka/topic-unubmkojgx/57] - ntp_archiver_service.cc:689 - Uploading next candidates called for {kafka/topic-unubmkojgx/57}
DEBUG 2022-08-09 20:35:07,280 [shard 2] archival - [fiber175 kafka/topic-unubmkojgx/57] - ntp_archiver_service.cc:549 - scheduling uploads, start_upload_offset: 0, last_stable_offset: 1
DEBUG 2022-08-09 20:35:07,280 [shard 2] archival - archival_policy.cc:87 - Upload policy for {kafka/topic-unubmkojgx/57} invoked, start offset: 0
DEBUG 2022-08-09 20:35:07,280 [shard 2] archival - archival_policy.cc:141 - Upload policy for {kafka/topic-unubmkojgx/57}: can't find candidate, upload deadline not reached
DEBUG 2022-08-09 20:35:07,280 [shard 2] archival - [fiber175 kafka/topic-unubmkojgx/57] - ntp_archiver_service.cc:425 - upload candidate not found, start_upload_offset: 0, last_stable_offset: 1
DEBUG 2022-08-09 20:35:07,280 [shard 2] archival - [fiber175 kafka/topic-unubmkojgx/57] - ntp_archiver_service.cc:607 - no uploads started, returning
Searching for this particular partition in the logs, it looks like this may just be that the affected partition didn't have any data batches (only non-data batches) to upload? Edit: this is very unlikely, given kgo-verifier randomly chooses partitions on produce, and these tests produce ~10000 messages per partition.
$ ag 'Upload policy.*\/57\}' -c
RedpandaService-0-140401461113088/ip-172-31-40-168/redpanda.log:1428
RedpandaService-0-140401461113088/ip-172-31-47-80/redpanda.log:14
$ ag 'Upload policy.*\/57\}' RedpandaService-0-140401461113088/ip-172-31-40-168/ | tail | cut -d "]" -f 2
archival - archival_policy.cc:87 - Upload policy for {kafka/topic-isjhxhlnkv/57} invoked, start offset: 10168
archival - archival_policy.cc:162 - Upload policy for {kafka/topic-isjhxhlnkv/57}: can't find candidate, only non-data batches to upload (kafka start_offset: 10156, kafka last_stable_offset: 10156)
archival - archival_policy.cc:87 - Upload policy for {kafka/topic-isjhxhlnkv/57} invoked, start offset: 10168
archival - archival_policy.cc:162 - Upload policy for {kafka/topic-isjhxhlnkv/57}: can't find candidate, only non-data batches to upload (kafka start_offset: 10156, kafka last_stable_offset: 10156)
archival - archival_policy.cc:87 - Upload policy for {kafka/topic-isjhxhlnkv/57} invoked, start offset: 10168
archival - archival_policy.cc:162 - Upload policy for {kafka/topic-isjhxhlnkv/57}: can't find candidate, only non-data batches to upload (kafka start_offset: 10156, kafka last_stable_offset: 10156)
archival - archival_policy.cc:87 - Upload policy for {kafka/topic-isjhxhlnkv/57} invoked, start offset: 10168
archival - archival_policy.cc:162 - Upload policy for {kafka/topic-isjhxhlnkv/57}: can't find candidate, only non-data batches to upload (kafka start_offset: 10156, kafka last_stable_offset: 10156)
archival - archival_policy.cc:87 - Upload policy for {kafka/topic-isjhxhlnkv/57} invoked, start offset: 10168
archival - archival_policy.cc:162 - Upload policy for {kafka/topic-isjhxhlnkv/57}: can't find candidate, only non-data batches to upload (kafka start_offset: 10156, kafka last_stable_offset: 10156)
fabbri@bougietop:~/Devnotes/Results/extreme-recovery/tests/results/2022-08-09--002/ExtremeRecoveryTest/test_recovery_scale/11$
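Note that the pattern above matches partition 57 of any topic, and the tail output is actually from topic-isjhxhlnkv. A narrower pattern scoped to the affected topic (illustrative, output not shown) isolates the upload-policy decisions for the partition in question:

$ ag 'Upload policy.*topic-unubmkojgx/57\}' RedpandaService-0-140401461113088/ip-172-31-47-80/redpanda.log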
It looks like this is the point where things went wrong:
INFO 2022-08-09 20:35:19,279 [shard 2] raft - [group_id:58, {kafka/topic-unubmkojgx/57}] consensus.cc:2933 - Starting leadership transfer from {id: {2}, revision: {37}} to {id: {4}, revision: {37}} in term 1
INFO 2022-08-09 20:35:19,279 [shard 2] raft - [group_id:58, {kafka/topic-unubmkojgx/57}] consensus.cc:2836 - transfer leadership: waiting for node {id: {4}, revision: {37}} to catch up
INFO 2022-08-09 20:35:19,282 [shard 2] raft - [group_id:58, {kafka/topic-unubmkojgx/57}] consensus.cc:187 - [external_stepdown] Stepping down as leader in term 1, dirty offset 709
INFO 2022-08-09 20:35:19,286 [shard 2] raft - [group_id:58, {kafka/topic-unubmkojgx/57}] consensus.cc:2852 - transfer leadership: finished waiting on node {id: {4}, revision: {37}} recovery
WARN 2022-08-09 20:35:19,286 [shard 2] raft - [group_id:58, {kafka/topic-unubmkojgx/57}] consensus.cc:2984 - Cannot transfer leadership from non-leader
WARN 2022-08-09 20:35:19,287 [shard 2] raft - [group_id:58, {kafka/topic-unubmkojgx/57}] consensus.cc:2716 - Unable to transfer leadership to {4}: raft::errc::not_leader
DEBUG 2022-08-09 20:35:20,102 [shard 2] archival - [fiber175 kafka/topic-unubmkojgx/57] - ntp_archiver_service.cc:129 - upload loop stopped
INFO 2022-08-09 20:35:21,736 [shard 2] raft - [group_id:58, {kafka/topic-unubmkojgx/57}] vote_stm.cc:77 - vote reply from {id: {4}, revision: {37}} - {term:{2}, target_node_id{id: {2}, revision: {37}}, vote_granted: 1, log_ok:1}
INFO 2022-08-09 20:35:21,737 [shard 2] raft - [group_id:58, {kafka/topic-unubmkojgx/57}] vote_stm.cc:256 - becoming the leader term:2
INFO 2022-08-09 20:35:21,738 [shard 2] raft - [group_id:58, {kafka/topic-unubmkojgx/57}] vote_stm.cc:77 - vote reply from {id: {3}, revision: {37}} - {term:{2}, target_node_id{id: {2}, revision: {37}}, vote_granted: 1, log_ok:1}
INFO 2022-08-09 20:35:21,754 [shard 2] cluster - ntp: {kafka/topic-unubmkojgx/57} - archival_metadata_stm.cc:408 - creating snapshot at offset: 709, remote start_offset: -9223372036854775808, last_offset: -9223372036854775808
INFO 2022-08-09 20:35:21,816 [shard 2] raft - [group_id:58, {kafka/topic-unubmkojgx/57}] vote_stm.cc:271 - became the leader term:2
DEBUG 2022-08-09 20:35:31,437 [shard 2] cloud_storage - [fiber39 kafka/topic-unubmkojgx/57] - remote_partition.cc:395 - collecting stale materialized segments, 0 segments materialized, 0 segments total
DEBUG 2022-08-09 20:35:31,437 [shard 2] cloud_storage - [fiber39 kafka/topic-unubmkojgx/57] - remote_partition.cc:433 - found 0 eviction candidates
The node stepped down and the archiver upload loop stopped; the node became leader again a couple of seconds later, but the archiver was never restarted.
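A quick way to confirm this from the same log (a sketch built from the message texts visible in the excerpts above, not from the source) is to compare archiver starts against upload-loop stops for this partition:

$ ag 'Start archiving new partition \{kafka/topic-unubmkojgx/57\}' RedpandaService-0-140401461113088/ip-172-31-47-80/redpanda.log
$ ag 'upload loop stopped' RedpandaService-0-140401461113088/ip-172-31-47-80/redpanda.log | grep 'topic-unubmkojgx/57'

If the last "upload loop stopped" has no later "Start archiving" for the same partition, the archiver never came back after the leadership bounce.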
Later on the partition did get some data, but it was never uploaded. The segment names logged by storage later on show a large base offset, i.e. the partition did receive data:
INFO 2022-08-09 20:37:11,466 [shard 2] storage - segment.cc:614 - Creating new segment /var/lib/redpanda/data/kafka/topic-unubmkojgx/57_37/9355-2-v1.log
So it should have been uploaded, but wasn't.
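For reference, segment file names encode base offset and term as <base_offset>-<term>-v1.log, so the base offsets storage created for this partition can be pulled straight from the "Creating new segment" lines (a sketch against the same log file):

$ ag 'Creating new segment.*topic-unubmkojgx/57_37' RedpandaService-0-140401461113088/ip-172-31-47-80/redpanda.log | sed 's#.*/57_37/##' | cut -d- -f1

From the excerpts above that gives at least 0 and 9355, i.e. the partition reached offset ~9355 but nothing was ever uploaded.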
I've reproduced this a couple of times when working on #5818.
This "scale test" for recovery from cloud storage bascially does this:
redpanda.remote.recovery
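For context, recovery is requested at topic creation time via that property; an illustrative invocation (not copied from the test, topic name taken from the logs above):

$ rpk topic create topic-unubmkojgx -c redpanda.remote.recovery=true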
The test occasionally fails in the recovery step (step 3): it waits a long time (roughly two hours for 12 GiB of data) but then fails with one missing segment/manifest file, like the one above. Both times I reproduced this, there was exactly one segment missing when the test timed out and failed: