redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com
9.61k stars 585 forks source link

CI Failure broken semaphore in `TestReadReplicaService.test_simple_end_to_end` #8935

Closed VladLazar closed 1 year ago

VladLazar commented 1 year ago

https://buildkite.com/redpanda/redpanda/builds/23389#01865ae4-f555-4215-b16e-b077c56c4c24

I don't think this issue is specific to a test. Seems to be a more general problem with the cloud upload path.

test_id:    rptest.tests.read_replica_e2e_test.TestReadReplicaService.test_simple_end_to_end.partition_count=10.min_records=10000.cloud_storage_type=CloudStorageType.S3
status:     FAIL
run time:   1 minute 5.745 seconds

test_id:    rptest.tests.multi_restarts_with_archival_test.MultiRestartTest.test_recovery_after_multiple_restarts.cloud_storage_type=CloudStorageType.ABS
status:     FAIL
run time:   3 minutes 48.806 seconds
test_id:    rptest.tests.read_replica_e2e_test.TestReadReplicaService.test_simple_end_to_end.partition_count=10.min_records=10000.cloud_storage_type=CloudStorageType.S3
status:     FAIL
run time:   1 minute 5.745 seconds

    <BadLogLines nodes=docker-rp-12(1) example="ERROR 2023-02-16 16:17:32,171 [shard 1] archival - [fiber19 kafka/panda-topic/5] - ntp_archiver_service.cc:188 - upload loop error: seastar::broken_semaphore (Semaphore broken)">
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 67, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/root/tests/rptest/services/redpanda.py", line 1770, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-12(1) example="ERROR 2023-02-16 16:17:32,171 [shard 1] archival - [fiber19 kafka/panda-topic/5] - ntp_archiver_service.cc:188 - upload loop error: seastar::broken_semaphore (Semaphore broken)">

----------------------------------------------------------------------------------------------------
test_id:    rptest.tests.multi_restarts_with_archival_test.MultiRestartTest.test_recovery_after_multiple_restarts.cloud_storage_type=CloudStorageType.ABS
status:     FAIL
run time:   3 minutes 48.806 seconds

    <BadLogLines nodes=docker-rp-1(2) example="ERROR 2023-02-16 16:31:12,830 [shard 1] archival - [fiber17 kafka/topic-trtsfnplbn/9] - ntp_archiver_service.cc:188 - upload loop error: seastar::broken_semaphore (Semaphore broken)">
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 67, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/root/tests/rptest/services/redpanda.py", line 1770, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-1(2) example="ERROR 2023-02-16 16:31:12,830 [shard 1] archival - [fiber17 kafka/topic-trtsfnplbn/9] - ntp_archiver_service.cc:188 - upload loop error: seastar::broken_semaphore (Semaphore broken)">
andrwng commented 1 year ago

Looks like these might be related to the graceful leadership transfer work that landed recently:

DEBUG 2023-02-16 16:31:12,830 [shard 1] archival - upload_housekeeping_service.cc:210 - deregistered job, current backlog 0, num completed jobs 0, num running jobs 0
DEBUG 2023-02-16 16:31:12,830 [shard 0] cluster - controller_backend.cc:863 - [{kafka/topic-trtsfnplbn/1}] (retry 0) finished operation: {type: update_properties, revision: 20, assignment: { id: 1, group_id: 2, replicas: {{node_id: 3, shard: 1}, {node_id: 2, shard: 1}, {node_id: 1, shard: 1}} }, previous assignment: {nullopt}}
DEBUG 2023-02-16 16:31:12,830 [shard 1] cluster - controller_backend.cc:863 - [{kafka/topic-trtsfnplbn/2}] (retry 0) finished operation: {type: update_properties, revision: 20, assignment: { id: 2, group_id: 3, replicas: {{node_id: 1, shard: 0}, {node_id: 3, shard: 0}, {node_id: 2, shard: 0}} }, previous assignment: {nullopt}}
ERROR 2023-02-16 16:31:12,830 [shard 1] archival - [fiber17 kafka/topic-trtsfnplbn/9] - ntp_archiver_service.cc:188 - upload loop error: seastar::broken_semaphore (Semaphore broken)
DEBUG 2023-02-16 16:31:12,830 [shard 0] cluster - controller_backend.cc:863 - [{kafka/topic-trtsfnplbn/0}] (retry 0) finished operation: {type: update_properties, revision: 20, assignment: { id: 0, group_id: 1, replicas: {{node_id: 1, shard: 1}, {node_id: 2, shard: 1}, {node_id: 3, shard: 1}} }, previous assignment: {nullopt}}
TRACE 2023-02-16 16:31:12,830 [shard 1] archival - [fiber14 kafka/topic-trtsfnplbn/3] - ntp_archiver_service.cc:159 - upload loop shutting down
DEBUG 2023-02-16 16:31:12,830 [shard 1] cluster - controller_backend.cc:863 - [{kafka/topic-trtsfnplbn/8}] (retry 0) finished operation: {type: update_properties, revision: 20, assignment: { id: 8, group_id: 9, replicas: {{node_id: 3, shard: 0}, {node_id: 2, shard: 0}, {node_id: 1, shard: 0}} }, previous assignment: {nullopt}}
TRACE 2023-02-16 16:31:12,831 [shard 1] archival - [fiber15 kafka/topic-trtsfnplbn/1] - ntp_archiver_service.cc:159 - upload loop shutting down
DEBUG 2023-02-16 16:31:12,831 [shard 0] archival - ntp_archiver_service.cc:113 - created ntp_archiver {kafka/topic-trtsfnplbn/4} in term 1
DEBUG 2023-02-16 16:31:12,832 [shard 1] archival - ntp_archiver_service.cc:113 - created ntp_archiver {kafka/topic-trtsfnplbn/3} in term 1
DEBUG 2023-02-16 16:31:12,833 [shard 0] archival - [fiber14 kafka/topic-trtsfnplbn/4] - ntp_archiver_service.cc:151 - upload loop waiting for leadership/unpause
DEBUG 2023-02-16 16:31:12,833 [shard 0] cluster - controller_backend.cc:863 - [{kafka/topic-trtsfnplbn/3}] (retry 0) finished operation: {type: update_properties, revision: 20, assignment: { id: 3, group_id: 4, replicas: {{node_id: 2, shard: 1}, {node_id: 1, shard: 1}, {node_id: 3, shard: 1}} }, previous assignment: {nullopt}}
DEBUG 2023-02-16 16:31:12,834 [shard 1] archival - [fiber20 kafka/topic-trtsfnplbn/3] - ntp_archiver_service.cc:151 - upload loop waiting for leadership/unpause
DEBUG 2023-02-16 16:31:12,834 [shard 0] archival - ntp_archiver_service.cc:113 - created ntp_archiver {kafka/topic-trtsfnplbn/6} in term 0
DEBUG 2023-02-16 16:31:12,835 [shard 1] archival - ntp_archiver_service.cc:113 - created ntp_archiver {kafka/topic-trtsfnplbn/1} in term 1
DEBUG 2023-02-16 16:31:12,836 [shard 0] archival - [fiber15 kafka/topic-trtsfnplbn/6] - ntp_archiver_service.cc:151 - upload loop waiting for leadership/unpause
ERROR 2023-02-16 16:31:12,836 [shard 0] archival - [fiber13 kafka/topic-trtsfnplbn/2] - ntp_archiver_service.cc:188 - upload loop error: seastar::broken_semaphore (Semaphore broken)
DEBUG 2023-02-16 16:31:12,837 [shard 1] archival - [fiber21 kafka/topic-trtsfnplbn/1] - ntp_archiver_service.cc:151 - upload loop waiting for leadership/unpause
DEBUG 2023-02-16 16:31:12,837 [shard 0] archival - ntp_archiver_service.cc:113 - created ntp_archiver {kafka/topic-trtsfnplbn/8} in term 1