redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

CI Failure (decommissioning stopped making progress) in `NodesDecommissioningTest.test_flipping_decommission_recommission` #8301

Closed r-vasquez closed 1 year ago

r-vasquez commented 1 year ago

https://buildkite.com/redpanda/redpanda/builds/21437#0185c5bc-b8ec-48bc-a229-dfd05b5d6bd6/6-2504

Module: rptest.tests.nodes_decommissioning_test
Class:  NodesDecommissioningTest
Method: test_flipping_decommission_recommission
test_id:    rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_flipping_decommission_recommission
status:     FAIL
run time:   2 minutes 55.673 seconds

    AssertionError('Node 4 decommissioning stopped making progress')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/nodes_decommissioning_test.py", line 560, in test_flipping_decommission_recommission
    self._wait_for_node_removed(node_id)
  File "/root/tests/rptest/tests/nodes_decommissioning_test.py", line 149, in _wait_for_node_removed
    waiter.wait_for_removal()
  File "/root/tests/rptest/utils/node_operations.py", line 158, in wait_for_removal
    assert self._made_progress(
AssertionError: Node 4 decommissioning stopped making progress
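
For readers unfamiliar with this kind of check, the assertion above fires when a progress check reports a stall while the test waits for the node to be removed. The following is only a minimal sketch of that pattern, not the actual code in `rptest/utils/node_operations.py`; `ProgressWatchdog`, `read_progress`, `stall_timeout_s`, and the injected `clock` and `sleep` are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Optional


class ProgressWatchdog:
    """Hypothetical helper: reports a stall when an observed value stops changing."""

    def __init__(self,
                 read_progress: Callable[[], int],
                 stall_timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._read = read_progress
        self._timeout = stall_timeout_s
        self._clock = clock
        self._last_value: Optional[int] = None
        self._last_change = clock()

    def made_progress(self) -> bool:
        """Return False only if the value has been unchanged for stall_timeout_s."""
        value = self._read()
        now = self._clock()
        if value != self._last_value:
            # The observed value moved, so restart the stall timer.
            self._last_value = value
            self._last_change = now
            return True
        return (now - self._last_change) < self._timeout


def wait_for_removal(is_removed: Callable[[], bool],
                     watchdog: ProgressWatchdog,
                     node_id: int,
                     poll_interval_s: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll until the node is gone, failing fast if progress stalls."""
    while not is_removed():
        assert watchdog.made_progress(), \
            f"Node {node_id} decommissioning stopped making progress"
        sleep(poll_interval_s)
```

In a sketch like this, `read_progress` could return something such as the number of partition replicas still hosted on the node (an assumption), so the assertion fails when that count stops changing for longer than the timeout.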

jcsp commented 1 year ago

Same symptom here: https://buildkite.com/redpanda/redpanda/builds/21579#0185cee4-69ad-44ef-8949-59b28c14b18d

rystsov commented 1 year ago

https://buildkite.com/redpanda/redpanda/builds/21719#0185e14a-0846-43ea-bd4b-98deea9dcaaf

FAIL test: NodesDecommissioningTest.test_flipping_decommission_recommission.node_is_alive=False (2/8 runs)
  failure at 2023-01-24T02:11:27.574Z: AssertionError('Node 1 decommissioning stopped making progress')

mmaslankaprv commented 1 year ago

This is a real issue that can occur in a boundary condition when one wants to recommission a node that is offline.

r-vasquez commented 1 year ago

Got this (here: https://buildkite.com/redpanda/redpanda/builds/21716#0185e0d8-eb06-4c6e-ba2a-6f3fb6d8821b/6-2207) in the same test

8339

RpkException('command /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-0175d7750481fef5a-1/redpanda/redpanda/vbuild/redpanda_installs/ci/bin/rpk --api-urls docker-rp-6:9644,docker-rp-22:9644,docker-rp-20:9644 cluster config set raft_learner_recovery_rate 1 returned 1, output: ', 'error setting property: request PUT http://docker-rp-6:9644/v1/cluster_config failed: Service Unavailable, body: "{\\"message\\": \\"Leader not available\\", \\"code\\": 503}"\n\n')
--

The "{\\"message\\": \\"Leader not available\\", \\"code\\": 503}" comes from the Admin API, do you think it's related or should I open a new Issue?
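
If the 503 turns out to be transient (for example, while the controller leader is being re-elected, which is an assumption and not something established in this thread), a test helper could retry the request with backoff. This is a rough sketch only: `retry_on_unavailable` and its parameters are hypothetical, and `request` stands in for whatever issues the config change and returns a `(status, body)` pair. It is not how `rpk` or the Admin API client behaves.

```python
import time
from typing import Callable, Tuple


def retry_on_unavailable(request: Callable[[], Tuple[int, str]],
                         attempts: int = 5,
                         backoff_s: float = 1.0,
                         sleep: Callable[[float], None] = time.sleep
                         ) -> Tuple[int, str]:
    """Call request() until it returns a status other than 503 or attempts run out.

    Returns the last (status, body) pair seen, so the caller still decides
    how to treat a final 503.
    """
    delay = backoff_s
    status, body = request()
    for _ in range(attempts - 1):
        if status != 503:
            break
        sleep(delay)
        delay *= 2  # exponential backoff between retries
        status, body = request()
    return status, body
```

Returning the final response instead of raising keeps the sketch neutral about whether a persistent 503 should fail the test.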

rystsov commented 1 year ago

@r-vasquez it has a different stack trace, so it should be a different issue

rystsov commented 1 year ago

Probably old bits https://buildkite.com/redpanda/redpanda/builds/21837#0185e9e5-f853-46c8-a592-9b7a76d68586

FAIL test: NodesDecommissioningTest.test_flipping_decommission_recommission.node_is_alive=False (1/39 runs)
  failure at 2023-01-25T18:18:07.160Z: AssertionError('Node 1 decommissioning stopped making progress')
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/21837#0185e9e5-f853-46c8-a592-9b7a76d68586