redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

CI Failure (decommissioning stopped making progress) in `NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool` #19933

Open mmaslankaprv opened 2 months ago

mmaslankaprv commented 2 months ago

https://buildkite.com/redpanda/redpanda/builds/50454#01903098-f01c-43b5-9802-57743666b009

Module: rptest.tests.node_pool_migration_test
Class:  NodePoolMigrationTest
Method: test_migrating_redpanda_nodes_to_new_pool
Arguments:
{
  "balancing_mode": "off",
  "cleanup_policy": "compact",
  "test_mode": "no_tiered_storage"
}
test_id:    rptest.tests.node_pool_migration_test.NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool.balancing_mode=off.test_mode=TestMode.NO_TIRED_STORAGE.cleanup_policy=compact
status:     FAIL
run time:   3 minutes 56.293 seconds

    AssertionError('Node 1 decommissioning stopped making progress')
Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 276, in run_test
    return self.test_context.function(self.test)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 535, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 105, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/node_pool_migration_test.py", line 367, in test_migrating_redpanda_nodes_to_new_pool
    self._wait_for_nodes_removed(decommissioned_ids)
  File "/root/tests/rptest/tests/node_pool_migration_test.py", line 150, in _wait_for_nodes_removed
    return [r for r in result]
  File "/root/tests/rptest/tests/node_pool_migration_test.py", line 150, in <listcomp>
    return [r for r in result]
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/root/tests/rptest/tests/node_pool_migration_test.py", line 147, in <lambda>
    lambda id: self._wait_for_node_removed(id, decommissioned_ids),
  File "/root/tests/rptest/tests/node_pool_migration_test.py", line 139, in _wait_for_node_removed
    waiter.wait_for_removal()
  File "/root/tests/rptest/utils/node_operations.py", line 205, in wait_for_removal
    assert self._made_progress(
AssertionError: Node 1 decommissioning stopped making progress

JIRA Link: CORE-4258
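
For context, the assertion comes from a "made progress" check in rptest/utils/node_operations.py. Below is a minimal sketch of that style of check, assuming it polls a hypothetical Admin API for the number of replicas still hosted on the draining node and requires the count to keep dropping within a stall window. All names and parameters here are illustrative, not the real helper's interface.

```python
import time


class DecommissionWaiter:
    """Illustrative sketch only; the real helper in
    rptest/utils/node_operations.py may track progress differently."""

    def __init__(self, admin, node_id, timeout_sec=240, stall_timeout_sec=60):
        self.admin = admin                    # hypothetical Admin API client
        self.node_id = node_id
        self.timeout_sec = timeout_sec
        self.stall_timeout_sec = stall_timeout_sec

    def _replicas_left(self):
        # Hypothetical call: replicas still hosted on the draining node.
        return self.admin.partitions_on_node(self.node_id)

    def _made_progress(self, last_change_at):
        # "Progress" here means the replica count dropped recently enough.
        return time.time() - last_change_at < self.stall_timeout_sec

    def wait_for_removal(self):
        deadline = time.time() + self.timeout_sec
        previous = self._replicas_left()
        last_change_at = time.time()
        while time.time() < deadline:
            time.sleep(5)
            current = self._replicas_left()
            if current == 0:
                return  # fully drained; the broker can now be removed
            if current < previous:
                previous, last_change_at = current, time.time()
            assert self._made_progress(last_change_at), \
                f"Node {self.node_id} decommissioning stopped making progress"
        raise TimeoutError(f"node {self.node_id} still hosts {previous} replicas")
```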

mmaslankaprv commented 2 months ago

The problem that causes node 1 decommissioning to stall is a stuck recovery of the kafka/migration-test-workload/0 partition.

recovery_stm requests a read:

TRACE 2024-06-19 13:11:34,923 [shard 1:raft] raft - [follower: {id: 6, revision: 170}] [group_id:64, {kafka/migration-test-workload/0}] - recovery_stm.cc:261 - Reading batches, starting from: 6973

The read returns no batches, and after more than two minutes of waiting the recovery is stopped:

TRACE 2024-06-19 13:13:52,671 [shard 1:raft] raft - [follower: {id: 6, revision: 170}] [group_id:64, {kafka/migration-test-workload/0}] - recovery_stm.cc:270 - Read no batches for recovery, stopping
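
These two trace lines bracket the follower-recovery read path: it waits for batches to become readable and, if nothing shows up before a deadline, it gives up rather than blocking forever. A minimal sketch of that pattern, assuming a poll-until-deadline read; the reader interface here is hypothetical, not recovery_stm's actual API:

```python
import time


def read_batches_for_recovery(log_reader, start_offset, deadline_sec=120.0,
                              poll_interval_sec=1.0):
    """Poll the log for batches at/after start_offset until a deadline.

    Returns whatever was readable; an empty result tells the caller (the
    recovery state machine in this sketch) to stop recovery and retry later.
    """
    deadline = time.monotonic() + deadline_sec
    while time.monotonic() < deadline:
        batches = log_reader.read_from(start_offset)  # hypothetical reader call
        if batches:
            return batches
        time.sleep(poll_interval_sec)
    return []  # "Read no batches for recovery, stopping"
```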

The recovery continues after compaction finishes:

DEBUG 2024-06-19 13:13:52,671 [shard 1:lc  ] storage-gc - disk_log_impl.cc:790 - [{kafka/migration-test-workload/0}] Final compacted segment {offset_tracker:{term:1, base_offset:7192, committed_offset:7447, dirty_offset:7447}, compacted_segment=1, finished_self_compaction=1, finished_windowed_compaction=1, generation=89, reader={/var/lib/redpanda/data/kafka/migration-test-workload/0_41/7192-1-v1.log, (92511 bytes)}, writer=nullptr, cache=nullptr, compaction_index:nullopt, closed=0, tombstone=0, index={file:/var/lib/redpanda/data/kafka/migration-test-workload/0_41/7192-1-v1.base_index, offsets:0, index:{header_bitflags:0, base_offset:0, max_offset:7447, base_timestamp:{timestamp: 1718802677049}, max_timestamp:{timestamp: 1718802677196}, batch_timestamps_are_monotonic:1, with_offset:true, non_data_timestamps:0, broker_timestamp:{{timestamp: 1718802677191}}, num_compactible_records_appended:{22}, index(3,3,3)}, step:32768, needs_persistence:0}}
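
Note that the "stopping" trace and the "Final compacted segment" message share the same timestamp (13:13:52,671), which fits the read having been parked behind compaction of that segment and becoming serviceable only once compaction released it, after the recovery deadline had already expired. A purely illustrative sketch of that ordering, using a plain threading lock as a stand-in for whatever synchronization the storage layer actually uses:

```python
import threading
import time

segment_lock = threading.Lock()    # stand-in for whatever compaction holds


def compaction(duration_sec):
    with segment_lock:
        time.sleep(duration_sec)   # segment unreadable to this reader while held
    print("compaction finished, segment readable again")


def recovery_read(deadline_sec):
    # Timed wait mirrors the recovery read deadline from the sketch above.
    if segment_lock.acquire(timeout=deadline_sec):
        segment_lock.release()
        print("read batches, recovery continues")
    else:
        print("read no batches for recovery, stopping")


if __name__ == "__main__":
    threading.Thread(target=compaction, args=(3.0,)).start()
    time.sleep(0.1)                   # let the compaction thread take the lock
    recovery_read(deadline_sec=1.0)   # deadline expires before compaction ends
    time.sleep(3.0)
    recovery_read(deadline_sec=1.0)   # after compaction, the read succeeds
```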
travisdowns commented 2 months ago

ci-ignore because https://redpandadata.slack.com/archives/C06MCRY75U7/p1718996592057029