Closed aleksbykov closed 7 months ago
@gleb-cloudius , can you take a look?
Commands that are issued through raft need a quorum of nodes to be accessible. This test isolates node3 from the cluster and issues decommission through raft, so it hangs waiting for the quorum. It needs to wait with a timeout, but this is a known problem. Not sure if we have an issue for it though, @kbr-scylla? As it is, the behaviour is expected.
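To illustrate the quorum point above, here is a minimal sketch of the majority arithmetic, assuming all three nodes in the test cluster are raft voters. The helper names `quorum_size` and `has_quorum` are hypothetical and are not part of Scylla or dtest.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return cluster_size // 2 + 1


def has_quorum(reachable_voters: int, cluster_size: int) -> bool:
    """True if the reachable voters (including the local node) form a majority."""
    return reachable_voters >= quorum_size(cluster_size)


# Assumption: a 3-node cluster where every node is a voter.
# The isolated node3 can only count itself, so it has no quorum,
# while node1 and node2 together still do.
node3_view = has_quorum(reachable_voters=1, cluster_size=3)
rest_of_cluster_view = has_quorum(reachable_voters=2, cluster_size=3)
```

Under that assumption, a raft operation started on node3 cannot complete until connectivity is restored or the operation times out.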
REMINDER: remove the test disabling related to this issue from https://github.com/scylladb/scylla-dtest/pull/3836 when the issue is closed
Should be solved by @gusev-p's quorum loss timeout patches
@temichus @aleksbykov now in raft-topology mode, the decommission attempt times out due to quorum loss. But the test fails because it expects a different error.
assert ('Rejected decommission operation' in "ToolError('Subprocess /jenkins/workspace/scylla-staging/artsiom_mishuta/dtest_raft_topolgy/dtest-full-with-consistent-topology-changes/scylla/.ccm/scylla-repository/6bd0be71ab32ea535d332b0324d81892973611c1/share/cassandra/bin/nodetool -h 127.0.36.3 -p 7199 -Dcom.sun.jndi.rmiURLParsing=legacy -Dcom.scylladb.apiPort=10000 decommission exited with non-zero status; exit status: 4; \\nstderr: error executing POST request to http://127.0.36.3:10000/storage_service/decommission with parameters {}: remote replied with status code 500 Internal Server Error:\\nservice::raft_operation_timeout_error (group [b34c4e91-ed16-11ee-ade9-f7052ddb7928] raft operation [read_barrier] timed out)\\n')" or 'Cannot start' in "ToolError('Subprocess /jenkins/workspace/scylla-staging/artsiom_mishuta/dtest_raft_topolgy/dtest-full-with-consistent-topology-changes/scylla/.ccm/scylla-repository/6bd0be71ab32ea535d332b0324d81892973611c1/share/cassandra/bin/nodetool -h 127.0.36.3 -p 7199 -Dcom.sun.jndi.rmiURLParsing=legacy -Dcom.scylladb.apiPort=10000 decommission exited with non-zero status; exit status: 4; \\nstderr: error executing POST request to http://127.0.36.3:10000/storage_service/decommission with parameters {}: remote replied with status code 500 Internal Server Error:\\nservice::raft_operation_timeout_error (group [b34c4e91-ed16-11ee-ade9-f7052ddb7928] raft operation [read_barrier] timed out)\\n')")
We can now adjust the test so it also passes when it sees raft_operation_timeout_error in the error message.
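One possible shape for that adjustment, as a rough sketch only: the two existing substrings come from the failing assertion above, and `raft_operation_timeout_error` is added as a third accepted one. The names `ACCEPTED_ERRORS` and `is_expected_decommission_failure` are hypothetical and do not come from the dtest code.

```python
# Substrings that count as an acceptable decommission failure.
# The first two are taken from the existing assertion; the third is the
# error now returned in raft-topology mode when quorum is lost.
ACCEPTED_ERRORS = (
    "Rejected decommission operation",
    "Cannot start",
    "raft_operation_timeout_error",
)


def is_expected_decommission_failure(error_text: str,
                                     accepted=ACCEPTED_ERRORS) -> bool:
    """Return True if the error text contains any accepted substring."""
    return any(fragment in error_text for fragment in accepted)
```

In the test, the existing `assert ... or ...` chain could then become something like `assert is_expected_decommission_failure(str(exc))`, where `exc` stands for the caught `ToolError`.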
Reassigning to you.
Installation details
Scylla version (or git commit hash): 5.5.0~dev-0.20231224.da033343b793 with build-id 2917ab851e038771115f8fbee25d1a614e5ff8f7
Cluster size: 3
dtest: update_cluster_layout_tests.py::TestUpdateClusterLayout::test_decommission_node_while_gossip_partly_blocked failed with a timeout: after iptables rules blocked port 7000 for node3 and node3 started decommission, the decommission process got stuck on
the nodes status:
Without raft topology enabled, the test passed.
Nodes logs: logs.tar.gz