aleksbykov opened 1 month ago
@aleksbykov isn't this the same as https://github.com/scylladb/scylladb/issues/20754? You also reported a connection close there.
@kbr-scylla, the case looks very similar, but the error message is different:
raft_topology - topology change coordinator fiber got error std::runtime_error (raft topology: exec_global_command(barrier) failed with seastar::rpc::closed_error (connection is closed))
Seen this happening on a master run this week:
Base Scylla version: 6.2.0~rc1-20240919.a71d4bc49cc8
with build-id b4036257ffcab230cd320b1b62fa05de35460c13
Target Scylla version (or git commit hash): 6.3.0~dev-20241004.882a3c60e4a5
with build-id 18d05b9776a41807ef6d1e3080c8ebb1a2257831
Kernel Version: 6.8.0-1016-aws
Cluster size: 4 nodes (im4gn.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0629b30bb6e5459a6 (aws: undefined_region)
Test: rolling-upgrade-ami-arm-test
Test id: c36c4f3f-74f5-464d-af80-385bea38caa4
Test name: scylla-master/rolling-upgrade/rolling-upgrade-ami-arm-test
Test method: upgrade_test.UpgradeTest.test_rolling_upgrade
Test config file(s):
After the rolling upgrade completed, the node saw that every node in the cluster supports a new feature, and it tried to execute a global barrier before marking the feature as enabled. Unfortunately, this raced with the gossiper, which didn't yet see the node as UP/restarted. When the gossiper sees a node as restarted, it resets its connections, which caused the barrier to fail. The coordinator then retried and succeeded. A minimal simulation of the race is sketched below.
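As a rough illustration only, here is a standalone simulation of the failing sequence. `rpc_connection` and the gossiper thread are purely hypothetical stand-ins, not ScyllaDB's actual types:

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <stdexcept>
#include <thread>

// Hypothetical stand-in for an RPC connection; not ScyllaDB's real type.
struct rpc_connection {
    std::atomic<bool> open{true};

    // Simulates a slow in-flight barrier RPC that fails if the connection
    // is reset while it is still running.
    void barrier() {
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
        if (!open.load()) {
            throw std::runtime_error("connection is closed"); // ~ rpc::closed_error
        }
    }
};

int main() {
    rpc_connection conn;

    // "Gossiper": notices the peer restarted and resets its connections
    // while the coordinator's barrier RPC is still in flight.
    std::thread gossiper([&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        conn.open = false;
    });

    try {
        conn.barrier(); // coordinator: exec_global_command(barrier)
    } catch (const std::exception& e) {
        std::cout << "barrier failed: " << e.what() << '\n';
    }
    gossiper.join(); // on retry, a reconnected channel would succeed
}
```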
Similar to https://github.com/scylladb/scylladb/issues/20588#issuecomment-2363492593.
One way to mitigate this would be to modify `topology_coordinator::exec_global_command` to wait for nodes to be marked as UP before executing the global command (in this case the barrier), or to not report an ERROR when the barrier fails due to `rpc::closed_error`. A sketch of what that could look like follows.
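A minimal sketch of that mitigation, assuming hypothetical stand-ins for the gossiper's liveness view (`is_up`) and the barrier RPC (`exec_barrier`) rather than ScyllaDB's real interfaces:

```cpp
#include <chrono>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>
#include <thread>
#include <vector>

struct node { std::string id; };

// Stand-in for seastar::rpc::closed_error (assumption, not the real type).
struct closed_error : std::runtime_error {
    closed_error() : std::runtime_error("connection is closed") {}
};

// Hypothetical wrapper: wait until the gossiper marks every node UP, then
// run the global command; treat closed_error as transient (WARN + retry)
// instead of reporting an ERROR.
bool exec_global_command_with_wait(
        const std::vector<node>& nodes,
        const std::function<bool(const node&)>& is_up,  // gossiper's view
        const std::function<void()>& exec_barrier,      // the global command
        int max_retries = 5) {
    for (int attempt = 1; attempt <= max_retries; ++attempt) {
        // 1. Don't race the gossiper: wait for every node to be seen as UP,
        //    so a restart-triggered connection reset can't kill the RPC.
        for (const auto& n : nodes) {
            while (!is_up(n)) {
                std::this_thread::sleep_for(std::chrono::milliseconds(100));
            }
        }
        // 2. Execute the barrier; a closed connection is expected during
        //    rolling restarts, so log it as a warning and retry.
        try {
            exec_barrier();
            return true;
        } catch (const closed_error& e) {
            std::cerr << "warn: barrier failed (" << e.what()
                      << "), attempt " << attempt << " of " << max_retries << '\n';
        }
    }
    return false;
}
```

The key design choice here is treating a closed connection as an expected, transient condition during rolling restarts: first wait for the gossiper's UP view, and downgrade the log severity if the barrier still fails and is retried.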
Packages
Base Scylla version: 6.1.2-20240915.b60f9ef4c223
with build-id c713ac9e819492d7560aa3ad461c43cf404c977b
Target Scylla version (or git commit hash): 6.2.0~rc2-20241002.93700ff5d1ce
with build-id 57e6a907e0a03a185c4145bb8c7d6316b519b14c
Kernel Version: 6.8.0-1015-aws
Issue description
During the rolling upgrade, when the last node was started after being upgraded, the topology coordinator node printed an error message. The error appeared on the topology coordinator at the moment the last node was upgraded and moved to the normal state, but before Scylla reported that it was ready.
Impact
No impact on the user except an annoying error message, which could trigger an alert. The upgrade finished without any other errors.
How frequently does it reproduce?
Installation details
Cluster size: 4 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0e6d50d10ca1e8aeb (aws: undefined_region)
Test: rolling-upgrade-ami-test
Test id: 4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991
Test name: scylla-6.2/rolling-upgrade/rolling-upgrade-ami-test
Test method: upgrade_test.UpgradeTest.test_rolling_upgrade
Test config file(s):
Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor 4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991)
- Show all stored logs command: `$ hydra investigate show-logs 4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991`

Logs:
- **db-cluster-4c08c3db.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991/20241003_001102/db-cluster-4c08c3db.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991/20241003_001102/db-cluster-4c08c3db.tar.gz)
- **sct-runner-events-4c08c3db.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991/20241003_001102/sct-runner-events-4c08c3db.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991/20241003_001102/sct-runner-events-4c08c3db.tar.gz)
- **sct-4c08c3db.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991/20241003_001102/sct-4c08c3db.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991/20241003_001102/sct-4c08c3db.log.tar.gz)
- **loader-set-4c08c3db.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991/20241003_001102/loader-set-4c08c3db.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991/20241003_001102/loader-set-4c08c3db.tar.gz)
- **monitor-set-4c08c3db.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991/20241003_001102/monitor-set-4c08c3db.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991/20241003_001102/monitor-set-4c08c3db.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-6.2/job/rolling-upgrade/job/rolling-upgrade-ami-test/4/)
[Argus](https://argus.scylladb.com/test/86b69cab-d2c8-43f3-87ce-8e23bbe75848/runs?additionalRuns[]=4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991)