scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

Error message on topology coordinator 'raft_topology - topology change coordinator fiber got error std::runtime_error (connection close)' when latest node upgraded to target version #20950

Open aleksbykov opened 1 month ago

aleksbykov commented 1 month ago

Packages

Base Scylla version: 6.1.2-20240915.b60f9ef4c223 with build-id c713ac9e819492d7560aa3ad461c43cf404c977b
Target Scylla version (or git commit hash): 6.2.0~rc2-20241002.93700ff5d1ce with build-id 57e6a907e0a03a185c4145bb8c7d6316b519b14c

Kernel Version: 6.8.0-1015-aws

Issue description

During the rolling upgrade, when the last node was restarted after being upgraded, the topology coordinator node printed this error message:

2024-10-02T23:41:49.557+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-1     !INFO | scylla[16912]:  [shard 0:main] raft_group_registry - marking Raft server be7c00bc-8e5d-408c-b35f-f44a3af79fe8 as alive for raft groups
2024-10-02T23:41:50.010+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-1     !INFO | scylla[16912]:  [shard 0:comp] compaction - [Compact system.topology e4f950b0-8117-11ef-ae48-ff317288c316] Compacting [/var/lib/scylla/data/system/topology-5be1feb3929e3df98da8d771c36129a0/me-3gk1_1ttp_4fw682newov52olixi-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/system/topology-5be1feb3929e3df98da8d771c36129a0/me-3gk1_1t0g_4r1eo2newov52olixi-big-Data.db:level=0:origin=compaction]
2024-10-02T23:41:50.010+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-1     !INFO | scylla[16912]:  [shard 0:comp] sstable - Rebuilding bloom filter /var/lib/scylla/data/system/topology-5be1feb3929e3df98da8d771c36129a0/me-3gk1_1ttp_4i1c02newov52olixi-big-Filter.db: resizing bitset from 328 bytes to 8 bytes. sstable origin: compaction
2024-10-02T23:41:50.010+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-1     !INFO | scylla[16912]:  [shard 0: gms] raft_topology - executing global topology command barrier, excluded nodes: {3052e263-5fbe-4f43-ad1b-0e26eaab819a}
2024-10-02T23:41:50.010+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-1     !INFO | scylla[16912]:  [shard 0:comp] compaction - [Compact system.topology e4f950b0-8117-11ef-ae48-ff317288c316] Compacted 2 sstables to [/var/lib/scylla/data/system/topology-5be1feb3929e3df98da8d771c36129a0/me-3gk1_1ttp_4i1c02newov52olixi-big-Data.db:level=0]. 42kB to 33kB (~79% of original) in 38ms = 1MB/s. ~256 total partitions merged to 1.
2024-10-02T23:41:50.010+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-1     !INFO | scylla[16912]:  [shard 0: gms] gossip - Node 10.4.3.186 has restarted, now UP, status = NORMAL
2024-10-02T23:41:50.010+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-1  !WARNING | scylla[16912]:  [shard 0: gms] gossip - Fail to send EchoMessage to 10.4.3.186: seastar::rpc::closed_error (connection is closed)
2024-10-02T23:41:50.010+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-1      !ERR | scylla[16912]:  [shard 0: gms] raft_topology - topology change coordinator fiber got error std::runtime_error (raft topology: exec_global_command(barrier) failed with seastar::rpc::closed_error (connection is closed))
2024-10-02T23:41:51.012+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-1     !INFO | scylla[16912]:  [shard 0: gms] gossip - InetAddress be7c00bc-8e5d-408c-b35f-f44a3af79fe8/10.4.3.186 is now UP, status = NORMAL
2024-10-02T23:41:51.012+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-1     !INFO | scylla[16912]:  [shard 0: gms] raft_topology - executing global topology command barrier, excluded nodes: {3052e263-5fbe-4f43-ad1b-0e26eaab819a}
2024-10-02T23:41:51.307+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-1     !INFO | scylla[16912]:  [shard 0: gms] raft_topology - updating topology state: enabling features: {"FRAGMENTED_COMMITLOG_ENTRIES", "MAINTENANCE_TENANT", "NATIVE_REVERSE_QUERIES", "TOPOLOGY_REQUESTS_TYPE_COLUMN", "VIEW_BUILD_STATUS_ON_GROUP0", "ZERO_TOKEN_NODES"}
2024-10-02T23:41:51.307+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-1     !INFO | scylla[16912]:  [shard 0:comp] compaction - [Compact system.topology e5c35770-8117-11ef-ae48-ff317288c316] Compacting [/var/lib/scylla/data/system/topology-5be1feb3929e3df98da8d771c36129a0/me-3gk1_1ttr_07acg2newov52olixi-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/system/topology-5be1feb3929e3df98da8d771c36129a0/me-3gk1_1ttp_4i1c02newov52olixi-big-Data.db:level=0:origin=compaction]

The error appeared on the topology coordinator at the moment the last node finished upgrading and moved to the NORMAL state:

2024-10-02T23:41:49.866+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0: gms] gossip - Node 10.4.2.195 has restarted, now UP, status = NORMAL
2024-10-02T23:41:49.866+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0: gms] gossip - Node 10.4.0.222 has restarted, now UP, status = NORMAL
2024-10-02T23:41:49.866+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0: gms] gossip - Node 10.4.1.145 has restarted, now UP, status = NORMAL
2024-10-02T23:41:50.142+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0: gms] gossip - InetAddress 3052e263-5fbe-4f43-ad1b-0e26eaab819a/10.4.2.195 is now UP, status = NORMAL
2024-10-02T23:41:50.142+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0: gms] gossip - InetAddress 5a34f43f-e648-4baf-b06e-67d529afd29c/10.4.0.222 is now UP, status = NORMAL
2024-10-02T23:41:50.142+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:comp] compaction - [Compact system.topology e5103410-8117-11ef-b104-464b0904877d] Compacting [/var/lib/scylla/data/system/topology-5be1feb3929e3df98da8d771c36129a0/me-3gk1_1ttp_4w6lc2owqqzjvymyrh-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/system/topology-5be1feb3929e3df98da8d771c36129a0/me-3gk1_1tg0_4dr0h2p9rz8a3el6ol-big-Data.db:level=0:origin=compaction]
2024-10-02T23:41:50.142+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0: gms] gossip - InetAddress a44155f9-c4fb-45a6-b00c-e0e70542f508/10.4.1.145 is now UP, status = NORMAL
2024-10-02T23:41:50.142+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:comp] compaction - [Compact ks1.table1_col7_idx_index e44a24f0-8117-11ef-b104-464b0904877d] Compacted 2 sstables to [/var/lib/scylla/data/ks1/table1_col7_idx_index-229953d1810811efa9249bca9b14cabd/me-3gk1_1tto_3mj2o2owqqzjvymyrh-big-Data.db:level=0]. 10MB to 10MB (~99% of original) in 1285ms = 8MB/s. ~25984 total partitions merged to 25596.
2024-10-02T23:41:50.142+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:comp] compaction - [Compact ks1.table1_col7_idx_index e514c7f0-8117-11ef-b104-464b0904877d] Compacting [/var/lib/scylla/data/ks1/table1_col7_idx_index-229953d1810811efa9249bca9b14cabd/me-3gk1_1ts3_0td3k2p9rz8a3el6ol-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/ks1/table1_col7_idx_index-229953d1810811efa9249bca9b14cabd/me-3gk1_1tqe_2d39t2p9rz8a3el6ol-big-Data.db:level=0:origin=compaction]
2024-10-02T23:41:50.142+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:comp] sstable - Rebuilding bloom filter /var/lib/scylla/data/system/topology-5be1feb3929e3df98da8d771c36129a0/me-3gk1_1ttp_5em682owqqzjvymyrh-big-Filter.db: resizing bitset from 328 bytes to 8 bytes. sstable origin: compaction
2024-10-02T23:41:50.143+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:comp] compaction - [Compact system.topology e5103410-8117-11ef-b104-464b0904877d] Compacted 2 sstables to [/var/lib/scylla/data/system/topology-5be1feb3929e3df98da8d771c36129a0/me-3gk1_1ttp_5em682owqqzjvymyrh-big-Data.db:level=0]. 42kB to 33kB (~79% of original) in 102ms = 415kB/s. ~256 total partitions merged to 1.
2024-10-02T23:41:50.143+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 6:comp] sstable - Rebuilding bloom filter /var/lib/scylla/data/ks1/table1_col6_idx_index-217b60b0810811efb1ad9425ad2d6f15/me-3gk1_1ttp_4nto128e0skcu5ww3h-big-Filter.db: resizing bitset from 7848 bytes to 5072 bytes. sstable origin: compaction
2024-10-02T23:41:50.143+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 6:comp] compaction - [Compact ks1.table1_col6_idx_index e4fd9670-8117-11ef-92d7-46450904877d] Compacted 2 sstables to [/var/lib/scylla/data/ks1/table1_col6_idx_index-217b60b0810811efb1ad9425ad2d6f15/me-3gk1_1ttp_4nto128e0skcu5ww3h-big-Data.db:level=0]. 9MB to 9MB (~100% of original) in 272ms = 35MB/s. ~6272 total partitions merged to 4050.
2024-10-02T23:41:50.642+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:strm] raft_topology - refreshing topology to check if it's synchronized with local metadata
2024-10-02T23:41:50.642+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:strm] storage_service - entering NORMAL mode
2024-10-02T23:41:50.642+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:strm] raft_group0 - finish_setup_after_join: group 0 ID present, loading server info.
2024-10-02T23:41:50.642+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:strm] raft_group0 - finish_setup_after_join: SUPPORTS_RAFT feature enabled. Starting internal upgrade-to-raft procedure.
2024-10-02T23:41:50.642+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:strm] raft_group0_upgrade - Already upgraded.
2024-10-02T23:41:50.642+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:strm] storage_service - Starting the tablet split monitor...
2024-10-02T23:41:50.893+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 3:comp] compaction - [Compact ks1.table1_col7_idx_index e4e5efc0-8117-11ef-abdb-46460904877d] Compacted 2 sstables to [/var/lib/scylla/data/ks1/table1_col7_idx_index-229953d1810811efa9249bca9b14cabd/me-3gk1_1ttp_3r1402m31ksovlmf7h-big-Data.db:level=0]. 10MB to 10MB (~99% of original) in 997ms = 10MB/s. ~25600 total partitions merged to 25203.
2024-10-02T23:41:50.893+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 3:comp] compaction - [Compact ks1.table1_col6_idx_index e580a920-8117-11ef-abdb-46460904877d] Compacting [/var/lib/scylla/data/ks1/table1_col6_idx_index-217b60b0810811efb1ad9425ad2d6f15/me-3gk1_1ts4_0pi7k2ph8tcumwr5zp-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/ks1/table1_col6_idx_index-217b60b0810811efb1ad9425ad2d6f15/me-3gk1_1tqf_19fsx2ph8tcumwr5zp-big-Data.db:level=0:origin=compaction]

However, this happened before Scylla reported that it was ready:

2024-10-02T23:41:50.894+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3      !ERR | scylla[17807]:  [shard 0:strm] rpc - client 10.4.1.145:63742: server connection dropped: sendmsg: Broken pipe
2024-10-02T23:41:51.171+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 3:comp] compaction - [Compact ks1.table1_mv_0 e44a4c01-8117-11ef-abdb-46460904877d] Compacted 2 sstables to [/var/lib/scylla/data/ks1/table1_mv_0-233a9ce0810811ef9aa83abf2b9bce36/me-3gk1_1tto_3mqsh2m31ksovlmf7h-big-Data.db:level=0]. 49MB to 49MB (~100% of original) in 2211ms = 22MB/s. ~28544 total partitions merged to 28438.
2024-10-02T23:41:51.171+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 3:comp] compaction - [Compact ks1.table1 e5ac7410-8117-11ef-abdb-46460904877d] Compacting [/var/lib/scylla/data/ks1/table1-20f98680810811ef92da09c917e125b0/me-3gk1_1ts3_39vts2ph8tcumwr5zp-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/ks1/table1-20f98680810811ef92da09c917e125b0/me-3gk1_1tq5_3sj4g2ph8tcumwr5zp-big-Data.db:level=0:origin=memtable]
2024-10-02T23:41:51.171+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:main] init - starting tracing
2024-10-02T23:41:51.171+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:main] init - SSTable data integrity checker is disabled.
2024-10-02T23:41:51.171+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:main] init - starting auth service
2024-10-02T23:41:51.171+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:main] init - starting batchlog manager
2024-10-02T23:41:51.172+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:main] init - starting load meter
2024-10-02T23:41:51.172+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:main] init - starting cf cache hit rate calculator
2024-10-02T23:41:51.172+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:main] init - starting view update backlog broker
2024-10-02T23:41:51.172+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:main] init - allow replaying hints
2024-10-02T23:41:51.172+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:main] init - Launching generate_mv_updates for non system tables
2024-10-02T23:41:51.172+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:main] migration_manager - Schema agreement check passed.
2024-10-02T23:41:51.172+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 6:main] migration_manager - Schema agreement check passed.
2024-10-02T23:41:51.172+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 1:main] migration_manager - Schema agreement check passed.
2024-10-02T23:41:51.172+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 3:main] migration_manager - Schema agreement check passed.
2024-10-02T23:41:51.172+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 4:main] migration_manager - Schema agreement check passed.
2024-10-02T23:41:51.173+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 2:main] migration_manager - Schema agreement check passed.
2024-10-02T23:41:51.173+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 5:main] migration_manager - Schema agreement check passed.
2024-10-02T23:41:51.173+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:stmt] cql_server_controller - Starting listening for CQL clients on 10.4.3.186:9042 (unencrypted, non-shard-aware)
2024-10-02T23:41:51.173+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:stmt] cql_server_controller - Starting listening for CQL clients on 10.4.3.186:19042 (unencrypted, shard-aware)
2024-10-02T23:41:51.173+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:main] init - serving
2024-10-02T23:41:51.173+00:00 rolling-upgrade--ubuntu-focal-db-node-4c08c3db-3     !INFO | scylla[17807]:  [shard 0:main] init - Scylla version 6.2.0~rc2-0.20241002.93700ff5d1ce initialization completed.

Impact

No impact on the user, except for an annoying error message that could trigger an alert. The upgrade finished without any other errors.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 4 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0e6d50d10ca1e8aeb (aws: undefined_region)

Test: rolling-upgrade-ami-test
Test id: 4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991
Test name: scylla-6.2/rolling-upgrade/rolling-upgrade-ami-test
Test method: upgrade_test.UpgradeTest.test_rolling_upgrade
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991)
- Show all stored logs command: `$ hydra investigate show-logs 4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991`

Logs:

- **db-cluster-4c08c3db.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991/20241003_001102/db-cluster-4c08c3db.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991/20241003_001102/db-cluster-4c08c3db.tar.gz)
- **sct-runner-events-4c08c3db.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991/20241003_001102/sct-runner-events-4c08c3db.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991/20241003_001102/sct-runner-events-4c08c3db.tar.gz)
- **sct-4c08c3db.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991/20241003_001102/sct-4c08c3db.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991/20241003_001102/sct-4c08c3db.log.tar.gz)
- **loader-set-4c08c3db.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991/20241003_001102/loader-set-4c08c3db.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991/20241003_001102/loader-set-4c08c3db.tar.gz)
- **monitor-set-4c08c3db.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991/20241003_001102/monitor-set-4c08c3db.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991/20241003_001102/monitor-set-4c08c3db.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-6.2/job/rolling-upgrade/job/rolling-upgrade-ami-test/4/)
[Argus](https://argus.scylladb.com/test/86b69cab-d2c8-43f3-87ce-8e23bbe75848/runs?additionalRuns[]=4c08c3db-5eb4-4c54-a3b5-0efa6cc6c991)
kbr-scylla commented 1 month ago

@aleksbykov isn't this the same as https://github.com/scylladb/scylladb/issues/20754? You also reported a connection close there.

aleksbykov commented 1 month ago

@kbr-scylla, the case looks very similar, but the error message is different: `raft_topology - topology change coordinator fiber got error std::runtime_error (raft topology: exec_global_command(barrier) failed with seastar::rpc::closed_error (connection is closed))`
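For context when comparing the two issues: nested messages like this usually arise when outer code catches the transport-level exception and re-throws it wrapped in a `std::runtime_error`, so both reports can share the same underlying `seastar::rpc::closed_error` while the wrapping call site (and therefore the message) differs. A minimal standalone C++ illustration of that wrapping pattern (not the actual ScyllaDB code; `exec_global_command` here is a simplified stand-in):

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

// Stand-in for seastar::rpc::closed_error (the real type lives in Seastar).
struct closed_error : std::runtime_error {
    closed_error() : std::runtime_error("connection is closed") {}
};

// Simplified stand-in for the coordinator's helper: it catches the transport
// error and re-throws it wrapped, producing the nested message seen in the log.
void exec_global_command(const std::string& cmd) {
    try {
        throw closed_error{};  // simulate the RPC failing mid-barrier
    } catch (const closed_error& e) {
        throw std::runtime_error(
            "raft topology: exec_global_command(" + cmd +
            ") failed with seastar::rpc::closed_error (" + e.what() + ")");
    }
}

int main() {
    try {
        exec_global_command("barrier");
    } catch (const std::exception& e) {
        // Mirrors the coordinator's log line from the issue description.
        std::cerr << "topology change coordinator fiber got error "
                     "std::runtime_error (" << e.what() << ")\n";
    }
}
```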

fruch commented 1 month ago

Seen this happening on a master run this week.

Packages

Base Scylla version: 6.2.0~rc1-20240919.a71d4bc49cc8 with build-id b4036257ffcab230cd320b1b62fa05de35460c13
Target Scylla version (or git commit hash): 6.3.0~dev-20241004.882a3c60e4a5 with build-id 18d05b9776a41807ef6d1e3080c8ebb1a2257831

Kernel Version: 6.8.0-1016-aws

Installation details

Cluster size: 4 nodes (im4gn.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0629b30bb6e5459a6 (aws: undefined_region)

Test: rolling-upgrade-ami-arm-test
Test id: c36c4f3f-74f5-464d-af80-385bea38caa4
Test name: scylla-master/rolling-upgrade/rolling-upgrade-ami-arm-test
Test method: upgrade_test.UpgradeTest.test_rolling_upgrade
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor c36c4f3f-74f5-464d-af80-385bea38caa4`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=c36c4f3f-74f5-464d-af80-385bea38caa4)
- Show all stored logs command: `$ hydra investigate show-logs c36c4f3f-74f5-464d-af80-385bea38caa4`

Logs:

- **db-cluster-c36c4f3f.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/c36c4f3f-74f5-464d-af80-385bea38caa4/20241006_141552/db-cluster-c36c4f3f.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/c36c4f3f-74f5-464d-af80-385bea38caa4/20241006_141552/db-cluster-c36c4f3f.tar.gz)
- **sct-runner-events-c36c4f3f.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/c36c4f3f-74f5-464d-af80-385bea38caa4/20241006_141552/sct-runner-events-c36c4f3f.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/c36c4f3f-74f5-464d-af80-385bea38caa4/20241006_141552/sct-runner-events-c36c4f3f.tar.gz)
- **sct-c36c4f3f.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/c36c4f3f-74f5-464d-af80-385bea38caa4/20241006_141552/sct-c36c4f3f.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/c36c4f3f-74f5-464d-af80-385bea38caa4/20241006_141552/sct-c36c4f3f.log.tar.gz)
- **loader-set-c36c4f3f.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/c36c4f3f-74f5-464d-af80-385bea38caa4/20241006_141552/loader-set-c36c4f3f.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/c36c4f3f-74f5-464d-af80-385bea38caa4/20241006_141552/loader-set-c36c4f3f.tar.gz)
- **monitor-set-c36c4f3f.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/c36c4f3f-74f5-464d-af80-385bea38caa4/20241006_141552/monitor-set-c36c4f3f.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/c36c4f3f-74f5-464d-af80-385bea38caa4/20241006_141552/monitor-set-c36c4f3f.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/rolling-upgrade/job/rolling-upgrade-ami-arm-test/133/)
[Argus](https://argus.scylladb.com/test/8e5f044d-57c2-43ad-8fe8-ac75a11a422d/runs?additionalRuns[]=c36c4f3f-74f5-464d-af80-385bea38caa4)
kbr-scylla commented 1 month ago

After the rolling upgrade completed, the node saw that every node in the cluster supported a new feature, and it tried to execute a global barrier before marking the feature as enabled. Unfortunately, this raced with the gossiper, which didn't yet see the restarted node as UP. When the gossiper sees a node as restarted, it resets the connections to it; this is what caused the barrier to fail. The coordinator then retried and succeeded.
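To make the sequence concrete, here is a minimal standalone C++ model of the race and the retry (illustrative only; `connection_reset` and `global_barrier` are simplified stand-ins, not ScyllaDB APIs):

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <stdexcept>
#include <thread>

// Models the gossiper dropping connections when it sees a restarted node.
std::atomic<bool> connection_reset{true};

// Models the barrier RPC: it fails once if it races with the reset.
void global_barrier() {
    if (connection_reset.exchange(false)) {
        throw std::runtime_error("connection is closed");  // rpc::closed_error in reality
    }
}

int main() {
    // The coordinator fiber is enabling features and runs a global barrier.
    for (int attempt = 1;; ++attempt) {
        try {
            global_barrier();
            std::cout << "barrier succeeded on attempt " << attempt << "\n";
            break;
        } catch (const std::exception& e) {
            // The first attempt races with the connection reset and fails;
            // the coordinator logs an ERROR and retries, as seen in the logs.
            std::cerr << "attempt " << attempt << " failed: " << e.what() << "\n";
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }
    }
}
```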

Similar to https://github.com/scylladb/scylladb/issues/20588#issuecomment-2363492593.

One way to mitigate this would be to modify `topology_coordinator::exec_global_command` to wait for nodes to be marked as UP before executing the global command (in this case, the barrier). Alternatively, don't report an ERROR when the barrier fails due to `rpc::closed_error`.
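A rough sketch of the first mitigation, assuming a hypothetical `is_alive()` liveness check (the real coordinator would consult the gossiper / failure detector rather than this stub):

```cpp
#include <chrono>
#include <iostream>
#include <string>
#include <thread>
#include <unordered_set>

// Hypothetical liveness check; in ScyllaDB this would query the gossiper /
// failure detector instead of this always-true stub.
bool is_alive(const std::string& /*node*/) {
    return true;
}

// Sketch of the proposed change: before sending the global command, wait
// until every non-excluded node is seen as UP, so the barrier cannot race
// with the gossiper resetting connections to a freshly restarted node.
void exec_global_command(const std::string& cmd,
                         const std::unordered_set<std::string>& nodes,
                         const std::unordered_set<std::string>& excluded) {
    for (const auto& node : nodes) {
        if (excluded.count(node) != 0) {
            continue;
        }
        while (!is_alive(node)) {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
    }
    std::cout << "executing global topology command " << cmd << "\n";
    // ... send the RPC to every non-excluded node here ...
}

int main() {
    exec_global_command("barrier",
                        {"node-1", "node-2", "node-3", "node-4"},
                        {"node-3"});
}
```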

xtrey commented 3 weeks ago

Spotted in 6.1.3