Closed Serpentian closed 1 year ago
I cannot reproduce this locally, but I got the following output while making a read-only request for the second bucket:
[001] router/router.test.lua [ fail ]
[001]
[001] Test failed! Result content mismatch:
[001] --- router/router.result Fri May 19 02:44:31 2023
[001] +++ var/rejects/router/router.reject Tue May 23 14:02:48 2023
[001] @@ -401,23 +401,64 @@
[001] _ = test_run:cmd('stop server storage_2_a')
[001] ---
[001] ...
[001] -util.check_error(vshard.router.call, 1, 'read', 'echo', {123})
[001] ----
[001] -- null
[001] -- bucket_id: 1
[001] - code: 8
[001] - unreachable_uuid: <replicaset_2>
[001] - name: UNREACHABLE_REPLICASET
[001] - message: There is no active replicas in replicaset <replicaset_2>
[001] - type: ShardingError
[001] +util.check_error(vshard.router.call, 2, 'read', 'echo', {123})
[001] +---
[001] +- 123
[001] +- null
[001] +...
[001] +vshard.router.info()
[001] +---
[001] +- replicasets:
[001] + <replicaset_2>:
[001] + replica:
[001] + status: missing
[001] + bucket:
[001] + unreachable: 2
[001] + uuid: <replicaset_2>
[001] + master:
[001] + network_timeout: 0.5
[001] + status: unreachable
[001] + uri: storage@127.0.0.1:3303
[001] + uuid: <storage_2_a>
[001] + <replicaset_1>:
[001] + replica:
[001] + status: missing
[001] + bucket:
[001] + available_rw: 0
[001] + uuid: <replicaset_1>
[001] + master:
[001] + network_timeout: 0.5
[001] + status: available
[001] + uri: storage@127.0.0.1:3301
[001] + uuid: <storage_1_a>
[001] + bucket:
[001] + unreachable: 2
[001] + available_ro: 0
[001] + unknown: 2998
[001] + available_rw: 0
[001] + status: 3
[001] + alerts:
[001] + - ['UNREACHABLE_MASTER', 'Master of replicaset <replicaset_2>
[001] + is unreachable: disconnected']
[001] + - ['SUBOPTIMAL_REPLICA', 'A current read replica in replicaset <replicaset_2>
[001] + is not optimal']
[001] + - ['UNREACHABLE_REPLICASET', 'There is no active replicas in replicaset <replicaset_2>']
[001] + - ['SUBOPTIMAL_REPLICA', 'A current read replica in replicaset <replicaset_1>
[001] + is not optimal']
[001] + - ['UNKNOWN_BUCKETS', '2998 buckets are not discovered']
[001] ...
[001] vshard.router.buckets_info(0, 3)
[001] ---
[001] - - status: unknown
[001] - uuid: <replicaset_2>
[001] - status: available_ro
[001] + status: unreachable
[001] - uuid: <replicaset_2>
[001] - status: available_ro
[001] + status: unreachable
[001] +...
[001] +util.check_error(vshard.router.call, 2, 'read', 'echo', {123})
[001] +---
[001] +- 123
[001] +- null
[001] ...
[001] _ = test_run:cmd('start server storage_2_a')
vshard.router.buckets_info() says that the bucket is unreachable even when it is not. vshard.router.info() says that the replica is missing, but the replica is available. In the test, this probably reproduces when failover wakes up.
The problem is that failover sometimes doesn't have enough time to wake up and call down_replica_priority on the stopped master. So buckets_info and router_info assume that the whole replicaset is down until failover wakes up and changes the replicaset.replica field.
Reproduced on Tarantool 2.7
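
To make the suspected race easier to follow, here is a minimal toy model in plain Lua. It is not vshard code: new_replicaset, call_ro, bucket_status and failover_step are hypothetical names, and the logic is a simplification of the behaviour described above. It assumes that a read call can fall back to any connected node, while the status report only looks at the cached replicaset.replica field, which is refreshed by failover.

```lua
-- Toy model of the race; NOT vshard code. All names here are hypothetical.
local function new_replicaset()
    local master = {name = 'storage_2_a', connected = true}
    local follower = {name = 'storage_2_b', connected = true}
    return {
        master = master,
        nodes = {master, follower},
        -- Cached read target; assumed to be refreshed only by failover.
        replica = master,
    }
end

-- Read call: falls back to any connected node, ignoring the cache.
local function call_ro(rs)
    for _, node in ipairs(rs.nodes) do
        if node.connected then
            return node.name
        end
    end
    return nil
end

-- Status report: trusts only the cached master/replica fields.
local function bucket_status(rs)
    if rs.master.connected then
        return 'available_rw'
    elseif rs.replica ~= nil and rs.replica.connected then
        return 'available_ro'
    end
    return 'unreachable'
end

-- Failover step: re-pick the cached read target if it is gone.
local function failover_step(rs)
    if rs.replica ~= nil and rs.replica.connected then
        return
    end
    rs.replica = nil
    for _, node in ipairs(rs.nodes) do
        if node.connected then
            rs.replica = node
            break
        end
    end
end

local rs = new_replicaset()
rs.master.connected = false       -- models 'stop server storage_2_a'
local served_by = call_ro(rs)     -- the read still succeeds
local before = bucket_status(rs)  -- stale cache: reported as unreachable
failover_step(rs)                 -- failover finally wakes up
local after = bucket_status(rs)   -- now reported as available_ro
print(served_by, before, after)
```

In this model the read is served by storage_2_b while the status is still reported as unreachable, and the report becomes available_ro only after the failover step has run. This mirrors the mismatch in the output above, under the stated assumptions.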