flaky test: router/router

Serpentian commented 1 year ago

Test failed! Result content mismatch:
--- router/router.result    Mon Dec  5 10:16:19 2022
+++ /home/runner/work/vshard/vshard/test/var/rejects/router/router.reject   Mon Dec  5 10:19:41 2022
@@ -415,9 +415,9 @@
 ---
 - - status: unknown
   - uuid: <replicaset_2>
-    status: available_ro
+    status: unreachable
   - uuid: <replicaset_2>
-    status: available_ro
+    status: unreachable
 ...
 _ = test_run:cmd('start server storage_2_a')
 ---

Reproduced on Tarantool 2.7

Serpentian commented 1 year ago

I cannot reproduce it locally. However, I got the following output when making a read-only request for the second bucket:

[001] router/router.test.lua                                          [ fail ]
[001]
[001] Test failed! Result content mismatch:
[001] --- router/router.result  Fri May 19 02:44:31 2023
[001] +++ var/rejects/router/router.reject  Tue May 23 14:02:48 2023
[001] @@ -401,23 +401,64 @@
[001]  _ = test_run:cmd('stop server storage_2_a')
[001]  ---
[001]  ...
[001] -util.check_error(vshard.router.call, 1, 'read', 'echo', {123})
[001] ----
[001] -- null
[001] -- bucket_id: 1
[001] -  code: 8
[001] -  unreachable_uuid: <replicaset_2>
[001] -  name: UNREACHABLE_REPLICASET
[001] -  message: There is no active replicas in replicaset <replicaset_2>
[001] -  type: ShardingError
[001] +util.check_error(vshard.router.call, 2, 'read', 'echo', {123})
[001] +---
[001] +- 123
[001] +- null
[001] +...
[001] +vshard.router.info()
[001] +---
[001] +- replicasets:
[001] +    <replicaset_2>:
[001] +      replica:
[001] +        status: missing
[001] +      bucket:
[001] +        unreachable: 2
[001] +      uuid: <replicaset_2>
[001] +      master:
[001] +        network_timeout: 0.5
[001] +        status: unreachable
[001] +        uri: storage@127.0.0.1:3303
[001] +        uuid: <storage_2_a>
[001] +    <replicaset_1>:
[001] +      replica:
[001] +        status: missing
[001] +      bucket:
[001] +        available_rw: 0
[001] +      uuid: <replicaset_1>
[001] +      master:
[001] +        network_timeout: 0.5
[001] +        status: available
[001] +        uri: storage@127.0.0.1:3301
[001] +        uuid: <storage_1_a>
[001] +  bucket:
[001] +    unreachable: 2
[001] +    available_ro: 0
[001] +    unknown: 2998
[001] +    available_rw: 0
[001] +  status: 3
[001] +  alerts:
[001] +  - ['UNREACHABLE_MASTER', 'Master of replicaset <replicaset_2>
[001] +      is unreachable: disconnected']
[001] +  - ['SUBOPTIMAL_REPLICA', 'A current read replica in replicaset <replicaset_2>
[001] +      is not optimal']
[001] +  - ['UNREACHABLE_REPLICASET', 'There is no active replicas in replicaset <replicaset_2>']
[001] +  - ['SUBOPTIMAL_REPLICA', 'A current read replica in replicaset <replicaset_1>
[001] +      is not optimal']
[001] +  - ['UNKNOWN_BUCKETS', '2998 buckets are not discovered']
[001]  ...
[001]  vshard.router.buckets_info(0, 3)
[001]  ---
[001]  - - status: unknown
[001]    - uuid: <replicaset_2>
[001] -    status: available_ro
[001] +    status: unreachable
[001]    - uuid: <replicaset_2>
[001] -    status: available_ro
[001] +    status: unreachable
[001] +...
[001] +util.check_error(vshard.router.call, 2, 'read', 'echo', {123})
[001] +---
[001] +- 123
[001] +- null
[001]  ...
[001]  _ = test_run:cmd('start server storage_2_a')

vshard.router.buckets_info() reports the bucket as unreachable even though it is not, and vshard.router.info() reports the replica as missing even though it is available. In the test, this is probably reproduced when failover wakes up.

Serpentian commented 1 year ago

The problem is that failover sometimes doesn't have enough time to wake up and call down_replica_priority on the stopped master. As a result, buckets_info and router_info assume that the whole replicaset is down until failover wakes up and changes the replicaset.replica field.
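
To make the race easier to see, here is a toy model in plain Lua. It is not vshard's source code: the table layout and the functions new_replicaset, bucket_status, read_call and failover_step are hypothetical names made up for illustration. It assumes that status reporting looks only at the cached replicaset.replica field, while a read call can fall back to any connected replica, which is consistent with the output above, where the call returns 123 while the bucket is reported as unreachable.

```lua
-- Toy model of the race; all names here are hypothetical, not vshard APIs.
local function new_replicaset()
    local master = {name = 'storage_2_a', connected = true}
    local slave = {name = 'storage_2_b', connected = true}
    return {
        master = master,
        -- Cached "current read replica"; only failover_step() updates it.
        replica = master,
        priority_list = {master, slave},
    }
end

-- Status reporting: trusts the cached fields only.
local function bucket_status(rs)
    if rs.master.connected then
        return 'available_rw'
    elseif rs.replica ~= nil and rs.replica.connected then
        return 'available_ro'
    end
    return 'unreachable'
end

-- Read request: assumed to try every replica until one is connected.
local function read_call(rs, value)
    for _, r in ipairs(rs.priority_list) do
        if r.connected then
            return value
        end
    end
    return nil, 'UNREACHABLE_REPLICASET'
end

-- One failover iteration: move the cached replica off a dead instance.
local function failover_step(rs)
    if rs.replica.connected then
        return
    end
    for _, r in ipairs(rs.priority_list) do
        if r.connected then
            rs.replica = r
            return
        end
    end
end

local rs = new_replicaset()
rs.master.connected = false           -- 'stop server storage_2_a'
print(bucket_status(rs))              -- unreachable (failover has not run yet)
print(read_call(rs, 123))             -- 123 (the read still succeeds)
failover_step(rs)                     -- failover finally wakes up
print(bucket_status(rs))              -- available_ro
```

In this model the test result depends on whether the status check runs before or after failover_step, which matches the intermittent available_ro / unreachable mismatch in the diff.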

tarantool / vshard

flaky test: router/router #389