tarantool / vshard

The new generation of sharding based on virtual buckets

During the reconfiguration process, some requests on the router fail #482

Open Serpentian opened 2 months ago

Serpentian commented 2 months ago

We need to minimize the number of requests that fail on the router during reconfiguration. Users run into the router returning errors when, for example, a new replicaset is added. This is probably related to https://github.com/tarantool/vshard/issues/481

Serpentian commented 2 months ago

2 sp for investigating and figuring out whether we can do anything about this.

Serpentian commented 2 months ago

Fix the flakiness of the reconfiguration stress test within the scope of this issue:

rebalancer/stress_add_remove_several_rs.test.l> vinyl           [ fail ]

Test failed! Result content mismatch:
--- rebalancer/stress_add_remove_several_rs.result  Fri Apr 26 12:50:35 2024
+++ /__w/vshard-ee/vshard-ee/test/var/rejects/rebalancer/stress_add_remove_several_rs.reject    Mon Aug 12 04:25:42 2024
@@ -500,19 +500,19 @@
 ...
 #box.space._bucket.index.status:select{vshard.consts.BUCKET.ACTIVE}
 ---
-- 100
-...
-check_consistency()
----
-- true
-...
-test_run:switch('box_2_a')
----
-- true
-...
-#box.space._bucket.index.status:select{vshard.consts.BUCKET.ACTIVE}
----
-- 100
+- 66
+...
+check_consistency()
+---
+- true
+...
+test_run:switch('box_2_a')
+---
+- true
+...
+#box.space._bucket.index.status:select{vshard.consts.BUCKET.ACTIVE}
+---
+- 67
 ...
 check_consistency()
 ---
@@ -524,24 +524,24 @@
 ...
 #box.space._bucket.index.status:select{vshard.consts.BUCKET.ACTIVE}
 ---
+- 67
+...
+check_consistency()
+---
+- true
+...
+test_run:switch('box_4_a')
+---
+- true
+...
+#box.space._bucket.index.status:select{vshard.consts.BUCKET.ACTIVE}
+---
 - 0
 ...
 check_consistency()
 ---
 - true
 ...
-test_run:switch('box_4_a')
----
-- true
-...
-#box.space._bucket.index.status:select{vshard.consts.BUCKET.ACTIVE}
----
-- 0
-...
-check_consistency()
----
-- true
-...
 test_run:switch('default')
 ---
 - true

[test-run server "test"] Last 15 lines of the log file /__w/vshard-ee/vshard-ee/test/var/001_rebalancer/test.log:
2024-08-12 04:25:18.432 [1651] main/347/console/unix/: I> Slaves are connected to a master "box_1_a"
2024-08-12 04:25:18.433 [1651] main/347/console/unix/: I> Waiting until slaves are connected to a master
2024-08-12 04:25:18.539 [1651] main/347/console/unix/: I> Slaves are connected to a master "box_2_a"
2024-08-12 04:25:19.175 [1651] main/355/console/unix/: I> Waiting until slaves are connected to a master
2024-08-12 04:25:19.283 [1651] main/355/console/unix/: I> Slaves are connected to a master "box_3_a"
2024-08-12 04:25:21.722 [1651] main/361/console/unix/: I> Waiting until slaves are connected to a master
2024-08-12 04:25:21.829 [1651] main/361/console/unix/: I> Slaves are connected to a master "box_4_a"
2024-08-12 04:25:31.071 [1651] main/367/console/unix/: I> Waiting until slaves are connected to a master
2024-08-12 04:25:31.079 [1651] main/367/console/unix/: I> Slaves are connected to a master "box_1_a"
2024-08-12 04:25:31.079 [1651] main/367/console/unix/: I> Waiting until slaves are connected to a master
2024-08-12 04:25:31.291 [1651] main/367/console/unix/: I> Slaves are connected to a master "box_2_a"
2024-08-12 04:25:31.845 [1651] main/373/console/unix/: I> Waiting until slaves are connected to a master
2024-08-12 04:25:31.953 [1651] main/373/console/unix/: I> Slaves are connected to a master "box_3_a"
2024-08-12 04:25:34.402 [1651] main/379/console/unix/: I> Waiting until slaves are connected to a master
2024-08-12 04:25:34.509 [1651] main/379/console/unix/: I> Slaves are connected to a master "box_4_a"
Reproduce file /__w/vshard-ee/vshard-ee/test/var/reproduce/001_rebalancer.list.yaml
---
- [rebalancer/bucket_ref.test.lua, null]
- [rebalancer/errinj.test.lua, null]
- [rebalancer/parallel.test.lua, memtx]
- [rebalancer/parallel.test.lua, vinyl]
- [rebalancer/rebalancer.test.lua, memtx]
- [rebalancer/rebalancer.test.lua, vinyl]
- [rebalancer/rebalancer2.test.lua, null]
- [rebalancer/rebalancer_lock_and_pin.test.lua, null]
- [rebalancer/receiving_bucket.test.lua, null]
- [rebalancer/restart_during_rebalancing.test.lua, memtx]
- [rebalancer/restart_during_rebalancing.test.lua, vinyl]
- [rebalancer/stress_add_remove_rs.test.lua, memtx]
- [rebalancer/stress_add_remove_rs.test.lua, vinyl]
- [rebalancer/stress_add_remove_several_rs.test.lua, memtx]
- [rebalancer/stress_add_remove_several_rs.test.lua, vinyl]
...
---------------------------------------------------------------------------
[Instance test] Stopping the server...
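Reading the two hunks of the diff together, and assuming (as the order of the log lines suggests, though the diff itself does not show the switch) that the unlabeled counters at the top of each hunk belong to box_1_a and box_3_a, the expected ACTIVE bucket counts for box_1_a to box_4_a are 100, 100, 0, 0, while the reject file has 66, 67, 67, 0. The totals agree, so no bucket was lost; the buckets are just spread evenly over three replicasets instead of two:

$$
\underbrace{100 + 100 + 0 + 0}_{\text{expected}} \;=\; \underbrace{66 + 67 + 67 + 0}_{\text{actual}} \;=\; 200
$$

$$
\frac{200}{2} = 100, \qquad \frac{200}{3} \approx 66.7 \;\Rightarrow\; \{66,\ 67,\ 67\}
$$

Since check_consistency() still returns true in the reject, this looks like the bucket counts were compared before rebalancing onto two replicasets had converged, rather than like data loss. That is an interpretation of the numbers, not something the log confirms.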