tarantool / vshard

The new generation of sharding based on virtual buckets

flaky test: rebalancer/stress_add_remove_rs.test.lua (on memtx) hangs #309

Closed kyukhin closed 2 years ago

kyukhin commented 2 years ago

The log can be seen here.

Totktonada commented 2 years ago

The logs in CI expire after some time (I don't remember precisely, several months), so I suggest attaching them to issues.

logs.txt

Gerold103 commented 2 years ago

Hangs on vinyl too sometimes. The reason is that rebalancer_utils.start_loading(), if called more than once during one test run, creates duplicate primary keys, because it uses write_iterations as the pk, and that counter is reset to 0 on each start_loading(). A good-enough fix could be to truncate the test space before the next start_loading().

Also, as a safety measure, it is worth limiting write_iterations somehow. For example, cap it at N * bucket_count and increment it by modulo of this value. Then it is bounded and won't produce duplicate pks.
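
For illustration, a minimal sketch of the two safeguards. The space name test, the value of N, and bucket_count are assumptions here; the real rebalancer_utils code may look different:

-- Sketch only, not the actual test helper.
local N = 10                 -- assumed cap multiplier
local bucket_count = 3000    -- assumed total bucket count in the test
local write_iterations = 0

local function start_loading()
    -- Drop tuples left over from a previous load run, so a counter that
    -- starts from 0 again cannot clash with already written primary keys.
    box.space.test:truncate()
    write_iterations = 0
    -- ... start the load fiber here ...
end

local function next_pk()
    -- Keep the counter bounded: it cycles within [0, N * bucket_count),
    -- so the set of generated primary keys is limited and predictable.
    write_iterations = (write_iterations + 1) % (N * bucket_count)
    return write_iterations
end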

Serpentian commented 2 years ago

Unfortunately, truncating the spaces doesn't solve the problem. Moreover, why do we bother about duplicate primary keys when do_write_load calls do_replace?

Sometimes read_iterations equals 0, which means the router cannot read any tuple from the replicas. The router's log looks like this:

2022-07-07 17:43:54.010 [121130] main/150/console/::1:56958 I> Starting router configuration
2022-07-07 17:43:54.011 [121130] main/150/console/::1:56958 I> Calling box.cfg()...
2022-07-07 17:43:54.011 [121130] main/150/console/::1:56958 I> {"replication_connect_quorum":0}
2022-07-07 17:43:54.011 [121130] main/150/console/::1:56958 I> {"replication_connect_timeout":0.01}
2022-07-07 17:43:54.011 [121130] main/150/console/::1:56958 I> Box has been configured
2022-07-07 17:43:54.012 [121130] main/151/127.0.0.1:3306 (net.box) I> connected to 127.0.0.1:3306
2022-07-07 17:43:54.013 [121130] main/152/127.0.0.1:3305 (net.box) I> connected to 127.0.0.1:3305
2022-07-07 17:43:54.013 [121130] main/153/lua I> Old replicaset and replica objects are outdated.
2022-07-07 17:43:54.350 [121130] main/123/vshard.failover._static_router I> New replica box_3_a(storage@127.0.0.1:3305) for replicaset(uuid="910ee49b-2540-41b6-9b8c-c976bef1bb17", master=box_3_a(storage@127.0.0.1:3305))
2022-07-07 17:43:54.350 [121130] main/123/vshard.failover._static_router I> New replica box_2_a(storage@127.0.0.1:3303) for replicaset(uuid="ac522f65-aa94-4134-9f64-51ee384f1a54", master=box_2_a(storage@127.0.0.1:3303))
2022-07-07 17:43:54.350 [121130] main/123/vshard.failover._static_router I> New replica box_1_a(storage@127.0.0.1:3301) for replicaset(uuid="cbf06940-0790-498b-948d-042b62cf3d29", master=box_1_a(storage@127.0.0.1:3301))
2022-07-07 17:43:54.350 [121130] main/123/vshard.failover._static_router I> Failovering step is finished. Schedule next after 1.000000 seconds
2022-07-07 17:43:54.740 [121130] main/141/lua replicaset.lua:362 E> Exception during calling 'vshard.storage.call' on 'box_1_a(storage@127.0.0.1:3301)': fiber is cancelled
2022-07-07 17:43:54.740 [121130] main/141/lua replicaset.lua:362 E> Exception during calling 'vshard.storage.call' on 'box_1_a(storage@127.0.0.1:3301)': fiber is cancelled
2022-07-07 17:43:54.740 [121130] main/141/lua replicaset.lua:362 E> Exception during calling 'vshard.storage.call' on 'box_1_a(storage@127.0.0.1:3301)': fiber is cancelled
...
2022-07-07 17:44:26.626 [121130] main/141/lua replicaset.lua:362 E> Exception during calling 'vshard.storage.call' on 'box_2_a(storage@127.0.0.1:3303)': fiber is cancelled
2022-07-07 17:44:26.627 [121130] main/141/lua replicaset.lua:362 E> Exception during calling 'vshard.storage.call' on 'box_2_a(storage@127.0.0.1:3303)': fiber is cancelled
2022-07-07 17:44:26.627 [121130] main/141/lua replicaset.lua:362 E> Exception during calling 'vshard.storage.call' on 'box_2_a(storage@127.0.0.1:3303)': fiber is cancelled
2022-07-07 17:44:26.627 [121130] main/141/lua replicaset.lua:362 E> Exception during calling 'vshard.storage.call' on 'box_2_a(storage@127.0.0.1:3303)': fiber is cancelled
2022-07-07 17:44:26.627 [121130] main C> got signal 2 - Interrupt
2022-07-07 17:44:26.628 [121130] main/141/lua replicaset.lua:362 E> Exception during calling 'vshard.storage.call' on 'box_2_a(storage@127.0.0.1:3303)': fiber is cancelled
2022-07-07 17:44:26.628 [121130] main/152/127.0.0.1:3305 (net.box) I> disconnected from 127.0.0.1:3305
2022-07-07 17:44:26.628 [121130] main/152/127.0.0.1:3305 (net.box) net_box.lua:307 W> 127.0.0.1:3305: Peer closed
2022-07-07 17:44:26.628 [121130] main/151/127.0.0.1:3306 (net.box) I> disconnected from 127.0.0.1:3306
2022-07-07 17:44:26.628 [121130] main/151/127.0.0.1:3306 (net.box) net_box.lua:307 W> 127.0.0.1:3306: Peer closed
2022-07-07 17:44:26.628 [121130] main/121/127.0.0.1:3301 (net.box) I> disconnected from 127.0.0.1:3301
2022-07-07 17:44:26.628 [121130] main/121/127.0.0.1:3301 (net.box) net_box.lua:307 W> 127.0.0.1:3301: Peer closed
2022-07-07 17:44:26.628 [121130] main/122/127.0.0.1:3302 (net.box) I> disconnected from 127.0.0.1:3302
2022-07-07 17:44:26.628 [121130] main/122/127.0.0.1:3302 (net.box) net_box.lua:307 W> 127.0.0.1:3302: Peer closed
2022-07-07 17:44:26.629 [121130] main/119/127.0.0.1:3304 (net.box) I> disconnected from 127.0.0.1:3304
2022-07-07 17:44:26.629 [121130] main/119/127.0.0.1:3304 (net.box) net_box.lua:307 W> 127.0.0.1:3304: Peer closed
2022-07-07 17:44:26.629 [121130] main/187/iproto.shutdown I> tx_binary: stopped

The problem is definitely in this cancelled fiber on the storage. I'm still investigating it.

@Gerold103, maybe you have some thoughts on this?

I'm attaching all the logs. The out.txt file is the output from rebalancer_utils.stop_loading(): rebalancer.tgz

Serpentian commented 2 years ago

However, I could have done something wrong with the truncation. Here's the draft PR with it: #349. Unfortunately, the test passed on CI this time, but if you restart it several times, it will eventually fail.

Gerold103 commented 2 years ago

Moreover, why do we bother about duplicate primary keys when do_write_load calls do_replace?

If you start the load, then stop it, and start it again, write_iterations starts from 0 again. That causes replaces of already-generated primary keys into potentially different buckets. So you get tuples like {1, 10} and {1, 20} in buckets 10 and 20 on different replicasets. But when the rebalancer tries to put buckets 10 and 20 onto the same replicaset, the transfer fails with a duplicate primary key {1} error. I can definitely say this was one of the reasons why the test fails, because I saw in the logs a duplicate primary key error where the key clashed from 2 different buckets. That doesn't mean it is the only reason, though.
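
To illustrate the clash (the {pk, bucket_id} layout and the space name test are assumptions here, not the real test schema):

-- Replicaset A owns bucket 10, replicaset B owns bucket 20.

-- First load run: the counter produces pk 1, routed to bucket 10,
-- so replicaset A ends up with:
box.space.test:replace({1, 10})

-- The load is stopped and restarted, the counter resets, pk 1 is generated
-- again but this time routed to bucket 20, so replicaset B ends up with:
box.space.test:replace({1, 20})

-- Later the rebalancer moves bucket 20 onto replicaset A. Applying the
-- received tuple {1, 20} there clashes with the existing {1, 10} in the
-- unique primary index, and the bucket transfer fails with a duplicate
-- primary key error.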

The problem is definitely in this cancelled fiber on the storage.

It doesn't necessarily mean it was cancelled on the storage. More likely it was cancelled on the router. The net.box internal fiber, for instance. Something could have gone wrong inside it.

Serpentian commented 2 years ago

Most likely, the issue is related to #341.

Gerold103 commented 2 years ago

I wouldn't be so sure. In #341 they kill all the fibers explicitly. In the tests we don't do that.

Gerold103 commented 2 years ago

Sorry, truncation was never needed. I figured out the reason; see the attached PR above.