It also catches an assertion from time to time:
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007f0cf07da8b1 in __GI_abort () at abort.c:79
#2 0x00007f0cf07ca42a in __assert_fail_base (fmt=0x7f0cf0951a38 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x55e86001da89 "lsn > limbo->confirmed_lsn",
file=file@entry=0x55e86001d708 "/home/kosarev/tarantool/src/box/txn_limbo.c", line=line@entry=340, function=function@entry=0x55e86001dc40 <__PRETTY_FUNCTION__.12489> "txn_limbo_write_confirm")
at assert.c:92
#3 0x00007f0cf07ca4a2 in __GI___assert_fail (assertion=0x55e86001da89 "lsn > limbo->confirmed_lsn", file=0x55e86001d708 "/home/kosarev/tarantool/src/box/txn_limbo.c", line=340,
function=0x55e86001dc40 <__PRETTY_FUNCTION__.12489> "txn_limbo_write_confirm") at assert.c:101
#4 0x000055e85fbdb753 in txn_limbo_write_confirm (limbo=0x55e860416a60 <txn_limbo>, lsn=9) at /home/kosarev/tarantool/src/box/txn_limbo.c:340
#5 0x000055e85fbdc328 in txn_limbo_force_empty (limbo=0x55e860416a60 <txn_limbo>, confirm_lsn=9) at /home/kosarev/tarantool/src/box/txn_limbo.c:622
#6 0x000055e85fbe291e in box_clear_synchro_queue () at /home/kosarev/tarantool/src/box/box.cc:1071
#7 0x000055e85fcb5cd3 in lbox_ctl_clear_synchro_queue (L=0x4151c0b8) at /home/kosarev/tarantool/src/box/lua/ctl.c:85
#8 0x000055e85fe400cb in lj_BC_FUNCC ()
#9 0x000055e85fdfb84b in jit_secure_call (L=0x4151c0b8, base=0x4151c118, nres=0) at lj_api.c:88
#10 0x000055e85fe02ae0 in lua_call (L=0x4151c0b8, nargs=0, nresults=-1) at lj_api.c:1133
#11 0x000055e85fca3b94 in execute_lua_eval (L=0x4151c0b8) at /home/kosarev/tarantool/src/box/lua/call.c:394
#12 0x000055e85fe400cb in lj_BC_FUNCC ()
#13 0x000055e85fe02e15 in lua_pcall (L=0x4151c0b8, nargs=1, nresults=-1, errfunc=0) at lj_api.c:1158
#14 0x000055e85fd87bd6 in luaT_call (L=0x4151c0b8, nargs=1, nreturns=-1) at /home/kosarev/tarantool/src/lua/utils.c:1016
#15 0x000055e85fca42e1 in box_process_lua (handler=HANDLER_EVAL, ctx=0x7f0ce79ffd10, ret=0x7f0ce79ffe60) at /home/kosarev/tarantool/src/box/lua/call.c:570
#16 0x000055e85fca43ff in box_lua_eval (expr=0x7f0ce701803d "box.ctl.clear_synchro_queue()!\220\021", expr_len=29, args=0x7f0ce79ffd80, ret=0x7f0ce79ffe60)
at /home/kosarev/tarantool/src/box/lua/call.c:614
#17 0x000055e85fc130bb in box_process_eval (request=0x7f0ce7010430, port=0x7f0ce79ffe60) at /home/kosarev/tarantool/src/box/call.c:187
#18 0x000055e85fcc75f4 in tx_process_call (m=0x7f0ce70103b0) at /home/kosarev/tarantool/src/box/iproto.cc:1632
#19 0x000055e85fdb062e in cmsg_deliver (msg=0x7f0ce70103b0) at /home/kosarev/tarantool/src/lib/core/cbus.c:375
#20 0x000055e85fdb170a in fiber_pool_f (ap=0x7f0ced801008) at /home/kosarev/tarantool/src/lib/core/fiber_pool.c:64
#21 0x000055e85fba0269 in fiber_cxx_invoke(fiber_func, typedef __va_list_tag __va_list_tag *) (f=0x55e85fdb156a <fiber_pool_f>, ap=0x7f0ced801008) at /home/kosarev/tarantool/src/lib/core/fiber.h:788
#22 0x000055e85fdaae15 in fiber_loop (data=0x0) at /home/kosarev/tarantool/src/lib/core/fiber.c:879
#23 0x000055e86001556d in coro_init () at /home/kosarev/tarantool/third_party/coro/coro.c:110
A minimal 100% reproducer. Follow the steps in the specified order on the specified instances. However, it does not explain why the test sometimes hangs; that I don't know.
-- Instance 1
-- Step 1
fiber = require('fiber')
box.cfg{
listen = 3313,
replication = {'localhost:3313', 'localhost:3314'},
replication_synchro_quorum = 2,
replication_synchro_timeout = 1000000,
read_only = false,
}
box.schema.user.grant('guest', 'super')
s1 = box.schema.create_space('test1', {is_sync = true})
_ = s1:create_index('pk')
-- Step 3
for i = 1, 10 do s1:replace{i} end
-- Step 5
box.cfg{
replication_synchro_timeout = 0.001,
}
box.ctl.clear_synchro_queue()
-- Instance 2
-- Step 2
box.cfg{
listen = 3314,
replication = {'localhost:3313', 'localhost:3314'},
replication_synchro_quorum = 3,
replication_synchro_timeout = 1000000,
read_only = true,
}
-- Step 4
box.cfg{read_only = false}
box.space.test1:replace{11}
I didn't look deeply, but I see that failures of the test (all kinds of them except the rare one with the error Found uncommitted sync transactions from other instance with id <..>) are accompanied by the RAFT: vote request is skipped - the vclock is not acceptable log message, while successful runs do not have this log message most of the time (but not always).
A minimal 100% reproducer. Follow the steps in the specified order on the specified instances.
I don't see anything that would look like an error or inconsistency in the space data on 2.7.0-2-g66bb1a507 and 2.7.0-32-g77c218cc1 (debug builds). Maybe I'm doing something wrong (I just copy-pasted the snippets into two consoles in the specified order).
It also catches an assertion from time to time.
Didn't catch it (I'm on a debug build of version 2.7.0-16-gbc6caac60, on Linux).
You are too late. The reproducer was for the crash, which I already fixed in https://github.com/tarantool/tarantool/commit/41ba1479abf630b378c3712ef52aad7a823f1b42. Unfortunately, the issue is clogged with tons of "avtikhon added a commit that referenced this issue yesterday" messages, so you could have missed my commit above.
The flakiness is not related to the crash, so I didn't close the issue.
Cases 1 and 2 are not present in the election_qsync.test.lua file at all. They existed in election_qsync_stress.test.lua in the beginning, but in https://github.com/tarantool/tarantool/commit/eb989bc7c68fa3be0b8dcfc8566a497f4013708a they were merged with other failures having the same cause.
Cases 4 and 5 were failing because the test didn't wait for the old leader to send all data to the other nodes. As a result, they could have less data than needed, and the space didn't have enough tuples at the end of the test. This seems to be fixed by https://github.com/tarantool/tarantool/commit/bf0fbf3a05c5db05be65fb1bce8b1ed9b959e607.
Cases 3 and 4 (the latter had 2 issues) seem to be caused by old code relying on box.info.election.role == 'leader' being enough for the instance to be writable. That is still a problem (see tarantool/tarantool#6689), but the test got a workaround in this commit: https://github.com/tarantool/tarantool/commit/71377c28e1c20108a8691481660bea6263f9a2e8.
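For illustration only, a minimal sketch of the extra condition such a test has to wait for (this is my own sketch, not the code from the commit; it assumes the usual box.info.election.state and box.info.ro fields and the test's 'test' space):
fiber = require('fiber')
-- Being reported as 'leader' and actually being writable are two separate
-- events, so wait for both before issuing writes.
while box.info.election.state ~= 'leader' or box.info.ro do
    fiber.sleep(0.01)
end
box.space.test:replace{1} -- an example write; it should no longer hit ER_READONLY here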
I have run both election_qsync.test.lua and its stress version on my machine for half an hour and couldn't get any failures. It probably still fails in CI, but I believe for other reasons, no longer related to this ticket.
As a summary: I think @sergepetrenko fixed this ticket quite some time ago in the commits above. Sergey, can you try it on your machine too? @sergos, could you also try? Sergey P. and I are both on Macs, I think, so we could use a person with Linux.
The most recent new error on master that is not in this ticket I see here: https://github.com/tarantool/tarantool/runs/5590105468?check_suite_focus=true. I have no idea what might be causing it; apparently it fails on bootstrap.
After running ./test-run.py $(yes election_qsync | head -100) a couple of times I got:
[141] replication/election_qsync_stress.test.lua vinyl
[141] * [QA Notice]
[141] *
[141] * Attempt to evaluate a command on the nonexistent server 'election_replica0'
[141] *
[141]
[141] * [QA Notice]
[141] *
[141] * Attempt to stop already stopped server 'election_replica1'
[141] *
[141] [ fail ]
[141]
[141] Test failed! Result content mismatch:
[141] --- replication/election_qsync_stress.result Fri Feb 11 13:34:49 2022
[141] +++ var/rejects/replication/election_qsync_stress.reject Wed Mar 23 18:26:30 2022
[141] @@ -111,6 +111,7 @@
[141] old_leader = new_leader
[141] end;
[141] | ---
[141] + | - error: Attempt to evaluate a command on the nonexistent server 'election_replica0'
[141] | ...
[141] test_run:cmd('setopt delimiter ""');
[141] | ---
[141] @@ -119,7 +120,7 @@
[141] -- We're connected to some leader.
[141] #c.space.test:select{} == 10 or require('log').error(c.space.test:select{})
[141] | ---
[141] - | - true
[141] + | - error: Peer closed
[141] | ...
[141]
[141] test_run:drop_cluster(SERVERS)
[141]
[141] Last 15 lines of Tarantool Log file [Instance "master"][/workspaces/t.sergos/test/var/141_replication/master.log]:
[141] 2022-03-23 18:25:18.773 [15158] main/121/console/unix/: I> election_replica1 -> election_replica2: waiting for connection
[141] 2022-03-23 18:25:18.792 [15158] main/121/console/unix/: I> connecting election_replica1 -> election_replica2: {"status":"follow"}
[141] 2022-03-23 18:25:18.792 [15158] main/121/console/unix/: I> election_replica1 -> election_replica2: connected
[141] 2022-03-23 18:25:18.792 [15158] main/121/console/unix/: I> election_replica3 -> election_replica2: waiting for connection
[141] 2022-03-23 18:25:18.816 [15158] main/121/console/unix/: I> connecting election_replica3 -> election_replica2: {"status":"follow"}
[141] 2022-03-23 18:25:18.816 [15158] main/121/console/unix/: I> election_replica3 -> election_replica2: connected
[141] 2022-03-23 18:25:18.816 [15158] main/121/console/unix/: I> election_replica3: waiting bootstrap
[141] 2022-03-23 18:25:18.831 [15158] main/121/console/unix/: I> election_replica3: bootstrapped
[141] 2022-03-23 18:25:18.831 [15158] main/121/console/unix/: I> election_replica1 -> election_replica3: waiting for connection
[141] 2022-03-23 18:25:18.862 [15158] main/121/console/unix/: I> connecting election_replica1 -> election_replica3: {"status":"follow"}
[141] 2022-03-23 18:25:18.862 [15158] main/121/console/unix/: I> election_replica1 -> election_replica3: connected
[141] 2022-03-23 18:25:18.862 [15158] main/121/console/unix/: I> election_replica2 -> election_replica3: waiting for connection
[141] 2022-03-23 18:25:18.877 [15158] main/121/console/unix/: I> connecting election_replica2 -> election_replica3: {"status":"follow"}
[141] 2022-03-23 18:25:18.878 [15158] main/121/console/unix/: I> election_replica2 -> election_replica3: connected
[141] 2022-03-23 18:25:18.878 [15158] main/121/console/unix/: I> full mesh connected
[141] Test "replication/election_qsync_stress.test.lua", conf: "vinyl"
[141] from "fragile" list failed, rerunning ...
Which apparently has only one way to appear, through this code:
new_leader_nr = get_leader(nrs)
new_leader = 'election_replica'..new_leader_nr
leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]
where get_leader() itself can return 0, which is then used in the eval.
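A minimal sketch of the kind of guard that could avoid it (hypothetical, not the actual test code; get_leader(), nrs and test_run come from the test itself):
fiber = require('fiber')
new_leader_nr = get_leader(nrs)
while new_leader_nr == 0 do
    -- no leader elected yet; poll instead of building 'election_replica0'
    fiber.sleep(0.1)
    new_leader_nr = get_leader(nrs)
end
new_leader = 'election_replica'..new_leader_nr
leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]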
Got another failure:
[081] replication/election_qsync_stress.test.lua memtx
[081]
[081] [Instance "election_replica2" returns with non-zero exit code: 1]
[081]
[081] Last 15 lines of Tarantool Log file [Instance "election_replica2"][/workspaces/t.sergos/test/var/081_replication/election_replica2.log]:
[081] 2022-03-28 13:40:14.541 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock sio.c:210 E> SocketError: connect to unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock, called on fd 26, aka unix/:(socket): No such file or directory
[081] 2022-03-28 13:40:14.541 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock I> will retry every 0.10 second
[081] 2022-03-28 13:40:14.541 [30745] main/113/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica1.sock I> remote master 6dd0c5ad-e4f9-4d3f-adc5-25d83bc0db84 at unix/:/workspaces/t.sergos/test/var/081_replication/election_replica1.sock running Tarantool 2.10.0
[081] 2022-03-28 13:40:14.543 [30745] main/114/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica2.sock I> remote master afebf07a-a327-4890-8eaa-7f8cfc159eaa at unix/:/workspaces/t.sergos/test/var/081_replication/election_replica2.sock running Tarantool 2.10.0
[081] 2022-03-28 13:40:14.845 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock I> remote master e9e613a7-bbc2-4250-9f3b-e13ad7e46b45 at unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock running Tarantool 2.10.0
[081] 2022-03-28 13:40:14.846 [30745] main/103/election_replica2 I> connected to 3 replicas
[081] 2022-03-28 13:40:14.847 [30745] main/103/election_replica2 I> bootstrapping replica from e9e613a7-bbc2-4250-9f3b-e13ad7e46b45 at unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock
[081] 2022-03-28 13:40:14.847 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock I> can't read row
[081] 2022-03-28 13:40:14.847 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock box.cc:2869 E> ER_LOADING: Instance bootstrap hasn't finished yet
[081] 2022-03-28 13:40:14.847 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock I> will retry every 0.10 second
[081] 2022-03-28 13:40:14.947 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock I> can't read row
[081] 2022-03-28 13:40:14.947 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock box.cc:198 E> ER_READONLY: Can't modify data on a read-only instance - state is election follower with term 2, leader is 1 (6dd0c5ad-e4f9-4d3f-adc5-25d83bc0db84)
[081] 2022-03-28 13:40:14.947 [30745] main/103/election_replica2 box.cc:198 E> ER_READONLY: Can't modify data on a read-only instance - state is election follower with term 2, leader is 1 (6dd0c5ad-e4f9-4d3f-adc5-25d83bc0db84)
[081] 2022-03-28 13:40:14.947 [30745] main/103/election_replica2 F> can't initialize storage: Can't modify data on a read-only instance - state is election follower with term 2, leader is 1 (6dd0c5ad-e4f9-4d3f-adc5-25d83bc0db84)
[081] 2022-03-28 13:40:14.947 [30745] main/103/election_replica2 F> can't initialize storage: Can't modify data on a read-only instance - state is election follower with term 2, leader is 1 (6dd0c5ad-e4f9-4d3f-adc5-25d83bc0db84)
[081] [ fail ]
Apparently, we have infinite elections: replica2 (id == 2) tried to deliver its vote to replica3 (id == 1), and after two sequential votes for replica3 it starts voting for itself. Note that replica2 keeps receiving votes from replica3, but not vice versa.
replica2:
2022-03-28 14:10:52.663 [76330] main/112/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica1.sock I> can't read row
2022-03-28 14:10:52.663 [76330] main/112/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica1.sock coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 26, aka unix/:(socket), peer of unix/:(socket): Broken pipe
2022-03-28 14:10:52.663 [76330] main/112/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica1.sock I> will retry every 0.10 second
2022-03-28 14:10:52.663 [76330] relay/unix/:(socket)/101/main coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 27, aka unix/:(socket), peer of unix/:(socket): Broken pipe
2022-03-28 14:10:52.663 [76330] relay/unix/:(socket)/101/main I> exiting the relay loop
2022-03-28 14:10:52.948 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: message {term: 4, vote: 1, state: candidate, vclock: {1: 9, 3: 3}} from 1
2022-03-28 14:10:52.948 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: bump term to 4, follow
2022-03-28 14:10:52.948 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: vote for 1, follow
2022-03-28 14:10:52.948 [76330] main/118/raft_worker I> RAFT: persisted state {term: 4, vote: 1}
2022-03-28 14:10:53.356 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: message {term: 5, vote: 1, state: candidate, vclock: {1: 9, 3: 3}} from 1
2022-03-28 14:10:53.356 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: bump term to 5, follow
2022-03-28 14:10:53.356 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: vote for 1, follow
2022-03-28 14:10:53.356 [76330] main/118/raft_worker I> RAFT: persisted state {term: 5, vote: 1}
2022-03-28 14:10:53.767 [76330] main I> RAFT: begin new election round
2022-03-28 14:10:53.767 [76330] main I> RAFT: bump term to 6, follow
2022-03-28 14:10:53.767 [76330] main I> RAFT: vote for 2, follow
2022-03-28 14:10:53.767 [76330] main/118/raft_worker I> RAFT: persisted state {term: 6, vote: 2}
2022-03-28 14:10:53.767 [76330] main/118/raft_worker I> RAFT: enter candidate state with 1 self vote
2022-03-28 14:10:53.781 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: message {term: 6, vote: 1, state: candidate, vclock: {1: 9, 3: 3}} from 1
2022-03-28 14:10:53.781 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: vote request is skipped - competing candidate
2022-03-28 14:10:54.179 [76330] main I> RAFT: begin new election round
replica3:
2022-03-28 14:10:49.513 [2385] main/116/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica2.sock I> can't read row
2022-03-28 14:10:49.513 [2385] main/116/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica2.sock xrow.c:229 E> ER_INVALID_MSGPACK: Invalid MsgPack - packet body
2022-03-28 14:10:49.587 [2385] main/127/main I> subscribed replica b5e078b0-b862-4d11-b726-7dba38835615 at fd 27, aka unix/:(socket), peer of unix/:(socket)
2022-03-28 14:10:49.587 [2385] main/127/main I> remote vclock {1: 9, 3: 3} local vclock {0: 3, 1: 9, 3: 3}
2022-03-28 14:10:49.588 [2385] main/106/gc I> wal/engine cleanup is resumed
2022-03-28 14:10:49.592 [2385] relay/unix/:(socket)/101/main I> recover from `/workspaces/t.sergos/test/var/083_replication/election_replica3/00000000000000000011.xlog'
2022-03-28 14:10:52.663 [2385] main/117/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica1.sock I> can't read row
2022-03-28 14:10:52.663 [2385] main/117/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica1.sock coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 28, aka unix/:(socket), peer of unix/:(socket): Broken pipe
2022-03-28 14:10:52.663 [2385] main/117/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica1.sock I> will retry every 0.10 second
2022-03-28 14:10:52.663 [2385] relay/unix/:(socket)/101/main coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 27, aka unix/:(socket), peer of unix/:(socket): Broken pipe
2022-03-28 14:10:52.663 [2385] relay/unix/:(socket)/101/main I> exiting the relay loop
2022-03-28 14:10:52.947 [2385] main I> RAFT: begin new election round
2022-03-28 14:10:52.947 [2385] main I> RAFT: bump term to 4, follow
2022-03-28 14:10:52.947 [2385] main I> RAFT: vote for 1, follow
2022-03-28 14:10:52.947 [2385] main/118/raft_worker I> RAFT: persisted state {term: 4, vote: 1}
2022-03-28 14:10:52.948 [2385] main/118/raft_worker I> RAFT: enter candidate state with 1 self vote
2022-03-28 14:10:53.355 [2385] main I> RAFT: begin new election round
2022-03-28 14:10:53.355 [2385] main I> RAFT: bump term to 5, follow
I've run election_qsync_stress for quite some time, but couldn't reproduce the election hang that @sergos mentioned.
The ER_READONLY error was fixed recently in the scope of tarantool/tarantool#6966.
It seems that the case with infinite elections is caused by replica3's applier failing with ER_INVALID_MSGPACK. This explains why two instances may enter the infinite election loop: replica3 doesn't receive vote requests from replica2 (its applier is dead), so replica2 never receives 2 votes. At the same time, replica3 never receives votes for itself from replica2, so replica3 never receives 2 votes either. This results in infinite elections.
What causes the ER_INVALID_MSGPACK is a mystery, though. This was fixed once in the scope of tarantool/tarantool#4040, so this might be another incarnation of that bug.
@sergos, could you please set log_level=6 in election_replica.lua and try to reproduce this once again? With log_level=6 the replica will dump the malformed row once ER_INVALID_MSGPACK occurs. This will give us a chance to understand what's causing the error.
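For reference, a minimal sketch of the suggested change, assuming election_replica.lua configures the instance through a plain box.cfg call (only the log_level line is new):
box.cfg{
    log_level = 6, -- verbose; malformed rows get hexdumped when ER_INVALID_MSGPACK occurs
    -- ... the rest of the existing instance configuration stays unchanged
}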
I've got the error, but not the infinite elections. The reason surprises me: it's in relay, not applier:
2022-04-26 11:26:13.548 [53611] main/138/main I> set 'replication_synchro_timeout' configuration option to 1000
2022-04-26 11:26:13.717 [53611] main/113/main I> subscribed replica 38acc283-910b-40db-8ecc-e5ab45764352 at fd 23, aka unix/:(socket), peer of unix/:(socket)
2022-04-26 11:26:13.717 [53611] main/113/main I> remote vclock {1: 9} local vclock {0: 2, 1: 9, 3: 2}
2022-04-26 11:26:13.718 [53611] relay/unix/:(socket)/101/main I> recover from `/workspaces/t.sergos/test/var/027_replication/election_replica2/00000000000000000005.xlog'
2022-04-26 11:26:13.718 [53611] relay/unix/:(socket)/102/main:reader V> Got a corrupted row:
2022-04-26 11:26:13.718 [53611] relay/unix/:(socket)/102/main:reader V> 00000000: 81 00 00 81 26 81 01 09 03 01
2022-04-26 11:26:13.719 [53611] relay/unix/:(socket)/101/main xrow.c:229 E> ER_INVALID_MSGPACK: Invalid MsgPack - packet body
2022-04-26 11:26:13.719 [53611] relay/unix/:(socket)/101/main I> exiting the relay loop
2022-04-26 11:26:13.761 [53611] main/114/applier/unix/:/workspaces/t.sergos/test/var/027_replication/election_replica1.sock I> subscribed
2022-04-26 11:26:13.761 [53611] main/114/applier/unix/:/workspaces/t.sergos/test/var/027_replication/election_replica1.sock I> remote vclock {1: 9, 3: 2} local vclock {0: 2, 1: 9, 3: 2}
2022-04-26 11:26:13.761 [53611] main/114/applier/unix/:/workspaces/t.sergos/test/var/027_replication/election_replica1.sock I> RAFT: message {term: 3, leader: 3, leader is seen: true, state: follower} from 1
Unfortunately, the xlog file is missing, even after I removed the test from the fragile list.
I run the tests on the dev1 box under an Ubuntu docker container, using a bind mount:
"Mounts": [
{
"Type": "bind",
"Source": "/home/s.ostanevich/workspaces",
"Destination": "/workspaces",
"Mode": "",
"RW": true,
"Propagation": "rprivate"
}
],
I will try to reproduce without docker.
The reason surprises me: it's in relay, not applier:
Wow, that's something already. Will take a look.
Reproduced on plain el7.
2022-04-26 17:32:05.853 [36183] main/113/main I> subscribed replica 92a2b565-790f-4a2c-aa4e-2264a93a1d8e at fd 29, aka unix/:(socket), peer of unix/:(socket)
2022-04-26 17:32:05.854 [36183] main/113/main I> remote vclock {1: 5} local vclock {0: 1, 1: 5}
>>>> 2022-04-26 17:32:05.857 [36183] relay/unix/:(socket)/101/main I> recover from `/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica2/00000000000000000004.xlog'
2022-04-26 17:32:05.866 [36183] main/127/main I> subscribed replica 70f0ed78-3d15-4bcc-9fd0-87ec1ba7ea20 at fd 31, aka unix/:(socket), peer of unix/:(socket)
2022-04-26 17:32:05.866 [36183] main/127/main I> remote vclock {1: 5} local vclock {0: 1, 1: 5}
2022-04-26 17:32:05.866 [36183] main/127/main I> RAFT: fencing resumed
2022-04-26 17:32:05.867 [36183] main/106/gc I> wal/engine cleanup is resumed
>>>> 2022-04-26 17:32:05.868 [36183] relay/unix/:(socket)/101/main I> recover from `/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica2/00000000000000000004.xlog'
2022-04-26 17:32:05.911 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> subscribed
2022-04-26 17:32:05.911 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> remote vclock {1: 5} local vclock {0: 1, 1: 5}
2022-04-26 17:32:05.912 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: message {term: 2, leader: 1, leader is seen: true, state: follower} from 3
2022-04-26 17:32:05.914 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: message {term: 2, leader: 1, leader is seen: true, state: follower} from 3
2022-04-26 17:32:05.914 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> leaving orphan mode
2022-04-26 17:32:07.011 [36183] main/115/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica1.sock I> can't read row
2022-04-26 17:32:07.011 [36183] main/115/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica1.sock coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 25, aka unix/:(socket), peer of unix/:(socket): Broken pipe
2022-04-26 17:32:07.011 [36183] main/115/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica1.sock I> will retry every 0.10 second
>>>> 2022-04-26 17:32:07.011 [36183] relay/unix/:(socket)/101/main coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 31, aka unix/:(socket), peer of unix/:(socket): Broken pipe
2022-04-26 17:32:07.011 [36183] relay/unix/:(socket)/101/main I> exiting the relay loop
2022-04-26 17:32:07.353 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: message {term: 2, leader: 1, state: follower} from 3
2022-04-26 17:32:07.353 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: begin new election round
2022-04-26 17:32:07.353 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: bump term to 3, follow
2022-04-26 17:32:07.353 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: vote for 2, follow
2022-04-26 17:32:07.353 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: message {term: 3, vote: 3, state: candidate, vclock: {1: 9}} from 3
2022-04-26 17:32:07.353 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: vote request is skipped - already voted in this term
2022-04-26 17:32:07.353 [36183] main/118/raft_worker I> RAFT: persisted state {term: 3, vote: 2}
2022-04-26 17:32:07.353 [36183] main/118/raft_worker I> RAFT: enter candidate state with 1 self vote
2022-04-26 17:32:07.766 [36183] main I> RAFT: begin new election round
2022-04-26 17:32:07.766 [36183] main I> RAFT: bump term to 4, follow
2022-04-26 17:32:07.766 [36183] main I> RAFT: vote for 2, follow
2022-04-26 17:32:07.766 [36183] main/118/raft_worker I> RAFT: persisted state {term: 4, vote: 2}
2022-04-26 17:32:07.766 [36183] main/118/raft_worker I> RAFT: enter candidate state with 1 self vote
2022-04-26 17:32:07.767 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: message {term: 4, vote: 2, state: follower} from 3
2022-04-26 17:32:07.767 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: enter leader state with quorum 2
>>>> 2022-04-26 17:32:07.767 [36183] relay/unix/:(socket)/102/main:reader V> Got a corrupted row:
2022-04-26 17:32:07.767 [36183] relay/unix/:(socket)/102/main:reader V> 00000000: 81 00 00 81 26 81 01 09 02 01
2022-04-26 17:32:07.767 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: message {term: 4, vote: 2, leader: 2, leader is seen: true, state: follower} from 3
2022-04-26 17:32:07.767 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: vote request is skipped - the leader is already known - 2
>>>> 2022-04-26 17:32:07.767 [36183] relay/unix/:(socket)/101/main xrow.c:229 E> ER_INVALID_MSGPACK: Invalid MsgPack - packet body
2022-04-26 17:32:07.767 [36183] relay/unix/:(socket)/101/main I> exiting the relay loop
Found the issue together with @sergos: tarantool/tarantool#7089
Seems like we've fixed all the known issues with this test. No failures on my machine after 1000 runs. Closing.
Tarantool version: Tarantool 2.6.0-143-g0dc72812fb
Target: Linux-x86_64-Debug
Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr/local -DENABLE_BACKTRACE=ON
Compiler: /usr/bin/cc /usr/bin/c++
C_FLAGS: -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -fprofile-arcs -ftest-coverage -std=c11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-gnu-alignof-expression -fno-gnu89-inline -Wno-cast-function-type -Werror
CXX_FLAGS: -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -fprofile-arcs -ftest-coverage -std=c++11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-invalid-offsetof -Wno-gnu-alignof-expression -Wno-cast-function-type -Werror
OS version:
Bug description: Found 2 issues:
artifacts.zip
results file checksum: cc93d7c69c6368217634718bdf3de16c
artifacts.zip
results file checksum: 3fb2e6cef4c8fa1d0edd8654fd2d8ef6
artifacts.zip
results file checksum: 634bda94accdcdef7b1db3e14f28f445
4. https://gitlab.com/tarantool/tarantool/-/jobs/795376806#L5359
artifacts.zip
results file checksum: 36bcdae426c18a60fd13025c09f197d0
5. https://gitlab.com/tarantool/tarantool/-/jobs/795916093#L5738
artifacts.zip
results file checksum: 209c865525154a91435c63850f15eca0