tarantool / tarantool-qa

QA related issues of Tarantool

test: flaky replication/election_qsync_stress.test.lua test #166

Closed: avtikhon closed this issue 2 years ago

avtikhon commented 4 years ago

Tarantool version: Tarantool 2.6.0-143-g0dc72812fb
Target: Linux-x86_64-Debug
Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr/local -DENABLE_BACKTRACE=ON
Compiler: /usr/bin/cc /usr/bin/c++
C_FLAGS: -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -fprofile-arcs -ftest-coverage -std=c11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-gnu-alignof-expression -fno-gnu89-inline -Wno-cast-function-type -Werror
CXX_FLAGS: -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -fprofile-arcs -ftest-coverage -std=c++11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-invalid-offsetof -Wno-gnu-alignof-expression -Wno-cast-function-type -Werror

OS version:

  1. Linux (Debian 9)
  2. FreeBSD 12

Bug description: found the following issues:

  1. https://gitlab.com/tarantool/tarantool/-/jobs/779113169#L5139

artifacts.zip

results file checksum: cc93d7c69c6368217634718bdf3de16c

[035] replication/election_qsync.test.lua             vinyl           
[035] TarantoolInpector.handle() received the following error:
[035] Traceback (most recent call last):
[035]   File "/builds/M4RrgQZ3/0/tarantool/tarantool/test-run/lib/inspector.py", line 94, in handle
[035]     result = self.parser.parse_preprocessor(line)
[035]   File "/builds/M4RrgQZ3/0/tarantool/tarantool/test-run/lib/preprocessor.py", line 73, in parse_preprocessor
[035]     return self.lua_eval(name, expr[1:-1])
[035]   File "/builds/M4RrgQZ3/0/tarantool/tarantool/test-run/lib/preprocessor.py", line 388, in lua_eval
[035]     self.servers[name].admin.reconnect()
[035] KeyError: 'election_replica0'
[035] [ fail ]
[035] 
[035] Test failed! Result content mismatch:
[035] --- replication/election_qsync.result Thu Oct  8 00:07:13 2020
[035] +++ replication/election_qsync.reject Thu Oct  8 04:11:02 2020
[035] @@ -102,27 +102,3 @@
[035]      old_leader_nr = new_leader_nr
[035]      old_leader = new_leader
[035]  end;
[035] - | ---
[035] - | ...
[035] -test_run:cmd('setopt delimiter ""');
[035] - | ---
[035] - | - true
[035] - | ...
[035] --- We're connected to some leader.
[035] -c.space.test:select{}
[035] - | ---
[035] - | - - [1]
[035] - |   - [2]
[035] - |   - [3]
[035] - |   - [4]
[035] - |   - [5]
[035] - |   - [6]
[035] - |   - [7]
[035] - |   - [8]
[035] - |   - [9]
[035] - |   - [10]
[035] - | ...
[035] -
[035] -test_run:drop_cluster(SERVERS)
[035] - | ---
[035] - | ...
[035] 
  2. https://gitlab.com/tarantool/tarantool/-/jobs/779113714#L5729

artifacts.zip

results file checksum: 3fb2e6cef4c8fa1d0edd8654fd2d8ef6

[021] replication/election_qsync.test.lua             vinyl           
[021] TarantoolInpector.handle() received the following error:
[021] Traceback (most recent call last):
[021]   File "/home/vagrant/tarantool/test-run/lib/inspector.py", line 94, in handle
[021]     result = self.parser.parse_preprocessor(line)
[021]   File "/home/vagrant/tarantool/test-run/lib/preprocessor.py", line 106, in parse_preprocessor
[021]     return self.server(stype, sname, options)
[021]   File "/home/vagrant/tarantool/test-run/lib/preprocessor.py", line 338, in server
[021]     return getattr(self, attr)(ctype, sname, opts)
[021]   File "/home/vagrant/tarantool/test-run/lib/preprocessor.py", line 218, in server_stop
[021]     self.connections[sname].disconnect()
[021] KeyError: 'election_replica2'
[021] [ fail ]
[021] 
[021] Test failed! Result content mismatch:
[021] --- replication/election_qsync.result Thu Oct  8 04:07:26 2020
[021] +++ replication/election_qsync.reject Fri May  8 08:24:37 2020
[021] @@ -103,6 +103,7 @@
[021]      old_leader = new_leader
[021]  end;
[021]   | ---
[021] + | - error: Can't modify data because this instance is in read-only mode.
[021]   | ...
[021]  test_run:cmd('setopt delimiter ""');
[021]   | ---
[021] @@ -113,16 +114,6 @@
[021]   | ---
[021]   | - - [1]
[021]   |   - [2]
[021] - |   - [3]
[021] - |   - [4]
[021] - |   - [5]
[021] - |   - [6]
[021] - |   - [7]
[021] - |   - [8]
[021] - |   - [9]
[021] - |   - [10]
[021]   | ...
[021]  
[021]  test_run:drop_cluster(SERVERS)
[021] - | ---
[021] - | ...
[021] 
  3. https://gitlab.com/tarantool/tarantool/-/jobs/795044495#L5273

artifacts.zip

results file checksum: 634bda94accdcdef7b1db3e14f28f445

[155] replication/election_qsync_stress.test.lua      memtx           
[155] TarantoolInpector.handle() received the following error:
[155] Traceback (most recent call last):
[155]   File "/builds/4BUsapPU/0/tarantool/tarantool/test-run/lib/inspector.py", line 94, in handle
[155]     result = self.parser.parse_preprocessor(line)
[155]   File "/builds/4BUsapPU/0/tarantool/tarantool/test-run/lib/preprocessor.py", line 106, in parse_preprocessor
[155]     return self.server(stype, sname, options)
[155]   File "/builds/4BUsapPU/0/tarantool/tarantool/test-run/lib/preprocessor.py", line 338, in server
[155]     return getattr(self, attr)(ctype, sname, opts)
[155]   File "/builds/4BUsapPU/0/tarantool/tarantool/test-run/lib/preprocessor.py", line 218, in server_stop
[155]     self.connections[sname].disconnect()
[155] KeyError: 'election_replica3'
[155] [ fail ]
[155] 
[155] Test failed! Result content mismatch:
[155] --- replication/election_qsync_stress.result  Fri Oct 16 11:31:28 2020
[155] +++ replication/election_qsync_stress.reject  Fri Oct 16 12:12:35 2020
[155] @@ -102,6 +102,7 @@
[155]      old_leader = new_leader
[155]  end;
[155]   | ---
[155] + | - error: Can't modify data because this instance is in read-only mode.
[155]   | ...
[155]  test_run:cmd('setopt delimiter ""');
[155]   | ---
[155] @@ -110,9 +111,7 @@
[155]  -- We're connected to some leader.
[155]  #c.space.test:select{} == 10 or require('log').error(c.space.test:select{})
[155]   | ---
[155] - | - true
[155] + | - null
[155]   | ...
[155]  
[155]  test_run:drop_cluster(SERVERS)
[155] - | ---
[155] - | ...
[155] 
  4. https://gitlab.com/tarantool/tarantool/-/jobs/795376806#L5359

artifacts.zip

results file checksum: 36bcdae426c18a60fd13025c09f197d0

[164] --- replication/election_qsync_stress.result  Fri Oct 16 14:39:00 2020
[164] +++ replication/election_qsync_stress.reject  Fri Oct 16 14:48:14 2020
[164] @@ -110,7 +110,7 @@
[164]  -- We're connected to some leader.
[164]  #c.space.test:select{} == 10 or require('log').error(c.space.test:select{})
[164]   | ---
[164] - | - true
[164] + | - null
[164]   | ...
[164]  
[164]  test_run:drop_cluster(SERVERS)
[164] 

  5. https://gitlab.com/tarantool/tarantool/-/jobs/795916093#L5738

artifacts.zip

results file checksum: 209c865525154a91435c63850f15eca0

[027] --- replication/election_qsync_stress.result  Fri Oct 16 19:25:53 2020
[027] +++ replication/election_qsync_stress.reject  Fri May  8 08:23:59 2020
[027] @@ -102,6 +102,7 @@
[027]      old_leader = new_leader
[027]  end;
[027]   | ---
[027] + | - error: Found uncommitted sync transactions from other instance with id 2
[027]   | ...
[027]  test_run:cmd('setopt delimiter ""');
[027]   | ---
[027] @@ -110,9 +111,7 @@
[027]  -- We're connected to some leader.
[027]  #c.space.test:select{} == 10 or require('log').error(c.space.test:select{})
[027]   | ---
[027] - | - true
[027] + | - null
[027]   | ...
[027]  
[027]  test_run:drop_cluster(SERVERS)
[027] - | ---
[027] - | ...
[027] 


PersDep commented 4 years ago

It also catches an assertion from time to time:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f0cf07da8b1 in __GI_abort () at abort.c:79
#2  0x00007f0cf07ca42a in __assert_fail_base (fmt=0x7f0cf0951a38 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x55e86001da89 "lsn > limbo->confirmed_lsn",
    file=file@entry=0x55e86001d708 "/home/kosarev/tarantool/src/box/txn_limbo.c", line=line@entry=340, function=function@entry=0x55e86001dc40 <__PRETTY_FUNCTION__.12489> "txn_limbo_write_confirm")
    at assert.c:92
#3  0x00007f0cf07ca4a2 in __GI___assert_fail (assertion=0x55e86001da89 "lsn > limbo->confirmed_lsn", file=0x55e86001d708 "/home/kosarev/tarantool/src/box/txn_limbo.c", line=340,
    function=0x55e86001dc40 <__PRETTY_FUNCTION__.12489> "txn_limbo_write_confirm") at assert.c:101
#4  0x000055e85fbdb753 in txn_limbo_write_confirm (limbo=0x55e860416a60 <txn_limbo>, lsn=9) at /home/kosarev/tarantool/src/box/txn_limbo.c:340
#5  0x000055e85fbdc328 in txn_limbo_force_empty (limbo=0x55e860416a60 <txn_limbo>, confirm_lsn=9) at /home/kosarev/tarantool/src/box/txn_limbo.c:622
#6  0x000055e85fbe291e in box_clear_synchro_queue () at /home/kosarev/tarantool/src/box/box.cc:1071
#7  0x000055e85fcb5cd3 in lbox_ctl_clear_synchro_queue (L=0x4151c0b8) at /home/kosarev/tarantool/src/box/lua/ctl.c:85
#8  0x000055e85fe400cb in lj_BC_FUNCC ()
#9  0x000055e85fdfb84b in jit_secure_call (L=0x4151c0b8, base=0x4151c118, nres=0) at lj_api.c:88
#10 0x000055e85fe02ae0 in lua_call (L=0x4151c0b8, nargs=0, nresults=-1) at lj_api.c:1133
#11 0x000055e85fca3b94 in execute_lua_eval (L=0x4151c0b8) at /home/kosarev/tarantool/src/box/lua/call.c:394
#12 0x000055e85fe400cb in lj_BC_FUNCC ()
#13 0x000055e85fe02e15 in lua_pcall (L=0x4151c0b8, nargs=1, nresults=-1, errfunc=0) at lj_api.c:1158
#14 0x000055e85fd87bd6 in luaT_call (L=0x4151c0b8, nargs=1, nreturns=-1) at /home/kosarev/tarantool/src/lua/utils.c:1016
#15 0x000055e85fca42e1 in box_process_lua (handler=HANDLER_EVAL, ctx=0x7f0ce79ffd10, ret=0x7f0ce79ffe60) at /home/kosarev/tarantool/src/box/lua/call.c:570
#16 0x000055e85fca43ff in box_lua_eval (expr=0x7f0ce701803d "box.ctl.clear_synchro_queue()!\220\021", expr_len=29, args=0x7f0ce79ffd80, ret=0x7f0ce79ffe60)
    at /home/kosarev/tarantool/src/box/lua/call.c:614
#17 0x000055e85fc130bb in box_process_eval (request=0x7f0ce7010430, port=0x7f0ce79ffe60) at /home/kosarev/tarantool/src/box/call.c:187
#18 0x000055e85fcc75f4 in tx_process_call (m=0x7f0ce70103b0) at /home/kosarev/tarantool/src/box/iproto.cc:1632
#19 0x000055e85fdb062e in cmsg_deliver (msg=0x7f0ce70103b0) at /home/kosarev/tarantool/src/lib/core/cbus.c:375
#20 0x000055e85fdb170a in fiber_pool_f (ap=0x7f0ced801008) at /home/kosarev/tarantool/src/lib/core/fiber_pool.c:64
#21 0x000055e85fba0269 in fiber_cxx_invoke(fiber_func, typedef __va_list_tag __va_list_tag *) (f=0x55e85fdb156a <fiber_pool_f>, ap=0x7f0ced801008) at /home/kosarev/tarantool/src/lib/core/fiber.h:788
#22 0x000055e85fdaae15 in fiber_loop (data=0x0) at /home/kosarev/tarantool/src/lib/core/fiber.c:879
#23 0x000055e86001556d in coro_init () at /home/kosarev/tarantool/third_party/coro/coro.c:110

Gerold103 commented 4 years ago

A minimal 100% reproducer: follow the steps in the specified order on the specified instances. However, it is not related to why the test sometimes hangs; that I don't know.

-- Instance 1
-- Step 1
fiber = require('fiber')
box.cfg{
    listen = 3313,
    replication = {'localhost:3313', 'localhost:3314'},
    replication_synchro_quorum = 2,
    replication_synchro_timeout = 1000000,
    read_only = false,
}
box.schema.user.grant('guest', 'super')
s1 = box.schema.create_space('test1', {is_sync = true})
_ = s1:create_index('pk')

-- Step 3
for i = 1, 10 do s1:replace{i} end

-- Step 5
box.cfg{
    replication_synchro_timeout = 0.001,
}
box.ctl.clear_synchro_queue()

-- Instance 2
-- Step 2
box.cfg{
    listen = 3314,
    replication = {'localhost:3313', 'localhost:3314'},
    replication_synchro_quorum = 3,
    replication_synchro_timeout = 1000000,
    read_only = true,
}

-- Step 4
box.cfg{read_only = false}
box.space.test1:replace{11}

Totktonada commented 4 years ago

I didn't look deeply, but I see that failures of the test (all kinds of them except the rare one with the error "Found uncommitted sync transactions from other instance with id <..>") are accompanied by the "RAFT: vote request is skipped - the vclock is not acceptable" log message, while successful runs do not have such a log message most of the time (but not always).

A minimal 100% reproducer. Follow the steps in the specified order on the specified instances.

I don't see anything that would look like an error or inconsistency in the space data on 2.7.0-2-g66bb1a507 and 2.7.0-32-g77c218cc1 (debug builds). Maybe I'm doing something wrong (I just copy-pasted the snippets into two consoles in the specified order).

It also catches an assertion from time to time.

Didn't catch it (I'm on a debug build of version 2.7.0-16-gbc6caac60, on Linux).

Gerold103 commented 3 years ago

You are too late. The reproducer was for the crash, which I already fixed in https://github.com/tarantool/tarantool/commit/41ba1479abf630b378c3712ef52aad7a823f1b42. Unfortunately, the issue is clogged with tons of "avtikhon added a commit that referenced this issue yesterday" messages, so you could have missed my commit above.

The flakiness is not related to the crash, so I didn't close the issue.

Gerold103 commented 2 years ago

Cases 1 and 2 are not present in the election_qsync.test.lua file at all. They existed in election_qsync_stress.test.lua in the beginning, but in https://github.com/tarantool/tarantool/commit/eb989bc7c68fa3be0b8dcfc8566a497f4013708a they were merged with other failures having the same reason.

Cases 4 and 5 were failing because the test didn't wait for the old leader to send all data to the other nodes. As a result, they could have less data than needed, and the space didn't have enough tuples at the end of the test. This seems to be fixed by https://github.com/tarantool/tarantool/commit/bf0fbf3a05c5db05be65fb1bce8b1ed9b959e607.

Cases 3 and 4 (the latter had 2 issues) seem to be caused by old code relying on box.info.election.role == 'leader' being enough to be writable. That is still a problem (see tarantool/tarantool#6689), but the test got a workaround in this commit: https://github.com/tarantool/tarantool/commit/71377c28e1c20108a8691481660bea6263f9a2e8.
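
For illustration, a minimal sketch of the kind of wait such a workaround relies on (an assumption about the idea, not the actual code of that commit): being elected is not enough, the instance must also leave read-only mode before the test writes to it.

-- Hypothetical wait executed on the newly elected leader; box.info.ro only
-- becomes false once the instance is really writable.
local fiber = require('fiber')
while box.info.ro do
    fiber.sleep(0.01)
end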

I have run both election_qsync.test.lua and its stress version on my machine for half an hour and couldn't get any failures. It probably still fails in CI, but I believe for other reasons that are no longer related to this ticket.

To summarize: I think @sergepetrenko fixed this ticket quite some time ago in the commits above. Sergey, can you try it on your machine too? @sergos, you could also try (Sergey P. and I are both on Macs, I think, so we could use a person with Linux).

Gerold103 commented 2 years ago

The most recent new error on master that is not yet in this ticket I see here: https://github.com/tarantool/tarantool/runs/5590105468?check_suite_focus=true. I have no idea what might be causing it; apparently it fails on bootstrap.

sergos commented 2 years ago

After running ./test-run.py $(yes election_qsync | head -100) a couple of times I got:

[141] replication/election_qsync_stress.test.lua      vinyl           
[141] * [QA Notice]
[141] *
[141] * Attempt to evaluate a command on the nonexistent server 'election_replica0'
[141] *
[141] 
[141] * [QA Notice]
[141] *
[141] * Attempt to stop already stopped server 'election_replica1'
[141] *
[141] [ fail ]
[141] 
[141] Test failed! Result content mismatch:
[141] --- replication/election_qsync_stress.result      Fri Feb 11 13:34:49 2022
[141] +++ var/rejects/replication/election_qsync_stress.reject  Wed Mar 23 18:26:30 2022
[141] @@ -111,6 +111,7 @@
[141]      old_leader = new_leader
[141]  end;
[141]   | ---
[141] + | - error: Attempt to evaluate a command on the nonexistent server 'election_replica0'
[141]   | ...
[141]  test_run:cmd('setopt delimiter ""');
[141]   | ---
[141] @@ -119,7 +120,7 @@
[141]  -- We're connected to some leader.
[141]  #c.space.test:select{} == 10 or require('log').error(c.space.test:select{})
[141]   | ---
[141] - | - true
[141] + | - error: Peer closed
[141]   | ...
[141]  
[141]  test_run:drop_cluster(SERVERS)
[141] 
[141] Last 15 lines of Tarantool Log file [Instance "master"][/workspaces/t.sergos/test/var/141_replication/master.log]:
[141] 2022-03-23 18:25:18.773 [15158] main/121/console/unix/: I> election_replica1 -> election_replica2: waiting for connection
[141] 2022-03-23 18:25:18.792 [15158] main/121/console/unix/: I> connecting election_replica1 -> election_replica2: {"status":"follow"}
[141] 2022-03-23 18:25:18.792 [15158] main/121/console/unix/: I> election_replica1 -> election_replica2: connected
[141] 2022-03-23 18:25:18.792 [15158] main/121/console/unix/: I> election_replica3 -> election_replica2: waiting for connection
[141] 2022-03-23 18:25:18.816 [15158] main/121/console/unix/: I> connecting election_replica3 -> election_replica2: {"status":"follow"}
[141] 2022-03-23 18:25:18.816 [15158] main/121/console/unix/: I> election_replica3 -> election_replica2: connected
[141] 2022-03-23 18:25:18.816 [15158] main/121/console/unix/: I> election_replica3: waiting bootstrap
[141] 2022-03-23 18:25:18.831 [15158] main/121/console/unix/: I> election_replica3: bootstrapped
[141] 2022-03-23 18:25:18.831 [15158] main/121/console/unix/: I> election_replica1 -> election_replica3: waiting for connection
[141] 2022-03-23 18:25:18.862 [15158] main/121/console/unix/: I> connecting election_replica1 -> election_replica3: {"status":"follow"}
[141] 2022-03-23 18:25:18.862 [15158] main/121/console/unix/: I> election_replica1 -> election_replica3: connected
[141] 2022-03-23 18:25:18.862 [15158] main/121/console/unix/: I> election_replica2 -> election_replica3: waiting for connection
[141] 2022-03-23 18:25:18.877 [15158] main/121/console/unix/: I> connecting election_replica2 -> election_replica3: {"status":"follow"}
[141] 2022-03-23 18:25:18.878 [15158] main/121/console/unix/: I> election_replica2 -> election_replica3: connected
[141] 2022-03-23 18:25:18.878 [15158] main/121/console/unix/: I> full mesh connected
[141] Test "replication/election_qsync_stress.test.lua", conf: "vinyl"
[141]   from "fragile" list failed, rerunning ...

This failure apparently has one way to appear:

    new_leader_nr = get_leader(nrs)
    new_leader = 'election_replica'..new_leader_nr
    leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]

or in get_leader() itself, where 0 can be returned and then used in the eval.
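
To make the failure mode concrete, here is a minimal sketch (hypothetical, not the test's actual code) of how the lookup could be guarded so that a transient 0 from get_leader() is retried instead of being concatenated into the nonexistent name 'election_replica0':

-- Hypothetical guard around the snippet above; get_leader(), nrs and
-- test_run are the helpers/variables already used by the test.
local fiber = require('fiber')
local new_leader_nr = get_leader(nrs)
while new_leader_nr == 0 do      -- no leader elected yet, retry
    fiber.sleep(0.1)
    new_leader_nr = get_leader(nrs)
end
new_leader = 'election_replica' .. new_leader_nr
leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]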

sergos commented 2 years ago

Got another failure:

[081] replication/election_qsync_stress.test.lua      memtx           
[081] 
[081] [Instance "election_replica2" returns with non-zero exit code: 1]
[081] 
[081] Last 15 lines of Tarantool Log file [Instance "election_replica2"][/workspaces/t.sergos/test/var/081_replication/election_replica2.log]:
[081] 2022-03-28 13:40:14.541 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock sio.c:210 E> SocketError: connect to unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock, called on fd 26, aka unix/:(socket): No such file or directory
[081] 2022-03-28 13:40:14.541 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock I> will retry every 0.10 second
[081] 2022-03-28 13:40:14.541 [30745] main/113/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica1.sock I> remote master 6dd0c5ad-e4f9-4d3f-adc5-25d83bc0db84 at unix/:/workspaces/t.sergos/test/var/081_replication/election_replica1.sock running Tarantool 2.10.0
[081] 2022-03-28 13:40:14.543 [30745] main/114/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica2.sock I> remote master afebf07a-a327-4890-8eaa-7f8cfc159eaa at unix/:/workspaces/t.sergos/test/var/081_replication/election_replica2.sock running Tarantool 2.10.0
[081] 2022-03-28 13:40:14.845 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock I> remote master e9e613a7-bbc2-4250-9f3b-e13ad7e46b45 at unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock running Tarantool 2.10.0
[081] 2022-03-28 13:40:14.846 [30745] main/103/election_replica2 I> connected to 3 replicas
[081] 2022-03-28 13:40:14.847 [30745] main/103/election_replica2 I> bootstrapping replica from e9e613a7-bbc2-4250-9f3b-e13ad7e46b45 at unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock
[081] 2022-03-28 13:40:14.847 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock I> can't read row
[081] 2022-03-28 13:40:14.847 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock box.cc:2869 E> ER_LOADING: Instance bootstrap hasn't finished yet
[081] 2022-03-28 13:40:14.847 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock I> will retry every 0.10 second
[081] 2022-03-28 13:40:14.947 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock I> can't read row
[081] 2022-03-28 13:40:14.947 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock box.cc:198 E> ER_READONLY: Can't modify data on a read-only instance - state is election follower with term 2, leader is 1 (6dd0c5ad-e4f9-4d3f-adc5-25d83bc0db84)
[081] 2022-03-28 13:40:14.947 [30745] main/103/election_replica2 box.cc:198 E> ER_READONLY: Can't modify data on a read-only instance - state is election follower with term 2, leader is 1 (6dd0c5ad-e4f9-4d3f-adc5-25d83bc0db84)
[081] 2022-03-28 13:40:14.947 [30745] main/103/election_replica2 F> can't initialize storage: Can't modify data on a read-only instance - state is election follower with term 2, leader is 1 (6dd0c5ad-e4f9-4d3f-adc5-25d83bc0db84)
[081] 2022-03-28 13:40:14.947 [30745] main/103/election_replica2 F> can't initialize storage: Can't modify data on a read-only instance - state is election follower with term 2, leader is 1 (6dd0c5ad-e4f9-4d3f-adc5-25d83bc0db84)
[081] [ fail ]
sergos commented 2 years ago

or in get_leader() itself, where 0 can be returned and then used in the eval.

Apparently, we have infinite elections: replica2 (id == 2) tried to deliver its vote to replica3 (id == 1), and after two sequential votes for replica3 it starts to vote for itself. Note that replica2 keeps receiving votes from replica3, but not vice versa.

replica2:
2022-03-28 14:10:52.663 [76330] main/112/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica1.sock I> can't read row
2022-03-28 14:10:52.663 [76330] main/112/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica1.sock coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 26, aka unix/:(socket), peer of unix/:(socket): Broken pipe
2022-03-28 14:10:52.663 [76330] main/112/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica1.sock I> will retry every 0.10 second
2022-03-28 14:10:52.663 [76330] relay/unix/:(socket)/101/main coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 27, aka unix/:(socket), peer of unix/:(socket): Broken pipe
2022-03-28 14:10:52.663 [76330] relay/unix/:(socket)/101/main I> exiting the relay loop
2022-03-28 14:10:52.948 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: message {term: 4, vote: 1, state: candidate, vclock: {1: 9, 3: 3}} from 1
2022-03-28 14:10:52.948 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: bump term to 4, follow
2022-03-28 14:10:52.948 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: vote for 1, follow
2022-03-28 14:10:52.948 [76330] main/118/raft_worker I> RAFT: persisted state {term: 4, vote: 1}
2022-03-28 14:10:53.356 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: message {term: 5, vote: 1, state: candidate, vclock: {1: 9, 3: 3}} from 1
2022-03-28 14:10:53.356 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: bump term to 5, follow
2022-03-28 14:10:53.356 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: vote for 1, follow
2022-03-28 14:10:53.356 [76330] main/118/raft_worker I> RAFT: persisted state {term: 5, vote: 1}
2022-03-28 14:10:53.767 [76330] main I> RAFT: begin new election round
2022-03-28 14:10:53.767 [76330] main I> RAFT: bump term to 6, follow
2022-03-28 14:10:53.767 [76330] main I> RAFT: vote for 2, follow 
2022-03-28 14:10:53.767 [76330] main/118/raft_worker I> RAFT: persisted state {term: 6, vote: 2}
2022-03-28 14:10:53.767 [76330] main/118/raft_worker I> RAFT: enter candidate state with 1 self vote
2022-03-28 14:10:53.781 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: message {term: 6, vote: 1, state: candidate, vclock: {1: 9, 3: 3}} from 1
2022-03-28 14:10:53.781 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: vote request is skipped - competing candidate
2022-03-28 14:10:54.179 [76330] main I> RAFT: begin new election round

replica3:
2022-03-28 14:10:49.513 [2385] main/116/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica2.sock I> can't read row
2022-03-28 14:10:49.513 [2385] main/116/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica2.sock xrow.c:229 E> ER_INVALID_MSGPACK: Invalid MsgPack - packet body
2022-03-28 14:10:49.587 [2385] main/127/main I> subscribed replica b5e078b0-b862-4d11-b726-7dba38835615 at fd 27, aka unix/:(socket), peer of unix/:(socket)
2022-03-28 14:10:49.587 [2385] main/127/main I> remote vclock {1: 9, 3: 3} local vclock {0: 3, 1: 9, 3: 3}
2022-03-28 14:10:49.588 [2385] main/106/gc I> wal/engine cleanup is resumed
2022-03-28 14:10:49.592 [2385] relay/unix/:(socket)/101/main I> recover from `/workspaces/t.sergos/test/var/083_replication/election_replica3/00000000000000000011.xlog'
2022-03-28 14:10:52.663 [2385] main/117/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica1.sock I> can't read row
2022-03-28 14:10:52.663 [2385] main/117/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica1.sock coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 28, aka unix/:(socket), peer of unix/:(socket): Broken pipe
2022-03-28 14:10:52.663 [2385] main/117/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica1.sock I> will retry every 0.10 second
2022-03-28 14:10:52.663 [2385] relay/unix/:(socket)/101/main coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 27, aka unix/:(socket), peer of unix/:(socket): Broken pipe
2022-03-28 14:10:52.663 [2385] relay/unix/:(socket)/101/main I> exiting the relay loop
2022-03-28 14:10:52.947 [2385] main I> RAFT: begin new election round
2022-03-28 14:10:52.947 [2385] main I> RAFT: bump term to 4, follow
2022-03-28 14:10:52.947 [2385] main I> RAFT: vote for 1, follow
2022-03-28 14:10:52.947 [2385] main/118/raft_worker I> RAFT: persisted state {term: 4, vote: 1}
2022-03-28 14:10:52.948 [2385] main/118/raft_worker I> RAFT: enter candidate state with 1 self vote
2022-03-28 14:10:53.355 [2385] main I> RAFT: begin new election round
2022-03-28 14:10:53.355 [2385] main I> RAFT: bump term to 5, follow
sergepetrenko commented 2 years ago

I've run election_qsync_stress for quite some time but couldn't reproduce the election hang that @sergos mentioned.

The ER_READONLY error was fixed recently in the scope of tarantool/tarantool#6966.

It seems that the case with infinite elections is caused by replica3's applier failing with ER_INVALID_MSGPACK. This explains why two instances may enter the infinite election loop: replica3 doesn't receive vote requests from replica2 (its applier is dead), so replica2 never receives 2 votes. At the same time, replica3 never receives votes for itself from replica2, so replica3 never receives 2 votes either. This results in infinite elections.

What causes the ER_INVALID_MSGPACK is a mystery, though. It was fixed in the scope of tarantool/tarantool#4040, so this might be another incarnation of that bug. @sergos, could you please set log_level = 6 in election_replica.lua and try to reproduce this once again? With log_level = 6 the replica will dump the malformed row once ER_INVALID_MSGPACK occurs. This will give us a chance to understand what's causing the error.
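
For reference, a minimal sketch of that change, assuming election_replica.lua configures the instance through box.cfg (as the instance files in the replication suite usually do); the surrounding options are omitted here:

-- Hypothetical excerpt of election_replica.lua: raise the log verbosity so
-- the malformed row is dumped when ER_INVALID_MSGPACK occurs.
box.cfg{
    log_level = 6, -- verbose
    -- ... the rest of the instance configuration stays as it was ...
}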

sergos commented 2 years ago

I've got the error, but not the infinite elections. The reason surprises me: it's in the relay, not the applier:

2022-04-26 11:26:13.548 [53611] main/138/main I> set 'replication_synchro_timeout' configuration option to 1000
2022-04-26 11:26:13.717 [53611] main/113/main I> subscribed replica 38acc283-910b-40db-8ecc-e5ab45764352 at fd 23, aka unix/:(socket), peer of unix/:(socket)
2022-04-26 11:26:13.717 [53611] main/113/main I> remote vclock {1: 9} local vclock {0: 2, 1: 9, 3: 2}
2022-04-26 11:26:13.718 [53611] relay/unix/:(socket)/101/main I> recover from `/workspaces/t.sergos/test/var/027_replication/election_replica2/00000000000000000005.xlog'
2022-04-26 11:26:13.718 [53611] relay/unix/:(socket)/102/main:reader V> Got a corrupted row:
2022-04-26 11:26:13.718 [53611] relay/unix/:(socket)/102/main:reader V> 00000000: 81 00 00 81 26 81 01 09 03 01
2022-04-26 11:26:13.719 [53611] relay/unix/:(socket)/101/main xrow.c:229 E> ER_INVALID_MSGPACK: Invalid MsgPack - packet body
2022-04-26 11:26:13.719 [53611] relay/unix/:(socket)/101/main I> exiting the relay loop
2022-04-26 11:26:13.761 [53611] main/114/applier/unix/:/workspaces/t.sergos/test/var/027_replication/election_replica1.sock I> subscribed
2022-04-26 11:26:13.761 [53611] main/114/applier/unix/:/workspaces/t.sergos/test/var/027_replication/election_replica1.sock I> remote vclock {1: 9, 3: 2} local vclock {0: 2, 1: 9, 3: 2}
2022-04-26 11:26:13.761 [53611] main/114/applier/unix/:/workspaces/t.sergos/test/var/027_replication/election_replica1.sock I> RAFT: message {term: 3, leader: 3, leader is seen: true, state: follower} from 1

Unfortunately, the xlog file is missing, even after I removed the test from the fragile list. I run the tests on the dev1 box under an Ubuntu docker container, using a bind-type mount:

        "Mounts": [
            {
                "Type": "bind",
                "Source": "/home/s.ostanevich/workspaces",
                "Destination": "/workspaces",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            }
        ],

I will try to reproduce without docker.

sergepetrenko commented 2 years ago

The reason surprises me: it's in the relay, not the applier:

Wow, that's something already. Will take a look.

sergos commented 2 years ago

Reproduced on plain el7.

2022-04-26 17:32:05.853 [36183] main/113/main I> subscribed replica 92a2b565-790f-4a2c-aa4e-2264a93a1d8e at fd 29, aka unix/:(socket), peer of unix/:(socket)
2022-04-26 17:32:05.854 [36183] main/113/main I> remote vclock {1: 5} local vclock {0: 1, 1: 5}
>>>> 2022-04-26 17:32:05.857 [36183] relay/unix/:(socket)/101/main I> recover from `/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica2/00000000000000000004.xlog'
2022-04-26 17:32:05.866 [36183] main/127/main I> subscribed replica 70f0ed78-3d15-4bcc-9fd0-87ec1ba7ea20 at fd 31, aka unix/:(socket), peer of unix/:(socket)
2022-04-26 17:32:05.866 [36183] main/127/main I> remote vclock {1: 5} local vclock {0: 1, 1: 5}
2022-04-26 17:32:05.866 [36183] main/127/main I> RAFT: fencing resumed
2022-04-26 17:32:05.867 [36183] main/106/gc I> wal/engine cleanup is resumed
>>>> 2022-04-26 17:32:05.868 [36183] relay/unix/:(socket)/101/main I> recover from `/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica2/00000000000000000004.xlog'
2022-04-26 17:32:05.911 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> subscribed
2022-04-26 17:32:05.911 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> remote vclock {1: 5} local vclock {0: 1, 1: 5}
2022-04-26 17:32:05.912 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: message {term: 2, leader: 1, leader is seen: true, state: follower} from 3
2022-04-26 17:32:05.914 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: message {term: 2, leader: 1, leader is seen: true, state: follower} from 3
2022-04-26 17:32:05.914 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> leaving orphan mode
2022-04-26 17:32:07.011 [36183] main/115/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica1.sock I> can't read row
2022-04-26 17:32:07.011 [36183] main/115/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica1.sock coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 25, aka unix/:(socket), peer of unix/:(socket): Broken pipe
2022-04-26 17:32:07.011 [36183] main/115/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica1.sock I> will retry every 0.10 second
>>>> 2022-04-26 17:32:07.011 [36183] relay/unix/:(socket)/101/main coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 31, aka unix/:(socket), peer of unix/:(socket): Broken pipe
2022-04-26 17:32:07.011 [36183] relay/unix/:(socket)/101/main I> exiting the relay loop
2022-04-26 17:32:07.353 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: message {term: 2, leader: 1, state: follower} from 3
2022-04-26 17:32:07.353 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: begin new election round
2022-04-26 17:32:07.353 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: bump term to 3, follow
2022-04-26 17:32:07.353 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: vote for 2, follow
2022-04-26 17:32:07.353 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: message {term: 3, vote: 3, state: candidate, vclock: {1: 9}} from 3
2022-04-26 17:32:07.353 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: vote request is skipped - already voted in this term
2022-04-26 17:32:07.353 [36183] main/118/raft_worker I> RAFT: persisted state {term: 3, vote: 2}
2022-04-26 17:32:07.353 [36183] main/118/raft_worker I> RAFT: enter candidate state with 1 self vote
2022-04-26 17:32:07.766 [36183] main I> RAFT: begin new election round
2022-04-26 17:32:07.766 [36183] main I> RAFT: bump term to 4, follow
2022-04-26 17:32:07.766 [36183] main I> RAFT: vote for 2, follow
2022-04-26 17:32:07.766 [36183] main/118/raft_worker I> RAFT: persisted state {term: 4, vote: 2}
2022-04-26 17:32:07.766 [36183] main/118/raft_worker I> RAFT: enter candidate state with 1 self vote
2022-04-26 17:32:07.767 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: message {term: 4, vote: 2, state: follower} from 3
2022-04-26 17:32:07.767 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: enter leader state with quorum 2
>>>> 2022-04-26 17:32:07.767 [36183] relay/unix/:(socket)/102/main:reader V> Got a corrupted row:
2022-04-26 17:32:07.767 [36183] relay/unix/:(socket)/102/main:reader V> 00000000: 81 00 00 81 26 81 01 09 02 01
2022-04-26 17:32:07.767 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: message {term: 4, vote: 2, leader: 2, leader is seen: true, state: follower} from 3
2022-04-26 17:32:07.767 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: vote request is skipped - the leader is already known - 2
>>>> 2022-04-26 17:32:07.767 [36183] relay/unix/:(socket)/101/main xrow.c:229 E> ER_INVALID_MSGPACK: Invalid MsgPack - packet body
2022-04-26 17:32:07.767 [36183] relay/unix/:(socket)/101/main I> exiting the relay loop
sergepetrenko commented 2 years ago

Found the issue together with @sergos: tarantool/tarantool#7089

sergepetrenko commented 2 years ago

Seems like we've fixed all the known issues with this test. No failures on my machine after 1000 runs. Closing.