tarantool / tarantool-qa

QA related issues of Tarantool

test: flaky replication/box_set_replication_stress.test.lua test #200

Open · avtikhon opened this issue 4 years ago

avtikhon commented 4 years ago

Test replication/box_set_replication_stress.test.lua is flaky:

OS: always macOS
arch: always x86_64


Tarantool version: Tarantool 2.6.0-36-g29ec62891
Target: Darwin-x86_64-RelWithDebInfo
Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr/local -DENABLE_BACKTRACE=ON
Compiler: /Library/Developer/CommandLineTools/usr/bin/cc /Library/Developer/CommandLineTools/usr/bin/c++
C_FLAGS: -Wno-unknown-pragmas -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -msse2 -std=c11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-gnu-alignof-expression -Werror
CXX_FLAGS: -Wno-unknown-pragmas -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -msse2 -std=c++11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-invalid-offsetof -Wno-gnu-alignof-expression -Werror

OS version: OSX

Bug description:

Issue:

--- replication/box_set_replication_stress.result   Fri Aug 14 18:28:41 2020
+++ var/004_replication/box_set_replication_stress.result   Sat Aug 15 15:19:44 2020
@@ -34,5 +34,3 @@

 -- Cleanup.
 test_run:drop_cluster(SERVERS)
- | ---
- | ...

It was found that the drop_cluster() routine from the test-run repository failed to stop replica #1.
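
For context, here is a condensed sketch of the test's shape, reconstructed from the result-file diffs in this issue (the exact SERVERS list and the stress loop body are assumptions):

test_run = require('test_run').new()
SERVERS = {'master_quorum1', 'master_quorum2'}

-- Deploy a cluster and wait until replication forms a full mesh.
test_run:create_cluster(SERVERS)
test_run:wait_fullmesh(SERVERS)

test_run:cmd("switch master_quorum1")
repl = box.cfg.replication
-- ... repeatedly reapply box.cfg{replication = repl} to stress
-- replication reconfiguration ...
test_run:cmd("switch default")

-- Cleanup. This is the flaky step: drop_cluster() stops each server
-- and fails when a replica does not exit in time.
test_run:drop_cluster(SERVERS)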

Steps to reproduce:

c=0; while ./test-run.py -j 10 $(for t in {1..50}; do echo replication/box_set_replication_stress.test.lua; done); do date; c=$(($c+1)); echo PASSED: $c; done | tee alx.log


Totktonada commented 4 years ago

Related: https://github.com/tarantool/test-run/issues/157

sergos commented 2 years ago

A bit more detail on the failure:

[011] replication/box_set_replication_stress.test.lua memtx           [ fail ]
[011] 
[011] Test failed! Result content mismatch:
[011] --- replication/box_set_replication_stress.result Tue Feb 15 18:12:56 2022
[011] +++ /tmp/t/rejects/replication/box_set_replication_stress.reject  Thu Nov 17 23:51:36 2022
[011] @@ -9,14 +9,19 @@
[011]  -- Deploy a cluster.
[011]  test_run:create_cluster(SERVERS)
[011]   | ---
[011] + | - error: '/private/tmp/t/011_replication/test_run.lua:21: Usage: yaml.decode(document,
[011] + |     [{tag_only = boolean}])'
[011]   | ...
[011]  test_run:wait_fullmesh(SERVERS)
[011]   | ---
[011] + | - error: '/private/tmp/t/011_replication/test_run.lua:21: Usage: yaml.decode(document,
[011] + |     [{tag_only = boolean}])'
[011]   | ...
[011]  
[011]  test_run:cmd("switch master_quorum1")
[011]   | ---
[011] - | - true
[011] + | - error: '/private/tmp/t/011_replication/test_run.lua:21: Usage: yaml.decode(document,
[011] + |     [{tag_only = boolean}])'
[011]   | ...
[011]  repl = box.cfg.replication
[011]   | ---
[011] @@ -29,10 +34,13 @@
[011]   | ...
[011]  test_run:cmd("switch default")
[011]   | ---
[011] - | - true
[011] + | - error: '/private/tmp/t/011_replication/test_run.lua:21: Usage: yaml.decode(document,
[011] + |     [{tag_only = boolean}])'
[011]   | ...
[011]  
[011]  -- Cleanup.
[011]  test_run:drop_cluster(SERVERS)
[011]   | ---
[011] + | - error: '/private/tmp/t/011_replication/test_run.lua:21: Usage: yaml.decode(document,
[011] + |     [{tag_only = boolean}])'
[011]   | ...
[011] 
[011] Last 15 lines of Tarantool Log file [Instance "master"][/tmp/t/011_replication/master.log]:
[011] 2022-11-17 23:51:35.892 [4723] snapshot/101/main I> done
[011] 2022-11-17 23:51:35.893 [4723] main/103/master I> ready to accept requests
[011] 2022-11-17 23:51:35.893 [4723] main/105/gc I> wal/engine cleanup is resumed
[011] 2022-11-17 23:51:35.893 [4723] main/103/master I> set 'replication_timeout' configuration option to 0.1
[011] 2022-11-17 23:51:35.893 [4723] main/103/master I> set 'replication_sync_timeout' configuration option to 100
[011] 2022-11-17 23:51:35.893 [4723] main/103/master I> set 'log' configuration option to "\/tmp\/t\/011_replication\/master.log"
[011] 2022-11-17 23:51:35.893 [4723] main/103/master I> set 'listen' configuration option to "\/tmp\/t\/011_replication\/master.i"
[011] 2022-11-17 23:51:35.893 [4723] main/103/master I> set 'memtx_memory' configuration option to 107374182
[011] 2022-11-17 23:51:35.893 [4723] main/106/checkpoint_daemon I> scheduled next checkpoint for Fri Nov 18 01:50:37 2022
[011] 2022-11-17 23:51:35.894 [4723] main/113/console/unix/:/tmp/t/011_replication/master.c I> started
[011] 2022-11-17 23:51:35.894 [4723] main I> entering the event loop
[011] 2022-11-17 23:51:35.917 [4723] main/115/console/unix/: I> starting full mesh
[011] 2022-11-17 23:51:35.917 [4723] main/115/console/unix/: I> master_quorum1: waiting bootstrap
[011] 2022-11-17 23:51:35.949 [4723] main/115/console/unix/: I> leaving orphan mode
[011] 2022-11-17 23:51:35.949 [4723] main/115/console/unix/: I> set 'replication' configuration option to []
[011] Exception: [Errno 8] nodename nor servname provided, or not known: ('localhost', 0)
---------------------------------------------------------------------------------

The socket is connected to self.host and self.port, which are provided by inspector_new(); it is called with no arguments at the start of the test, so the pair is obtained from the environment:

local function inspector_new(host, port)
    local inspector = {}

    inspector.host = host or os.getenv('INSPECTOR_HOST') or 'localhost'
    inspector.port = port or tonumber(os.getenv('INSPECTOR_PORT'))
    if inspector.port == nil then
        error('Inspector not started')
    end

    return setmetatable(inspector, { __index = inspector_methods })
end

This is unreliable: the port obtained this way can belong to an inspector instance that is already dead by the time the socket is read, and the port number may meanwhile have been reused by another instance.
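
A minimal sketch of how this bites, in Tarantool Lua (the connection and read calls below are illustrative of the failure mode, not test-run's actual code; only the host/port resolution mirrors the snippet above):

local socket = require('socket')
local yaml = require('yaml')

local host = os.getenv('INSPECTOR_HOST') or 'localhost'
local port = tonumber(os.getenv('INSPECTOR_PORT'))
if port == nil then
    error('Inspector not started')
end

-- If the inspector has died by now, the connect fails, or whatever
-- reused the port does not speak the expected protocol and the read
-- yields nil.
local sock = socket.tcp_connect(host, port)
local reply = sock ~= nil and sock:read(4096, 1) or nil

-- yaml.decode() requires a string; feeding it nil raises exactly the
-- "Usage: yaml.decode(document, [{tag_only = boolean}])" error seen in
-- the reject file above.
local ok, err = pcall(yaml.decode, reply)
print(ok, err)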

sergos commented 1 year ago

The current CI status shows the following single remaining cause of failure:

[040] replication/box_set_replication_stress.test.lua vinyl           
[040] * [QA Notice]
[040] *
[040] * The server 'master_quorum1' does not stop during 5 seconds after the 15 (SIGTERM) signal.
[040] * Info: process 24563 [unknown; unknown]
[040] * Sending SIGKILL...
[040] *

This should be resolved by 8023.
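
For reference, a rough Lua sketch of the stop-then-escalate behavior the QA notice describes (test-run itself implements this in Python; the 5-second grace period matches the notice, the rest is illustrative):

local ffi = require('ffi')
local fiber = require('fiber')

ffi.cdef('int kill(int pid, int sig);')

local SIGTERM, SIGKILL = 15, 9

-- Ask the server to stop, then escalate to SIGKILL if it is still
-- alive after the grace period, as the notice above reports doing.
local function stop_server(pid, grace)
    ffi.C.kill(pid, SIGTERM)
    local deadline = fiber.clock() + (grace or 5)
    while fiber.clock() < deadline do
        -- kill(pid, 0) delivers no signal; it only probes existence.
        if ffi.C.kill(pid, 0) ~= 0 then
            return true   -- the server stopped gracefully
        end
        fiber.sleep(0.1)
    end
    ffi.C.kill(pid, SIGKILL)
    return false          -- the server had to be killed
end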