Open avtikhon opened 4 years ago
A bit more details on the failure:
[011] replication/box_set_replication_stress.test.lua memtx [ fail ]
[011]
[011] Test failed! Result content mismatch:
[011] --- replication/box_set_replication_stress.result Tue Feb 15 18:12:56 2022
[011] +++ /tmp/t/rejects/replication/box_set_replication_stress.reject Thu Nov 17 23:51:36 2022
[011] @@ -9,14 +9,19 @@
[011] -- Deploy a cluster.
[011] test_run:create_cluster(SERVERS)
[011] | ---
[011] + | - error: '/private/tmp/t/011_replication/test_run.lua:21: Usage: yaml.decode(document,
[011] + | [{tag_only = boolean}])'
[011] | ...
[011] test_run:wait_fullmesh(SERVERS)
[011] | ---
[011] + | - error: '/private/tmp/t/011_replication/test_run.lua:21: Usage: yaml.decode(document,
[011] + | [{tag_only = boolean}])'
[011] | ...
[011]
[011] test_run:cmd("switch master_quorum1")
[011] | ---
[011] - | - true
[011] + | - error: '/private/tmp/t/011_replication/test_run.lua:21: Usage: yaml.decode(document,
[011] + | [{tag_only = boolean}])'
[011] | ...
[011] repl = box.cfg.replication
[011] | ---
[011] @@ -29,10 +34,13 @@
[011] | ...
[011] test_run:cmd("switch default")
[011] | ---
[011] - | - true
[011] + | - error: '/private/tmp/t/011_replication/test_run.lua:21: Usage: yaml.decode(document,
[011] + | [{tag_only = boolean}])'
[011] | ...
[011]
[011] -- Cleanup.
[011] test_run:drop_cluster(SERVERS)
[011] | ---
[011] + | - error: '/private/tmp/t/011_replication/test_run.lua:21: Usage: yaml.decode(document,
[011] + | [{tag_only = boolean}])'
[011] | ...
[011]
[011] Last 15 lines of Tarantool Log file [Instance "master"][/tmp/t/011_replication/master.log]:
[011] 2022-11-17 23:51:35.892 [4723] snapshot/101/main I> done
[011] 2022-11-17 23:51:35.893 [4723] main/103/master I> ready to accept requests
[011] 2022-11-17 23:51:35.893 [4723] main/105/gc I> wal/engine cleanup is resumed
[011] 2022-11-17 23:51:35.893 [4723] main/103/master I> set 'replication_timeout' configuration option to 0.1
[011] 2022-11-17 23:51:35.893 [4723] main/103/master I> set 'replication_sync_timeout' configuration option to 100
[011] 2022-11-17 23:51:35.893 [4723] main/103/master I> set 'log' configuration option to "\/tmp\/t\/011_replication\/master.log"
[011] 2022-11-17 23:51:35.893 [4723] main/103/master I> set 'listen' configuration option to "\/tmp\/t\/011_replication\/master.i"
[011] 2022-11-17 23:51:35.893 [4723] main/103/master I> set 'memtx_memory' configuration option to 107374182
[011] 2022-11-17 23:51:35.893 [4723] main/106/checkpoint_daemon I> scheduled next checkpoint for Fri Nov 18 01:50:37 2022
[011] 2022-11-17 23:51:35.894 [4723] main/113/console/unix/:/tmp/t/011_replication/master.c I> started
[011] 2022-11-17 23:51:35.894 [4723] main I> entering the event loop
[011] 2022-11-17 23:51:35.917 [4723] main/115/console/unix/: I> starting full mesh
[011] 2022-11-17 23:51:35.917 [4723] main/115/console/unix/: I> master_quorum1: waiting bootstrap
[011] 2022-11-17 23:51:35.949 [4723] main/115/console/unix/: I> leaving orphan mode
[011] 2022-11-17 23:51:35.949 [4723] main/115/console/unix/: I> set 'replication' configuration option to []
[011] Exception: [Errno 8] nodename nor servname provided, or not known: ('localhost', 0)
---------------------------------------------------------------------------------
Socket is connected to self.host, self.port
which are provided by inspector_new
called with no arguments at the start of the test, hence obtaining the pair from env:
498 local function inspector_new(host, port)
499 local inspector = {}
500
501 inspector.host = host or os.getenv('INSPECTOR_HOST') or 'localhost'
502 inspector.port = port or tonumber(os.getenv('INSPECTOR_PORT'))
503 if inspector.port == nil then
504 error('Inspector not started')
505 end
506
507 return setmetatable(inspector, { __index = inspector_methods })
508 end
It is unreliable since port can be reused by another instance, which is dead by the time of reading from the socket.
Current status of CI shows the following single reason of failure:
[040] replication/box_set_replication_stress.test.lua vinyl
[040] * [QA Notice]
[040] *
[040] * The server 'master_quorum1' does not stop during 5 seconds after the 15 (SIGTERM) signal.
[040] * Info: process 24563 [unknown; unknown]
[040] * Sending SIGKILL...
[040] *
Which should be resolved with 8023
Test replication/box_set_replication_stress.test.lua is flaky:
OS: always macOS arch: always x86_64
Tarantool version: Tarantool 2.6.0-36-g29ec62891 Target: Darwin-x86_64-RelWithDebInfo Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr/local -DENABLE_BACKTRACE=ON Compiler: /Library/Developer/CommandLineTools/usr/bin/cc /Library/Developer/CommandLineTools/usr/bin/c++ C_FLAGS: -Wno-unknown-pragmas -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -msse2 -std=c11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-gnu-alignof-expression -Werror CXX_FLAGS: -Wno-unknown-pragmas -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -msse2 -std=c++11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-invalid-offsetof -Wno-gnu-alignof-expression -Werror
OS version: OSX
Bug description:
Issue:
Found that drop_cluster() routine from test-run repository failed to stop replica tarantool/tarantool#1 .
Steps to reproduce: c=0 ; while ./test-run.py -j 10
for t in {1..50} ; do echo replication/box_set_replication_stress.test.lua ; done
; do date ; c=$(($c+1)) ; echo PASSED: $c ; done | tee alx.logOptional (but very desirable):