openSUSE / Mojo-IOLoop-ReadWriteProcess

Execute external programs or internal code blocks as separate process
Other
10 stars 13 forks source link

Flaky error Non-zero wait status: 139 in t/02_parallel.t #40

Open foursixnine opened 2 years ago

foursixnine commented 2 years ago

So that it doesn't go under the table:

https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/runs/4472303650?check_suite_focus=true

Failed 1/15 test programs. 0/70 subtests failed. t/02_parallel.t (Wstat: 139 Tests: 1 Failed: 0) Non-zero wait status: 139 Parse errors: No plan found in TAP output Files=15, Tests=70, 256 wallclock secs ( 0.08 usr 0.04 sys + 135.23 cusr 10.99 csys = 146.34 CPU) Result: FAIL

cfconrad commented 2 years ago

I run it several times. Didn't found anything. Also from the code I don't see anything. There are a lot of sleeps in 02_parallel.t which are all not necessary, So I run 2 versions one with sleep and one without

COUNT: 62338 t/02_parallel_fast.t => Successful
COUNT: 3079 t/02_parallel.t  => Successful

so I don't know what it was on that CI run. My guess is, it came in connection with codecov, as also no test step failed but the script was executed != 0

UNFORTUANTLY I found a other issue

COUNT: 361 t/01_run.t => FAILED!! 9ba29d8

And I hit this also on osd https://build.opensuse.org/project/monitor/home:cfconrad:branches:devel:languages:perl?arch_aarch64=1&defaults=0&failed=1&repo_openSUSE_Tumbleweed=1 https://build.opensuse.org/project/monitor/home:cfconrad:branches:devel:languages:perl?arch_x86_64=1&defaults=0&failed=1&repo_openSUSE_Leap_15_0=1

cfconrad commented 2 years ago

When running in such error with verbose=1 and only that single test:

cover: running ./Build test "--extra_compiler_flags=-O0 -fprofile-arcs -ftest-coverage" "--extra_linker_flags=-fprofile-arcs -ftest-coverage"
t/02_parallel.t ..
# Subtest: parallel
    ok 1 - An object of class 'Mojo::IOLoop::ReadWriteProcess::Pool' isa 'Mojo::IOLoop::ReadWriteProcess::Pool'
    ok 2
    ok 3
    ok 4
    ok 5
    ok 6
    ok 7
    ok 8
    ok 9
    1..9
ok 1 - parallel
# Subtest: batch
    ok 1 - An object of class 'Mojo::IOLoop::ReadWriteProcess::Pool' isa 'Mojo::IOLoop::ReadWriteProcess::Pool'
    ok 2
    ok 3
    ok 4
    ok 5
    ok 6
All 1 subtests passed

Test Summary Report
-------------------
t/02_parallel.t (Wstat: 139 Tests: 1 Failed: 0)
  Non-zero wait status: 139
  Parse errors: No plan found in TAP output
Files=1, Tests=1,  9 wallclock secs ( 0.02 usr  0.01 sys +  0.63 cusr  0.27 csys =  0.93 CPU)
Result: FAIL
Failed 1/1 test programs. 0/1 subtests failed.
Reading database from /home/clemix/sandbox/Mojo-IOLoop-ReadWriteProcess/cover_db

-----------------------------------------------------------------------------
File
-----------------------------------------------------------------------------
Total
-----------------------------------------------------------------------------
cfconrad commented 2 years ago

A other error discovered during the WE:

Deleting database /home/clemix/sandbox/Mojo-IOLoop-ReadWriteProcess_2/cover_db
cover: running ./Build test "--extra_compiler_flags=-O0 -fprofile-arcs -ftest-coverage" "--extra_linker_flags=-fprofile-arcs -ftest-coverage"
t/02_parallel_fast.t .. 
# Subtest: parallel
    ok 1 - 'Is RWP::Pool' isa 'Mojo::IOLoop::ReadWriteProcess::Pool'
    ok 2 - Size is 4
Start
each read
    ok 3 - getline() success!
    ok 4 - getline() success!
    ok 5 - getline() success!
    ok 6 - getline() success!
    ok 7 - Fired process is 4
DO restart
double free or corruption (out)
No subtests run 

Test Summary Report
-------------------
t/02_parallel_fast.t (Wstat: 134 Tests: 0 Failed: 0)
  Non-zero wait status: 134
  Parse errors: No plan found in TAP output
Files=1, Tests=0,  2 wallclock secs ( 0.01 usr  0.01 sys +  0.55 cusr  0.14 csys =  0.71 CPU)
Result: FAIL
Failed 1/1 test programs. 0/0 subtests failed.
Reading database from /home/clemix/sandbox/Mojo-IOLoop-ReadWriteProcess_2/cover_db
cfconrad commented 2 years ago

I'm able to reproduce this (or almost this) locally now.

I guess we hit something like this https://github.com/cfconrad/Mojo-IOLoop-ReadWriteProcess/commit/1c27c4352e5a100987c22260539f2e7fbc6fa6ea , that our signal handler interrupt our code in unconditional state and s**t happens...

I was able to reproduce the error quite good with: https://github.com/cfconrad/Mojo-IOLoop-ReadWriteProcess/commit/d84d4eed29a25ea82183c90dbd8a44b7da992dbb I get the error always with less then 200 attempts and mostly around ~50.

But if I set emit_from_sigchld => 0 I do not hit the error within 25867 attempts. Maybe we should reconsider the default of that variable. (https://github.com/cfconrad/Mojo-IOLoop-ReadWriteProcess/commit/cdba6076f51835a21a5767159a1c69cf1d349f5e)

I didn't found the root cause of the error. But getting a double free or corruption (out) sounds really like some interruption glitch here and there...

@Martchus would you like to verify my observations ? Would be nice, so I do not see ghosts!