roc_lib test panics under valgring

dshil commented 5 years ago

After #275 (#223) roc_lib tests were enabled. In very rare cases (at very high system load) it is possible to get the panic in TEST(sender_receiver, fec_with_losses):

src/tests/roc_lib/test_sender_receiver.cpp:211: error: roc_panic()

ERROR: roc_test: !(leading_zeros < Timeout)

The panic signalizes the fact that we don't get a sample in the specified timeout:

How to reproduce

Use the following script to run tests under valgrind:

while :; do python scripts/wrappers/timeout.py 300 \
  valgrind --max-stackframe=10475520 --error-exitcode=1 --exit-on-first-error=yes \
    bin/x86_64-pc-linux-gnu/roc-test-lib || break; done

Note, that the panic is not reproduced on each script run. Try to load the system and you will be more lucky to catch the panic.

Solution

Sender sends a repetitive signal in a loop (the signal can be represented as a set of constantly increasing values, where the next value equals to the previous value + 1, taking into account a possible overflow)
Receiver waits for a first non-zero sample and reads the next N samples each of which can be either a zero or non-zero sample
Receiver ensures that there were no more than 10% of losses for the N received samples (we already have this functionality)
Receiver uses long timeout (e.g., 1 minute) to ensure that the test won't run forever

Relates: #277

dshil commented 5 years ago

The panic itself:

TEST(sender_receiver, fec_with_losses)
14:34:50.633 [inf] roc_lib: roc_context: opening context
14:34:50.729 [dbg] roc_lib: pool: initializing: object_size=672 poison=0
14:34:50.730 [dbg] roc_lib: pool: initializing: object_size=2064 poison=0
14:34:50.732 [dbg] roc_lib: pool: initializing: object_size=4112 poison=0
14:34:50.844 [dbg] roc_netio: transceiver: starting event loop
14:34:50.865 [inf] roc_lib: roc_receiver: opening receiver
14:34:50.872 [dbg] roc_audio: mixer: initializing: frame_size=640
14:34:50.923 [inf] roc_netio: udp receiver: opened port 127.0.0.1:48691
14:34:50.936 [inf] roc_pipeline: receiver: adding port rtp+rs8m:127.0.0.1:48691
14:34:50.941 [inf] roc_lib: roc_receiver: bound to rtp+rs8m:127.0.0.1:48691
14:34:50.944 [inf] roc_netio: udp receiver: opened port 127.0.0.1:59226
14:34:50.945 [inf] roc_pipeline: receiver: adding port rs8m:127.0.0.1:59226
14:34:50.945 [inf] roc_lib: roc_receiver: bound to rs8m:127.0.0.1:59226
14:34:51.027 [dbg] roc_netio: transceiver: starting event loop
14:34:51.059 [inf] roc_netio: udp sender: opened port 127.0.0.1:58144
14:34:51.068 [inf] roc_netio: udp receiver: opened port 127.0.0.1:37782
14:34:51.071 [inf] roc_netio: udp receiver: opened port 127.0.0.1:52198
14:34:51.075 [inf] roc_lib: roc_sender: opening sender
14:34:51.087 [inf] roc_netio: udp sender: opened port 127.0.0.1:44384
14:34:51.090 [inf] roc_lib: roc_sender: bound to 127.0.0.1:44384
14:34:51.095 [inf] roc_lib: roc_sender: set audio source port to rtp+rs8m:127.0.0.1:37782
14:34:51.099 [inf] roc_lib: roc_sender: set audio repair port to rs8m:127.0.0.1:52198
14:34:51.180 [inf] roc_pipeline: sender: using remote source port rtp+rs8m:127.0.0.1:37782
14:34:51.181 [inf] roc_pipeline: sender: using remote repair port rs8m:127.0.0.1:52198
14:34:51.192 [dbg] roc_fec: of encoder: initializing: codec=rs m=8
14:34:51.209 [dbg] roc_fec: fec writer: update block size: cur_sbl=0 cur_rbl=0 new_sbl=10 new_rbl=5
14:34:51.214 [dbg] roc_audio: packetizer: initializing: n_channels=2 samples_per_packet=50
14:34:51.274 [dbg] roc_packet: router: detected new stream: source=1178196737 flags=0x8u
14:34:51.378 [dbg] roc_packet: router: detected new stream: source=0 flags=0x10u
14:34:51.438 [inf] roc_pipeline: receiver: creating session: src_addr=127.0.0.1:58144 dst_addr=127.0.0.1:48691
14:34:51.443 [dbg] roc_packet: delayed reader: initializing: delay=1500
14:34:51.447 [dbg] roc_fec: of decoder: initializing: codec=rs m=8
14:34:51.453 [dbg] roc_audio: depacketizer: initializing: n_channels=2
14:34:51.458 [dbg] roc_audio: watchdog: initializing: max_blank_duration=30000 max_drops_duration=0 drop_detection_window=13230
14:34:51.468 [dbg] roc_audio: latency monitor: initializing: target_latency=1500 in_rate=44100 out_rate=44100
14:34:51.470 [dbg] roc_packet: router: detected new stream: source=1178196737 flags=0x8u
14:34:51.517 [dbg] roc_audio: depacketizer: ts=100 loss_ratio=0.00000
14:34:51.530 [dbg] roc_audio: watchdog: status: bbbbbbbbbbbbbbbbbbbb
14:34:51.534 [dbg] roc_audio: watchdog: status: bbbbbbbbbbbbbbbbbbbb

src/tests/roc_lib/test_sender_receiver.cpp:211: error: roc_panic()

ERROR: roc_test: !(leading_zeros < Timeout)

==911228== Syscall param msync(start) points to uninitialised byte(s)
==911228==    at 0x49550AF: msync (in /usr/lib/libpthread-2.29.so)
==911228==    by 0x11E778: access_mem (in /home/dshil/dev/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==911228==    by 0x11FD85: apply_reg_state (in /home/dshil/dev/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==911228==    by 0x11F601: _ULx86_64_dwarf_find_save_locs (in /home/dshil/dev/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==911228==    by 0x11E218: _ULx86_64_dwarf_step (in /home/dshil/dev/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==911228==    by 0x11D5DD: _ULx86_64_step (in /home/dshil/dev/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==911228==    by 0x11C1DF: roc::core::print_backtrace() (in /home/dshil/dev/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==911228==    by 0x11B83A: roc::core::crash(char const*) (in /home/dshil/dev/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==911228==    by 0x11BCCC: roc::core::panic(char const*, char const*, int, char const*, ...) (in /home/dshil/dev/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==911228==    by 0x113667: roc::(anonymous namespace)::Receiver::run() (in /home/dshil/dev/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==911228==    by 0x114064: roc::TEST_sender_receiver_fec_with_losses_Test::testBody() (in /home/dshil/dev/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==911228==    by 0x1314D3: helperDoTestBody(void*) (in /home/dshil/dev/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==911228==  Address 0x1ffeffe014 is on thread 1's stack
==911228==  in frame #6, created by roc::core::print_backtrace() (???:)
==911228== 
==911228== 
==911228== Exit program on first error (--exit-on-first-error=yes)

dshil commented 5 years ago

FYI @alexandremgo

alexandremgo commented 5 years ago

I'll look into it on Monday (don't have access to my computer right now) ;)

alexandremgo commented 5 years ago

Hey!

I was able to reproduce the error only by decreasing Timeout (Timeout = TotalSamples * 2 for ex) Otherwise the loop would run into the following error after a while:

pure virtual method called
terminate called without an active exception

ERROR: caught SIGABRT

==13457== Thread 3:
==13457== Syscall param msync(start) points to uninitialised byte(s)

Maybe TEST(sender_receiver, fec_with_losses) is not called in my case ? How does ROC_TARGET_OPENFEC is defined ?

My build command:

scons -Q --enable-pulseaudio-modules --build-3rdparty=openfec,pulseaudio --enable-werror --enable-debug

dshil commented 5 years ago

@alexandremgo

Thanks for coming back to this issue!

Otherwise the loop would run into the following error after a while:

It seems strange. @gavv have you ever encountered this type of error?

is not called in my case

It is called every time because you built with the OpenFEC support, as a result ROC_TARGET_OPENFEC is defined by the build system.

We have the option to disable this support: --disable-openfec disable OpenFEC support required for FEC codes. If you disable the OpenFEC support tests related to OpenFEC won't be included.

gavv commented 5 years ago

I was able to reproduce the error only by decreasing Timeout (Timeout = TotalSamples * 2 for ex)

Maybe @dshil's machine was more loaded during the test or the hardware is different... Anyway making this test adaptive and not dependent on the specific timeout (as described in the issue) is a good idea I think.

Otherwise the loop would run into the following error after a while

This is something new. It seems that this test needs much more love :)

Do you have a backtrace? Probably you could capture it using libSegFault.so with LD_PRELOAD.

I'd expect that Roc will print backtrace on terminate because it raises SIGABRT. Does it print it when running without valgrind? If it doesn't, we should open a separate issue to fix this. And we likely should also open a separate issue for the problem itself (pure virtual function call in test).

Maybe TEST(sender_receiver, fec_with_losses) is not called in my case ?

Here is how you can see what tests are running:

./bin/x86_64-pc-linux-gnu/roc-test-lib -v

(and also see logs)

and here is how you can run an individual test group:

./bin/x86_64-pc-linux-gnu/roc-test-lib -g sender_receiver

or individual test:

 ./bin/x86_64-pc-linux-gnu/roc-test-lib -g sender_receiver -n fec_with_losses

gavv commented 5 years ago

BTW I can reproduce the original issue (!(leading_zeros < Timeout)) by running 6 instances of the loop in parallel. I have 2-core CPU with hyper-threading.

gavv commented 5 years ago

Receiver waits for a first non-zero sample and reads the next N samples each of which can be either a zero or non-zero sample

Receiver ensures that there were no more than 10% of losses for the N received samples (we already have this functionality)

We can relax the requirements even further. Instead of reading N samples and expecting that 90% will be non-zero, we can just indefinitely read samples until we accumulate N non-zero samples. (But we should keep the check that the each sample is either correct on zero.)

This way the test should become tolerant to the system load, I hope.

Later we will add more strict tests that will check service quality and latency. But those tests will not be intended for travis and valgrind. I think we will run them on hardware.

@alexandremgo Are you working on this issue / have plans to work on it?

alexandremgo commented 5 years ago

Receiver waits for a first non-zero sample and reads the next N samples each of which can be either a zero or non-zero sample

Receiver ensures that there were no more than 10% of losses for the N received samples (we already have this functionality)

We can relax the requirements even further. Instead of reading N samples and expecting that 90% will be non-zero, we can just indefinitely read samples until we accumulate N non-zero samples. (But we should keep the check that the each sample is either correct on zero.)

Yes i agree it would relax the requirements. (We agree you meant: we should keep the check that each sample is either correct OR zero ?). To do so, should the Sender send a repetitive signal in a loop as mentioned by @dshil ?

@alexandremgo Are you working on this issue / have plans to work on it?

Yes I'm working on it ;)

alexandremgo commented 5 years ago

This is something new. It seems that this test needs much more love :)

Do you have a backtrace? Probably you could capture it using libSegFault.so with LD_PRELOAD.

I'd expect that Roc will print backtrace on terminate because it raises SIGABRT. Does it print it when running without valgrind? If it doesn't, we should open a separate issue to fix this. And we likely should also open a separate issue for the problem itself (pure virtual function call in test).

Strangely enough i did not manage to reproduce this error this morning. I'll come back to it if it appears again.

gavv commented 5 years ago

We agree you meant: we should keep the check that each sample is either correct OR zero ?

Yep. A typo.

To do so, should the Sender send a repetitive signal in a loop as mentioned by @dshil ?

Yep.

Yes I'm working on it ;)

Cool!

Strangely enough i did not manage to reproduce this error this morning. I'll come back to it if it appears again.

OK.

alexandremgo commented 5 years ago

I have several questions on #283 :)

gavv commented 4 years ago

Fixed. Valgrind is enabled in travis .

gavv commented 4 years ago

Later we will add more strict tests that will check service quality and latency. But those tests will not be intended for travis and valgrind. I think we will run them on hardware.

I've created roc-tests repo and a few help-wanted issues: #296, #297, #298.

roc-streaming / roc-toolkit

roc_lib test panics under valgring #278

How to reproduce

Solution