Closed dshil closed 4 years ago
The panic itself:
TEST(sender_receiver, fec_with_losses)
14:34:50.633 [inf] roc_lib: roc_context: opening context
14:34:50.729 [dbg] roc_lib: pool: initializing: object_size=672 poison=0
14:34:50.730 [dbg] roc_lib: pool: initializing: object_size=2064 poison=0
14:34:50.732 [dbg] roc_lib: pool: initializing: object_size=4112 poison=0
14:34:50.844 [dbg] roc_netio: transceiver: starting event loop
14:34:50.865 [inf] roc_lib: roc_receiver: opening receiver
14:34:50.872 [dbg] roc_audio: mixer: initializing: frame_size=640
14:34:50.923 [inf] roc_netio: udp receiver: opened port 127.0.0.1:48691
14:34:50.936 [inf] roc_pipeline: receiver: adding port rtp+rs8m:127.0.0.1:48691
14:34:50.941 [inf] roc_lib: roc_receiver: bound to rtp+rs8m:127.0.0.1:48691
14:34:50.944 [inf] roc_netio: udp receiver: opened port 127.0.0.1:59226
14:34:50.945 [inf] roc_pipeline: receiver: adding port rs8m:127.0.0.1:59226
14:34:50.945 [inf] roc_lib: roc_receiver: bound to rs8m:127.0.0.1:59226
14:34:51.027 [dbg] roc_netio: transceiver: starting event loop
14:34:51.059 [inf] roc_netio: udp sender: opened port 127.0.0.1:58144
14:34:51.068 [inf] roc_netio: udp receiver: opened port 127.0.0.1:37782
14:34:51.071 [inf] roc_netio: udp receiver: opened port 127.0.0.1:52198
14:34:51.075 [inf] roc_lib: roc_sender: opening sender
14:34:51.087 [inf] roc_netio: udp sender: opened port 127.0.0.1:44384
14:34:51.090 [inf] roc_lib: roc_sender: bound to 127.0.0.1:44384
14:34:51.095 [inf] roc_lib: roc_sender: set audio source port to rtp+rs8m:127.0.0.1:37782
14:34:51.099 [inf] roc_lib: roc_sender: set audio repair port to rs8m:127.0.0.1:52198
14:34:51.180 [inf] roc_pipeline: sender: using remote source port rtp+rs8m:127.0.0.1:37782
14:34:51.181 [inf] roc_pipeline: sender: using remote repair port rs8m:127.0.0.1:52198
14:34:51.192 [dbg] roc_fec: of encoder: initializing: codec=rs m=8
14:34:51.209 [dbg] roc_fec: fec writer: update block size: cur_sbl=0 cur_rbl=0 new_sbl=10 new_rbl=5
14:34:51.214 [dbg] roc_audio: packetizer: initializing: n_channels=2 samples_per_packet=50
14:34:51.274 [dbg] roc_packet: router: detected new stream: source=1178196737 flags=0x8u
14:34:51.378 [dbg] roc_packet: router: detected new stream: source=0 flags=0x10u
14:34:51.438 [inf] roc_pipeline: receiver: creating session: src_addr=127.0.0.1:58144 dst_addr=127.0.0.1:48691
14:34:51.443 [dbg] roc_packet: delayed reader: initializing: delay=1500
14:34:51.447 [dbg] roc_fec: of decoder: initializing: codec=rs m=8
14:34:51.453 [dbg] roc_audio: depacketizer: initializing: n_channels=2
14:34:51.458 [dbg] roc_audio: watchdog: initializing: max_blank_duration=30000 max_drops_duration=0 drop_detection_window=13230
14:34:51.468 [dbg] roc_audio: latency monitor: initializing: target_latency=1500 in_rate=44100 out_rate=44100
14:34:51.470 [dbg] roc_packet: router: detected new stream: source=1178196737 flags=0x8u
14:34:51.517 [dbg] roc_audio: depacketizer: ts=100 loss_ratio=0.00000
14:34:51.530 [dbg] roc_audio: watchdog: status: bbbbbbbbbbbbbbbbbbbb
14:34:51.534 [dbg] roc_audio: watchdog: status: bbbbbbbbbbbbbbbbbbbb
src/tests/roc_lib/test_sender_receiver.cpp:211: error: roc_panic()
ERROR: roc_test: !(leading_zeros < Timeout)
==911228== Syscall param msync(start) points to uninitialised byte(s)
==911228== at 0x49550AF: msync (in /usr/lib/libpthread-2.29.so)
==911228== by 0x11E778: access_mem (in /home/dshil/dev/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==911228== by 0x11FD85: apply_reg_state (in /home/dshil/dev/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==911228== by 0x11F601: _ULx86_64_dwarf_find_save_locs (in /home/dshil/dev/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==911228== by 0x11E218: _ULx86_64_dwarf_step (in /home/dshil/dev/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==911228== by 0x11D5DD: _ULx86_64_step (in /home/dshil/dev/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==911228== by 0x11C1DF: roc::core::print_backtrace() (in /home/dshil/dev/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==911228== by 0x11B83A: roc::core::crash(char const*) (in /home/dshil/dev/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==911228== by 0x11BCCC: roc::core::panic(char const*, char const*, int, char const*, ...) (in /home/dshil/dev/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==911228== by 0x113667: roc::(anonymous namespace)::Receiver::run() (in /home/dshil/dev/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==911228== by 0x114064: roc::TEST_sender_receiver_fec_with_losses_Test::testBody() (in /home/dshil/dev/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==911228== by 0x1314D3: helperDoTestBody(void*) (in /home/dshil/dev/roc/bin/x86_64-pc-linux-gnu/roc-test-lib)
==911228== Address 0x1ffeffe014 is on thread 1's stack
==911228== in frame #6, created by roc::core::print_backtrace() (???:)
==911228==
==911228==
==911228== Exit program on first error (--exit-on-first-error=yes)
FYI @alexandremgo
I'll look into it on Monday (don't have access to my computer right now) ;)
Hey!
I was able to reproduce the error only by decreasing Timeout (Timeout = TotalSamples * 2 for ex).
Otherwise the loop would run into the following error after a while:
pure virtual method called
terminate called without an active exception
ERROR: caught SIGABRT
==13457== Thread 3:
==13457== Syscall param msync(start) points to uninitialised byte(s)
Maybe TEST(sender_receiver, fec_with_losses) is not called in my case? How is ROC_TARGET_OPENFEC defined?
My build command:
scons -Q --enable-pulseaudio-modules --build-3rdparty=openfec,pulseaudio --enable-werror --enable-debug
@alexandremgo
Thanks for coming back to this issue!
Otherwise the loop would run into the following error after a while:
It seems strange. @gavv have you ever encountered this type of error?
is not called in my case
It is called every time because you built with OpenFEC support; as a result, ROC_TARGET_OPENFEC is defined by the build system.
We have an option to disable this support: --disable-openfec ("disable OpenFEC support required for FEC codes"). If you disable OpenFEC support, the tests related to OpenFEC won't be included.
I was able to reproduce the error only by decreasing Timeout (Timeout = TotalSamples * 2 for ex)
Maybe @dshil's machine was more loaded during the test or the hardware is different... Anyway making this test adaptive and not dependent on the specific timeout (as described in the issue) is a good idea I think.
Otherwise the loop would run into the following error after a while
This is something new. It seems that this test needs much more love :)
Do you have a backtrace? Probably you could capture it using libSegFault.so with LD_PRELOAD.
I'd expect that Roc will print backtrace on terminate because it raises SIGABRT. Does it print it when running without valgrind? If it doesn't, we should open a separate issue to fix this. And we likely should also open a separate issue for the problem itself (pure virtual function call in test).
Maybe TEST(sender_receiver, fec_with_losses) is not called in my case ?
Here is how you can see what tests are running:
./bin/x86_64-pc-linux-gnu/roc-test-lib -v
(and also see logs)
and here is how you can run an individual test group:
./bin/x86_64-pc-linux-gnu/roc-test-lib -g sender_receiver
or individual test:
./bin/x86_64-pc-linux-gnu/roc-test-lib -g sender_receiver -n fec_with_losses
BTW I can reproduce the original issue (!(leading_zeros < Timeout)) by running 6 instances of the loop in parallel. I have a 2-core CPU with hyper-threading.
Receiver waits for a first non-zero sample and reads the next N samples each of which can be either a zero or non-zero sample
Receiver ensures that there were no more than 10% of losses for the N received samples (we already have this functionality)
We can relax the requirements even further. Instead of reading N samples and expecting that 90% will be non-zero, we can just indefinitely read samples until we accumulate N non-zero samples. (But we should keep the check that the each sample is either correct on zero.)
This way the test should become tolerant to the system load, I hope.
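A minimal sketch of the relaxed check described above, not the actual test code. It assumes integer samples instead of real audio samples, a signal that skips zero so that zero can only mean a lost sample, and a read cap standing in for the long timeout. The names advance_signal, read_until_enough, read_sample and max_reads are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical helper: next value of the repeating test signal.
// The value wraps from 255 to 1, so zero never occurs in the signal itself.
inline std::uint8_t advance_signal(std::uint8_t value) {
    return value == 255 ? std::uint8_t(1) : static_cast<std::uint8_t>(value + 1);
}

// Reads samples until n_nonzero non-zero samples have been accumulated.
// Returns false if a sample is neither zero nor the expected value,
// or if max_reads samples were read without accumulating enough.
bool read_until_enough(const std::function<std::uint8_t()>& read_sample,
                       std::size_t n_nonzero,
                       std::size_t max_reads) {
    assert(n_nonzero > 0);

    bool started = false;       // becomes true at the first non-zero sample
    std::uint8_t expected = 0;  // value the sender should have produced here
    std::size_t good = 0;       // non-zero samples accumulated so far

    for (std::size_t i = 0; i < max_reads; ++i) {
        const std::uint8_t sample = read_sample();

        if (!started) {
            if (sample == 0) {
                continue;  // still waiting for the stream to start
            }
            started = true;
            expected = sample;
        }

        if (sample != 0) {
            if (sample != expected) {
                return false;  // corrupted sample: neither zero nor correct
            }
            if (++good == n_nonzero) {
                return true;
            }
        }

        // A lost (zero) sample still consumes one position of the signal.
        expected = advance_signal(expected);
    }

    return false;  // read cap reached, the equivalent of the long timeout
}
```

With this shape the number of zero samples no longer matters, only that every non-zero sample sits at the right position in the signal.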
Later we will add more strict tests that will check service quality and latency. But those tests will not be intended for travis and valgrind. I think we will run them on hardware.
@alexandremgo Are you working on this issue / have plans to work on it?
Receiver waits for a first non-zero sample and reads the next N samples each of which can be either a zero or non-zero sample
Receiver ensures that there were no more than 10% of losses for the N received samples (we already have this functionality)
We can relax the requirements even further. Instead of reading N samples and expecting that 90% will be non-zero, we can just indefinitely read samples until we accumulate N non-zero samples. (But we should keep the check that the each sample is either correct on zero.)
Yes, I agree it would relax the requirements. (We agree you meant: we should keep the check that each sample is either correct OR zero ?) To do so, should the Sender send a repetitive signal in a loop as mentioned by @dshil ?
@alexandremgo Are you working on this issue / have plans to work on it?
Yes I'm working on it ;)
This is something new. It seems that this test needs much more love :)
Do you have a backtrace? Probably you could capture it using libSegFault.so with LD_PRELOAD.
I'd expect that Roc will print backtrace on terminate because it raises SIGABRT. Does it print it when running without valgrind? If it doesn't, we should open a separate issue to fix this. And we likely should also open a separate issue for the problem itself (pure virtual function call in test).
Strangely enough I did not manage to reproduce this error this morning. I'll come back to it if it appears again.
We agree you meant: we should keep the check that each sample is either correct OR zero ?
Yep. A typo.
To do so, should the Sender send a repetitive signal in a loop as mentioned by @dshil ?
Yep.
Yes I'm working on it ;)
Cool!
Strangely enough I did not manage to reproduce this error this morning. I'll come back to it if it appears again.
OK.
I have several questions on #283 :)
Fixed. Valgrind is enabled in Travis.
After #275 (#223) roc_lib tests were enabled. In very rare cases (at very high system load) it is possible to get a panic in TEST(sender_receiver, fec_with_losses).
The panic signals that we did not get a sample within the specified timeout.
How to reproduce
Use the following script to run tests under valgrind:
Note that the panic is not reproduced on every script run. Put the system under load and you will be more likely to catch the panic.
Solution
Sender sends a repetitive signal in a loop (the signal can be represented as a sequence of constantly increasing values, where each value equals the previous value + 1, taking into account a possible overflow)
Receiver waits for a first non-zero sample and reads the next N samples each of which can be either a zero or non-zero sample
Receiver ensures that there were no more than 10% of losses for the N received samples (we already have this functionality)
Receiver uses long timeout (e.g., 1 minute) to ensure that the test won't run forever
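The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the real test: samples are modeled as integers, the signal wraps from 255 to 1 so that zero only ever means a lost sample, and the timeout is counted in samples rather than wall-clock time. next_signal_value, CheckResult and check_stream are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: the sender's signal is 1, 2, ..., 255, 1, 2, ...
// Zero is skipped on overflow so it can be reserved for lost samples.
inline std::uint8_t next_signal_value(std::uint8_t prev) {
    return prev == 255 ? std::uint8_t(1) : static_cast<std::uint8_t>(prev + 1);
}

struct CheckResult {
    bool found_signal;    // a non-zero sample arrived before the timeout
    bool values_ok;       // every checked sample was zero or the expected value
    std::size_t checked;  // number of samples checked (at most n)
    double loss_ratio;    // share of zero samples among the checked ones
};

// Skips leading zeros (fewer than `timeout` of them are allowed), then checks
// the next n samples, each of which may be zero (lost) or the expected value.
CheckResult check_stream(const std::vector<std::uint8_t>& samples,
                         std::size_t n,
                         std::size_t timeout) {
    assert(n > 0);
    CheckResult res = {false, true, 0, 0.0};

    std::size_t pos = 0;
    while (pos < samples.size() && samples[pos] == 0) {
        ++pos;  // count leading zeros
    }
    if (pos == samples.size() || pos >= timeout) {
        return res;  // no signal within the timeout
    }
    res.found_signal = true;

    std::uint8_t expected = samples[pos];
    std::size_t zeros = 0;
    for (; res.checked < n && pos < samples.size(); ++res.checked, ++pos) {
        if (samples[pos] == 0) {
            ++zeros;  // lost sample
        } else if (samples[pos] != expected) {
            res.values_ok = false;  // corrupted sample
        }
        expected = next_signal_value(expected);
    }

    res.loss_ratio = static_cast<double>(zeros) / static_cast<double>(res.checked);
    return res;
}
```

A test built on this would require found_signal, values_ok, checked == n, and loss_ratio of at most 0.10, with the timeout chosen large enough (the issue suggests about a minute) that it only guards against the test running forever.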
Relates: #277