Closed vks closed 8 years ago
Thanks @vks for the report! This seems a bit odd as CI builds are reporting successful on both Linux and Windows.
I'll dig a little deeper in this shortly.
The failing test is survey
. I added a timeout on the send & recv side of each socket so we can see where the test is blocked.
Now I'm getting this:
► env RUST_BACKTRACE=1 cargo test
[...]
test tests::pubsub ... ok
test tests::survey ... FAILED
failures:
---- tests::survey stdout ----
thread 'tests::survey' panicked at 'Connection timed out', src/lib.rs:1334
stack backtrace:
1: 0x555a28fb8a00 - sys::backtrace::tracing::imp::write::h578d6e47592a4041cHv
2: 0x555a28fbb3bb - panicking::default_handler::_$u7b$$u7b$closure$u7d$$u7d$::closure.44995
3: 0x555a28fbaf73 - panicking::default_handler::hffa00c845e94c6581nA
4: 0x555a28faea1c - sys_common::unwind::begin_unwind_inner::hd824946792143cc8vvu
5: 0x555a28faeea8 - sys_common::unwind::begin_unwind_fmt::h3777a045191296c5Buu
6: 0x555a28f653d3 - tests::test_read::h8537325e396e4b62rFb
at ~/src/nanomsg.rs/<std macros>:8
7: 0x555a28f6d2b1 - tests::survey::h4b21b688e15f24d7Z6b
at src/lib.rs:1645
8: 0x555a28f91a76 - boxed::F.FnBox<A>::call_box::h17494710956627492744
9: 0x555a28f9416b - sys_common::unwind::try::try_fn::h9823696487615081863
10: 0x555a28fb7dfb - __rust_try
11: 0x555a28fb510d - sys_common::unwind::inner_try::h8ac2089dadb4381bxsu
12: 0x555a28f944ea - boxed::F.FnBox<A>::call_box::h14092234691203767508
13: 0x555a28fb9e69 - sys::thread::Thread::new::thread_start::hd111df90784ef3b2llz
14: 0x7f6f8c04e423 - start_thread
15: 0x7f6f8bb77cbc - clone
16: 0x0 - <unknown>
failures:
tests::survey
test result: FAILED. 41 passed; 1 failed; 0 ignored; 0 measured
I don't like that Connection timed out
error message. Could you please try increasing the the sleep duration right after the bind
and connect
calls ? Something like 250 ms for a start. If it does not work, we could try with the TCP transport.
I tried running it a few times. Sometimes it works, sometimes it fails. Very rarely even poll_works
fails. Doubling the sleep duration to 20 ms makes survey
not fail anymore.
Seems like a race condition to me.
Surveyor
sockets are internally using the broadcast operation to implement write
, which does not provide the guarantee that all peers will be sent the message. Only the endpoints that have been seen as 'writable' will be sent the message, and with the non-blocking options. So if a peer is not connected or not ready yet, Surveyor
, Bus
and Pub
sockets will not deliver the message to that peer.
So yes there is a race condition, in the test, and increasing the sleep time in the test is the right fix in my opinion. Regarding poll_works
, could you send more details about the failure ?
To be fair, I knew next to nothing about nanomsg when I wrote these tests. Between then and now I had the opportunity to take a closer look at how nanomsg is implemented, so I guess I could rewrite some, if not all, of these tests.
So yes there is a race condition, in the test, and increasing the sleep time in the test is the right fix in my opinion. Regarding poll_works, could you send more details about the failure ?
Wouldn't the right fix add a loop that waits until the peers are ready?
This is the backtrace of the very rare poll_works
failure:
---- tests::poll_works stdout ----
thread 'tests::poll_works' panicked at 'assertion failed: `(left == right)` (left: `true`, right: `false`)', src/lib.rs:2030
stack backtrace:
1: 0x55a535186a20 - sys::backtrace::tracing::imp::write::h578d6e47592a4041cHv
2: 0x55a5351893db - panicking::default_handler::_$u7b$$u7b$closure$u7d$$u7d$::closure.44995
3: 0x55a535188f93 - panicking::default_handler::hffa00c845e94c6581nA
4: 0x55a53517ca3c - sys_common::unwind::begin_unwind_inner::hd824946792143cc8vvu
5: 0x55a53517cec8 - sys_common::unwind::begin_unwind_fmt::h3777a045191296c5Buu
6: 0x55a5351418d3 - tests::poll_works::he3b3430f65699b50QUc
at ~/src/nanomsg.rs/<std macros>:8
7: 0x55a53515fa96 - boxed::F.FnBox<A>::call_box::h17494710956627492744
8: 0x55a53516218b - sys_common::unwind::try::try_fn::h9823696487615081863
9: 0x55a535185e1b - __rust_try
10: 0x55a53518312d - sys_common::unwind::inner_try::h8ac2089dadb4381bxsu
11: 0x55a53516250a - boxed::F.FnBox<A>::call_box::h14092234691203767508
12: 0x55a535187e89 - sys::thread::Thread::new::thread_start::hd111df90784ef3b2llz
13: 0x7f03c2aab423 - start_thread
14: 0x7f03c25d4cbc - clone
15: 0x0 - <unknown>
Sadly, I don't know how we could check from the test that nanomsg sees the peers and knows they are 'writable'.
In the poll_works
tests, I can see the same sleep time used, would increasing it to 50 ms fix the problem ?
When running the tests, only 41 out of 42 finish.