thehydroimpulse / nanomsg.rs

Nanomsg library for Rust
Other
389 stars 56 forks source link

Tests hang and don't seem to finish #149

Closed vks closed 8 years ago

vks commented 8 years ago

When running the tests, only 41 out of 42 finish.

► rustc --version
rustc 1.9.0-nightly (998a6720b 2016-03-07)
► pacman -Q nanomsg
nanomsg 0.8-1
► cargo test
     Running target/debug/nanomsg-da8752b8a62b8468

running 42 tests
test result::tests::can_convert_error_code_to_error ... ok
test result::tests::nano_err_can_be_converted_to_io_err ... ok
test tests::bool_to_c_int_sanity ... ok
test result::tests::nano_err_can_be_converted_from_io_err ... ok
test tests::check_allocate ... ok
test tests::initialize_socket ... ok
test tests::connect_and_shutdown ... ok
test tests::bind_and_shutdown ... ok
test tests::bind_socket ... ok
test tests::bus ... ok
test tests::nb_write_works_in_both_cases ... ok
test tests::pipeline_mt1 ... ok
test tests::pipeline ... ok
test tests::pair ... ok
test tests::protocol_matches_raw ... ok
test tests::pipeline_mt2 ... ok
test tests::should_change_ipv4_only ... ok
test tests::should_change_linger ... ok
test tests::should_change_max_reconnect_interval ... ok
test tests::should_change_receive_buffer_size ... ok
test tests::should_change_receive_priority ... ok
test tests::should_change_receive_timeout ... ok
test tests::should_change_reconnect_interval ... ok
test tests::should_change_request_resend_interval ... ok
test tests::should_change_send_buffer_size ... ok
test tests::should_change_send_priority ... ok
test tests::should_change_send_timeout ... ok
test tests::should_change_socket_name ... ok
test tests::should_change_tcp_nodelay ... ok
test tests::should_get_receive_fd ... ok
test tests::should_get_send_fd ... ok
test tests::should_get_socket_name ... ok
test tests::nb_read_works_in_both_cases ... ok
test tests::nb_read_to_end_works_in_both_cases ... ok
test tests::connect_push_to_multi_ep ... ok
test tests::read_when_buffer_is_smaller_than_msg_return_buffer_size ... ok
test tests::bind_pull_to_multi_ep ... ok
test tests::poll_works ... ok
test tests::zero_copy_works ... ok
test tests::test_read_to_end ... ok
test tests::pubsub ... ok
^C
thehydroimpulse commented 8 years ago

Thanks @vks for the report! This seems a bit odd as CI builds are reporting successful on both Linux and Windows.

I'll dig a little deeper in this shortly.

blabaere commented 8 years ago

The failing test is survey. I added a timeout on the send & recv side of each socket so we can see where the test is blocked.

vks commented 8 years ago

Now I'm getting this:

► env RUST_BACKTRACE=1 cargo test
[...]
test tests::pubsub ... ok
test tests::survey ... FAILED

failures:

---- tests::survey stdout ----
    thread 'tests::survey' panicked at 'Connection timed out', src/lib.rs:1334
stack backtrace:
   1:     0x555a28fb8a00 - sys::backtrace::tracing::imp::write::h578d6e47592a4041cHv
   2:     0x555a28fbb3bb - panicking::default_handler::_$u7b$$u7b$closure$u7d$$u7d$::closure.44995
   3:     0x555a28fbaf73 - panicking::default_handler::hffa00c845e94c6581nA
   4:     0x555a28faea1c - sys_common::unwind::begin_unwind_inner::hd824946792143cc8vvu
   5:     0x555a28faeea8 - sys_common::unwind::begin_unwind_fmt::h3777a045191296c5Buu
   6:     0x555a28f653d3 - tests::test_read::h8537325e396e4b62rFb
                        at ~/src/nanomsg.rs/<std macros>:8
   7:     0x555a28f6d2b1 - tests::survey::h4b21b688e15f24d7Z6b
                        at src/lib.rs:1645
   8:     0x555a28f91a76 - boxed::F.FnBox<A>::call_box::h17494710956627492744
   9:     0x555a28f9416b - sys_common::unwind::try::try_fn::h9823696487615081863
  10:     0x555a28fb7dfb - __rust_try
  11:     0x555a28fb510d - sys_common::unwind::inner_try::h8ac2089dadb4381bxsu
  12:     0x555a28f944ea - boxed::F.FnBox<A>::call_box::h14092234691203767508
  13:     0x555a28fb9e69 - sys::thread::Thread::new::thread_start::hd111df90784ef3b2llz
  14:     0x7f6f8c04e423 - start_thread
  15:     0x7f6f8bb77cbc - clone
  16:                0x0 - <unknown>

failures:
    tests::survey

test result: FAILED. 41 passed; 1 failed; 0 ignored; 0 measured
blabaere commented 8 years ago

I don't like that Connection timed out error message. Could you please try increasing the the sleep duration right after the bind and connect calls ? Something like 250 ms for a start. If it does not work, we could try with the TCP transport.

vks commented 8 years ago

I tried running it a few times. Sometimes it works, sometimes it fails. Very rarely even poll_works fails. Doubling the sleep duration to 20 ms makes survey not fail anymore.

Seems like a race condition to me.

blabaere commented 8 years ago

Surveyor sockets are internally using the broadcast operation to implement write, which does not provide the guarantee that all peers will be sent the message. Only the endpoints that have been seen as 'writable' will be sent the message, and with the non-blocking options. So if a peer is not connected or not ready yet, Surveyor, Bus and Pub sockets will not deliver the message to that peer.

So yes there is a race condition, in the test, and increasing the sleep time in the test is the right fix in my opinion. Regarding poll_works, could you send more details about the failure ?

To be fair, I knew next to nothing about nanomsg when I wrote these tests. Between then and now I had the opportunity to take a closer look at how nanomsg is implemented, so I guess I could rewrite some, if not all, of these tests.

vks commented 8 years ago

So yes there is a race condition, in the test, and increasing the sleep time in the test is the right fix in my opinion. Regarding poll_works, could you send more details about the failure ?

Wouldn't the right fix add a loop that waits until the peers are ready?

This is the backtrace of the very rare poll_works failure:

---- tests::poll_works stdout ----
    thread 'tests::poll_works' panicked at 'assertion failed: `(left == right)` (left: `true`, right: `false`)', src/lib.rs:2030
stack backtrace:
   1:     0x55a535186a20 - sys::backtrace::tracing::imp::write::h578d6e47592a4041cHv
   2:     0x55a5351893db - panicking::default_handler::_$u7b$$u7b$closure$u7d$$u7d$::closure.44995
   3:     0x55a535188f93 - panicking::default_handler::hffa00c845e94c6581nA
   4:     0x55a53517ca3c - sys_common::unwind::begin_unwind_inner::hd824946792143cc8vvu
   5:     0x55a53517cec8 - sys_common::unwind::begin_unwind_fmt::h3777a045191296c5Buu
   6:     0x55a5351418d3 - tests::poll_works::he3b3430f65699b50QUc
                        at ~/src/nanomsg.rs/<std macros>:8
   7:     0x55a53515fa96 - boxed::F.FnBox<A>::call_box::h17494710956627492744
   8:     0x55a53516218b - sys_common::unwind::try::try_fn::h9823696487615081863
   9:     0x55a535185e1b - __rust_try
  10:     0x55a53518312d - sys_common::unwind::inner_try::h8ac2089dadb4381bxsu
  11:     0x55a53516250a - boxed::F.FnBox<A>::call_box::h14092234691203767508
  12:     0x55a535187e89 - sys::thread::Thread::new::thread_start::hd111df90784ef3b2llz
  13:     0x7f03c2aab423 - start_thread
  14:     0x7f03c25d4cbc - clone
  15:                0x0 - <unknown>
blabaere commented 8 years ago

Sadly, I don't know how we could check from the test that nanomsg sees the peers and knows they are 'writable'.

In the poll_works tests, I can see the same sleep time used, would increasing it to 50 ms fix the problem ?