thehydroimpulse commented 9 years ago

I haven't been able to dig down into it but I want to figure out a way we can get this library to be multi-threaded. As far as I can tell, we aren't using threads/tasks at all in the tests anymore. Nanomsg in theory should be thread-safe (and that was one of the original goals versus something like ZeroMQ where it was strictly thread-unsafe.

blabaere commented 9 years ago

I created a multithreaded test and put the test runner inside a script loop. See libnanomsg https://github.com/blabaere/nanomsg.rs/tree/multithreadtest I found the following problems :

Sometime, the test runner would just refuse to die, gdb reports: No threads. Linux Process Explorer does not show any thread and displays the process as stuck in 'futex_wait_queue_me'.
Sometime, an assertion fails inside nanomsg :Bad file descriptor 9
Sometime, this one : Resource temporarily unavailable 11

thehydroimpulse commented 9 years ago

Thanks @blabaere for digging into this. Yeah, it seems really odd. I only started hitting these issues when I had gotten far enough in implementing this library and now it seems it's never not hitting those issues. I know nanomsg said they tried to be thread-safe but there may be a few areas which are not.

blabaere commented 9 years ago

Finally got one example where it would hang forever but let me debug it:

0 0x00007f51cf52c9b3 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6

1 0x00007f51d0049932 in nn_poller_wait (self=0x7f51d026c644, timeout=100) at src/aio/poller_epoll.inc:186

2 0x00007f51d004c35e in nn_worker_routine (arg=0x7f51d026c600) at src/aio/worker_posix.inc:176

3 0x00007f51d004e67d in nn_thread_main_routine (arg=) at src/utils/thread_posix.inc:35

blabaere commented 9 years ago

My bad, it looks like I should add some more sync between the tasks. To sum up, I feel like to make this work, one must make sure bind is effective before calling connect, and make sure receive is effective before calling send. And none of them is effective right away when called (this is the reason the sleep calls in the original tests)

Here, the sender task is waiting for the end of the test, and the receiver task is waiting for the message.

Receiver:

#0  0x00007ff33a09b963 in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ff33abc83f8 in nn_efd_wait (self=<optimized out>, timeout=-1) at src/utils/efd.c:48
#2  0x00007ff33abc36c9 in nn_sock_recv (self=0x7ff320000930, msg=0x7ff333ffe7d0, flags=<optimized out>) at src/core/sock.c:671
#3  0x00007ff33abc0e07 in nn_recvmsg (s=<optimized out>, msghdr=0x7ff333ffe880, flags=<optimized out>) at src/core/global.c:872
#4  0x00007ff33abc110e in nn_recv (s=<optimized out>, buf=<optimized out>, len=<optimized out>, flags=<optimized out>) at src/core/global.c:707
#5  0x00007ff33b02fd6e in tests::test_receive::h0ec28f9462429bc7fqa () at src/lib.rs:335
#6  0x00007ff33b032c5d in tests::should_create_a_pipeline_mt::closure.3416 () at src/lib.rs:390

Sender:

#0  0x00007ff33a593d84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007ff33b0d8eef in sync::barrier::Barrier::wait::h8f2657fc8601dd6brVp ()
#2  0x00007ff33b032311 in tests::should_create_a_pipeline_mt::closure.3377 () at src/lib.rs:378

blabaere commented 9 years ago

That blog also uses sleep between bind and connect, in seconds: http://tim.dysinger.net/posts/2013-09-16-getting-started-with-nanomsg.html

thehydroimpulse commented 9 years ago

Interesting. I'm wondering if that's the expected procedure for nanomsg, to wait an arbitrary length of time.

blabaere commented 9 years ago

It seems that thread-safety is bought with asynchronous behavior in some places. I will digg a little bit more.

Le mar. 9 déc. 2014 22:37, Daniel Fagnan notifications@github.com a écrit :

Interesting. I'm wondering if that's the expected procedure for nanomsg, to wait an arbitrary length of time.

— Reply to this email directly or view it on GitHub https://github.com/thehydroimpulse/nanomsg.rs/issues/70#issuecomment-66362504 .

blabaere commented 9 years ago

Indeed, using barrier and sleep all over the place to ensure that bind/connect and send/receive correct timings fixes the problem where the test runner would hang forever.

A call to nn_term and one more sleep fixes the nanomsg assertion failures.

The ugliness can be witnessed here: https://github.com/blabaere/nanomsg.rs/blob/multithreadtest/libnanomsg/src/lib.rs

@thehydroimpulse, are we sure we want to show this kind of stuff ?

thehydroimpulse commented 9 years ago

@blabaere mmmm, that's a bit unfortunate.

blabaere commented 9 years ago

Got it right this time !

The send/receive order issue I ~~saw~~ imagined was only the requirement to not close the sender too early, that is before the receiver ever gets a chance to see the message. And now it works without any sleep calls, even with several parallel tests.

And the bind/connect order requirements can also be dropped. But if you implement that ordering "manually" with barriers the test runs way faster than if you let nanomsg handle the "retry" (wild guess).

And guess who looks all nice and tidy now ?https://github.com/blabaere/nanomsg.rs/blob/multithreadtest/libnanomsg/src/lib.rs

thehydroimpulse commented 9 years ago

@blabaere Wow, that's awesome! That looks sooo much cleaner.

blabaere commented 9 years ago

To check the thread-safety of individual sockets, one should be shared between several tasks. But of course, Rust will not let you do that since all the read/write operations require a mutable socket. The mutability being specified in the traits, the only way I see to make this possible is to allow the cloning off the sockets.

@thehydroimpulse, do you see any problem with Socket having the Clone trait, or do you think there is another way to enable socket sharing between tasks ?

thehydroimpulse commented 9 years ago

@blabaere I don't see a huge problem with having Clone implemented for the socket. This is how the Tcp sockets in Rust work so that you can have a thread dedicated to blocking on receiving new connections and another thread can close the socket over the boundary. Otherwise there's no way to shutdown that thread.

thehydroimpulse commented 9 years ago

There's an implied ownership over sockets, however. Less so with nanomsg because they're not as generic as raw tcp. For example, if two threads can read and write on the same socket, you have to have some sort of implied ownership on what operations should be allowed on either thread (otherwise both threads can be reading and they both won't get the full message, it'll be split between the two)

blabaere commented 9 years ago

I didn't think of that message split problem, but it must be fun to debug too.

Another potential problem I see now is that nanomsg recycles the int values returned by nn_socket. If a user clones a Rust socket, drops one of the clones and then recreates another socket, there is a good chance he will end up with two different Rust sockets using the same C socket, with funny effects. This would probably be quite difficult to diagnose.

The choice we have is between :

a safe one (no change), with the downside of not allowing any sharing of sockets between tasks.
a powerful but risky one (clone), with the downside of letting a user shoot himself in the foot.
a middle ground through a pub unsafe fn to create a Rust socket from a C socket with a huge warning in the doc comments.

I'm not in favor of option 2 because I don't want to find myself in the situation mentioned above. And it looks like Rust is all about safety, but that's just my personal opinion here. From the nanomsg documentation ... using a single socket from multiple threads in parallel is still discouraged .... Maybe we could just wait for someone to come up with a real world use case and see then how to handle it.

thehydroimpulse commented 9 years ago

Yeah, I think that's a good approach. I definitely want nanomsg to be as safe as possible and I think not allowing clone would do this.

Parakleta commented 8 years ago

I've checked the C code and I'm confident that the library is properly thread-safe. Sockets are created under a global lock, and then interacted with under a private mutex.

I think you're right that you cannot have Clone/Copy because you have Drop and this is good.

I think the only mistake in the current implementation with regards to multithreading is that you implement the Read and Write traits. These are not the correct traits for this since nanomsg is packed based rather than stream based. Some of the derived methods actually do random weird things because recv always pulls a message and throws away the unused portion, so things like the Read iterators will just grab the first byte of each message rather than iterating through the message. Please see std::net::UdpSocket as an alternative.

Once the Read and Write traits are removed you can then also remove all mutable accesses (except for Drop) and then it becomes trivial to wrap the socket in an Arc and pass it between threads.

blabaere commented 8 years ago

Your remark about the Read and Write traits is interesting, even without considering the multi-threading part. I remember we already had to fix a bug related to the way these traits are used in std.

Regarding the removal of mutable accesses I think we need to review all socket methods and decide which we should fix. For example I'm not sure I would make bind and connect immutable. But I agree read and write should follow the UdpSocket example.

Parakleta commented 8 years ago

bind and connect are the cases I thought about being made mutable too, but then I realised that was just a feeling and not based on any logical argument I could make. bind and connect don't mutate anything that Rust can see and they're threadsafe. It doesn't matter which order they are called in with regards to other methods in a strict program correctness sense, and if the programmer requires an ordering they should enforce it themselves, not through an API.

Consider an example where bind and connect may be interspersed with other actions, a bus topology network with a kind of mesh connection strategy. A new client would connect to some other unit in the mesh and then send a broadcast "hello" message through that unit that would forward it to everyone it knows. The recipients of that message could in a thread, recieve the message, connect to the new unit, and send it a welcome message. I can't make any argument for why a mutex lock would need to occur around the connect in that scenario.

blabaere commented 8 years ago

I think that having mutability stated in signatures is not only about multi-threading, or pleasing the compiler or event observable effects in a program but also about communicating intents to developers using a library, and providing some guarantees. In the case of a library depending on nanomsg, let's call it X, borrowing a socket or a mutable socket would convey two different meanings. If connect and bind are requiring a mutable borrow, as a program developer using X and nanomsg, I could safely pass a configured pub socket, knowing it will be sending data to peers I control, and that X will not connect to some malicious subscriber, leaking all my precious data.

I understand the need to support a scenario where a socket is configured on the fly, based on the program execution outcome, including the content of messages exchanged over this socket. This is completely legitimate and I think doing this from multiple threads falls in the shared mutability category which requires special care to handle, and Rust is just making it explicit.

As a conclusion, this is technically feasible because thread-safety is guaranteed by nanomsg implementation itself, but I'm not sure this is the right thing to do. @thehydroimpulse any opinion on this ?

Parakleta commented 8 years ago

I find your argument compelling because I like the idea but I think digging deeper it's the wrong approach. From the Rust documentation:

So, that’s the real definition of ‘immutability’: is this safe to have two pointers to? In Arc’s case, yes: the mutation is entirely contained inside the structure itself. It’s not user facing. For this reason, it hands out &T with clone(). If it handed out &mut Ts, though, that would be a problem.

I suspect to get the kind of access control you might have to put the bind and connect methods in a separate trait and then the module X could pub use the socket and just use the trait.

blabaere commented 8 years ago

Another simple (too simple maybe ?) view on that problem is to consider that each of the two threads owns its socket, and if a message received from one socket could requires the application to call bind or connect on the other socket, then the application could simple have the first thread tell the second what to do. One could set up a thread to read commands from a regular channel, where the commands would something like, recv, send(msg), bind(addr), connect(addr).
Would this kind of design solve your problem in a practical way ?

Parakleta commented 8 years ago

I can already make the socket shareable between threads in a much simpler way using Arc<Mutex<T>>, my point was just that the Mutex<T> overhead is unnecessary because the underlying library already handles that for us, so we're doing the work twice and it's clumsy.

blabaere commented 8 years ago

OK for removing all mutability markers then.

palkeo commented 3 years ago

I found this issue while writing multi-threaded code that uses nb_write to publish a message, seems like this function still requires a mutable socket, so I need to wrap it around a lock when it seems like the underlying socket is perfectly thread-safe.

What's the status of this issue? Are the maintainers still ok to drop mutability markers but simply didn't get around to do it?

blabaere commented 3 years ago

I'm still totally OK with accepting a PR that would remove the mutability markers.

thehydroimpulse / nanomsg.rs

Multi-threading #70

0 0x00007f51cf52c9b3 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6

1 0x00007f51d0049932 in nn_poller_wait (self=0x7f51d026c644, timeout=100) at src/aio/poller_epoll.inc:186

2 0x00007f51d004c35e in nn_worker_routine (arg=0x7f51d026c600) at src/aio/worker_posix.inc:176

3 0x00007f51d004e67d in nn_thread_main_routine (arg=) at src/utils/thread_posix.inc:35