hpc / quo-vadis

A cross-stack coordination layer to dynamically map runtime components to hardware resources
BSD 3-Clause "New" or "Revised" License

ZMQ sends and threads conflict #162

Open GuillaumeMercier opened 1 month ago

GuillaumeMercier commented 1 month ago

ZMQ doesn't seem to like threads (I saw something about that in the documentation but can't find it right now). Currently, all threads share the same context and thus use the same socket to communicate with the server, which might be an issue. Or maybe ZMQ sockets don't like bursts of messages. The error is the following:

[1,0]<stderr>: [quo-vadis error at (qvi-rmi.cc::qvi_zerr_msg::101)] zmq_msg_send() failed with errno=156384763 (Unknown error 156384763)
[1,0]<stderr>: [quo-vadis error at (qvi-rmi.cc::qvi_zerr_msg::101)] zmq_msg_send() truncated with errno=156384763 (Unknown error 156384763)

Adding some delay in the example (with a call to sleep) fixes the issue; adding a lock doesn't fix anything, so I will investigate the burst of messages. I already tried ZMQ socket options, but without positive results.

See PR https://github.com/hpc/quo-vadis/pull/163

samuelkgutierrez commented 1 month ago

I'll take a look, @GuillaumeMercier. Thank you.

samuelkgutierrez commented 1 month ago

I have a fix doing it a brute-force way, but let me see if I can come up with a nicer solution.

GuillaumeMercier commented 1 month ago

And what is the current fix? I'm curious.

samuelkgutierrez commented 1 month ago

Having a context mutex managed by a lock_guard at the interface boundary. A little heavy-handed, so I think I can do better.
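For readers following along, the approach described above can be sketched roughly as below. This is a minimal illustration, not the actual quo-vadis fix: `demo_context`, `demo_bind_push`, and `demo_run` are hypothetical names, and a `std::vector` stands in for the shared socket that is not safe to use from several threads at once.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a context that owns one shared channel.
struct demo_context {
    std::mutex mutex;               // guards every member below
    std::vector<std::string> sent;  // stands in for the shared socket
};

// Hypothetical public entry point. The lock is taken here, at the
// interface boundary, so callers never have to lock anything themselves.
int demo_bind_push(demo_context &ctx, const std::string &msg) {
    std::lock_guard<std::mutex> guard(ctx.mutex);  // released on return
    ctx.sent.push_back(msg);
    return 0;
}

// Starts nthreads threads that each call the entry point ncalls times,
// then returns how many messages were recorded.
std::size_t demo_run(int nthreads, int ncalls) {
    demo_context ctx;
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back([&ctx, t, ncalls] {
            for (int i = 0; i < ncalls; ++i) {
                demo_bind_push(ctx, "thread " + std::to_string(t));
            }
        });
    }
    for (auto &th : threads) {
        th.join();
    }
    return ctx.sent.size();
}
```

Calling `demo_run(8, 1000)` records all 8000 messages because every path to the shared state goes through the locked entry point. The trade-off is the one noted above: all calls on a context are serialized, even those that could safely overlap.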

GuillaumeMercier commented 1 month ago

But I think I tried this and it didn't work.

samuelkgutierrez commented 1 month ago

I'm not sure how you implemented it, but mine seems to do the trick.

GuillaumeMercier commented 1 month ago

struct qv_context_s {
    qvi_rmi_client_t *rmi = nullptr;
    qvi_zgroup_t *zgroup = nullptr;
    qvi_bind_stack_t *bind_stack = nullptr;
    pthread_mutex_t lock;

GuillaumeMercier commented 1 month ago

Ok, I'm puzzled.

GuillaumeMercier commented 1 month ago

I guess it has to do with when/where you do lock/unlock

GuillaumeMercier commented 1 month ago

I put the lock/unlock phase around the call to qv_bind_push in qv_thread_routine
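A caller-side lock of the kind described above might look like the sketch below. All names (`caller_ctx`, `caller_push`, `caller_thread_routine`, `caller_run`) are hypothetical stand-ins, not the real quo-vadis code, and the thread does not show how the mutex in the actual attempt was initialized. Two general points apply to this pattern: a `pthread_mutex_t` must be initialized before its first use, and a lock held by the caller only serializes the calls it wraps, so any other call that reaches the same shared state without taking the lock stays unprotected.

```cpp
#include <pthread.h>

// Hypothetical context: only the lock member mirrors the struct in the thread.
struct caller_ctx {
    int pushes = 0;        // stands in for state touched by the wrapped call
    pthread_mutex_t lock;  // a pthread mutex must be initialized before use
    caller_ctx() { pthread_mutex_init(&lock, nullptr); }
    ~caller_ctx() { pthread_mutex_destroy(&lock); }
};

// Hypothetical stand-in for the library call; it takes no lock itself.
static int caller_push(caller_ctx *ctx) {
    ctx->pushes += 1;
    return 0;
}

// Thread routine: the caller locks around the one call it knows about.
static void *caller_thread_routine(void *arg) {
    caller_ctx *ctx = static_cast<caller_ctx *>(arg);
    for (int i = 0; i < 1000; ++i) {
        pthread_mutex_lock(&ctx->lock);
        caller_push(ctx);
        pthread_mutex_unlock(&ctx->lock);
    }
    return nullptr;
}

// Runs up to 16 threads through the routine and returns the final count.
int caller_run(int nthreads) {
    caller_ctx ctx;
    pthread_t tids[16];
    if (nthreads > 16) {
        nthreads = 16;
    }
    for (int i = 0; i < nthreads; ++i) {
        pthread_create(&tids[i], nullptr, caller_thread_routine, &ctx);
    }
    for (int i = 0; i < nthreads; ++i) {
        pthread_join(tids[i], nullptr);
    }
    return ctx.pushes;
}
```

In this isolated sketch the count comes out right because `caller_push` is the only path to the shared state. In a larger code base that guarantee is harder to keep from the caller's side, which is one reason to take the lock inside the library interface instead.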

samuelkgutierrez commented 1 month ago

@GuillaumeMercier your issues should be fixed by #164. I've also pushed a new branch named thread-bug-work that has the fixes to build your code. When you are ready, please issue a pull request so I can merge your work into master.