sandialabs / portals4

Portals is a low-level network API for high-performance networking on high-performance computing systems, developed by Sandia National Laboratories, Intel Corporation, and the University of New Mexico. The Portals 4 Reference Implementation is a complete implementation of Portals 4, with transport over InfiniBand VERBS and UDP. Shared memory transport is available as an optimization, including Linux KNEM support. The reference implementation is supported on modern 64-bit Linux and 64-bit Mac OS X. It has been developed by Sandia National Laboratories, Intel Corporation, and System Fabric Works. For more information on the Portals 4 standard, please see the Portals 4 page.
https://www.sandia.gov/portals/

Possible lost trigger operation #5

GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
While running some tests, I saw an occasional hang in a test that does 
a lot of broadcasts. I went back into our library tests and created a small 
example that reproduces the problem.

The example is rather simple and involves only two threads and one triggered 
operation (the enclosed code can also run with more threads):

Thread 0 - sets up a triggered op that sends a value to Thread 1 once it has 
received a message saying that Thread 1 is ready.
Thread 1 - sends a message to Thread 0 and waits to receive the broadcast value 
back from Thread 0.

The enclosed Makefile builds the "test_broadcast" executable, which does not
use triggered ops and works as expected. "test_broadcast-t" uses triggered ops 
and hangs after running for a while. If I introduce a 1 us delay in any of the 
threads, the test with triggered ops passes (the triggered op is then set up 
either before or after the message from Thread 1 arrives, not at the same 
time). Run as "yod -n 2 test_broadcast-t".

You might need to run the program a few times. I ran the test with the threads 
on the same node and on different nodes.

When the system hangs, I see that Thread 0 has received the message from 
Thread 1, but the triggered op somehow never fired: Thread 0 waits for the MD 
ack, while Thread 1 waits to receive the value from Thread 0. The message from 
T1 to T0 did arrive (this is confirmed by the CTWait and by the actual value 
in the buffer).
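
To make the expected behavior concrete, here is a small toy model in plain C of a counter with one triggered operation. It is not Portals code and not the enclosed test: every name in it (toy_ct_t, toy_ct_arm, toy_ct_inc, and so on) is hypothetical, and it is single-threaded, so it only shows the two orderings and not the race itself. The assumption it encodes is that the op must fire exactly once whether it is set up before or after the counter reaches the threshold.

```c
/* Toy model only: none of these names exist in the Portals 4 API. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t count;          /* models the counter bumped by an incoming message */
    bool armed;            /* true while a triggered op is pending */
    size_t threshold;      /* fire once count >= threshold */
    void (*op)(void *);    /* the deferred operation */
    void *arg;             /* argument passed to the deferred operation */
} toy_ct_t;

/* Fire the pending op if the threshold has been reached. */
static void toy_ct_fire_if_due(toy_ct_t *ct) {
    if (ct->armed && ct->count >= ct->threshold) {
        ct->armed = false;             /* disarm first so it fires only once */
        ct->op(ct->arg);
    }
}

/* Models the arrival of a message that increments the counter. */
static void toy_ct_inc(toy_ct_t *ct) {
    ct->count++;
    toy_ct_fire_if_due(ct);
}

/* Models setting up a triggered op. The check after arming is the
 * important part: if the counter already reached the threshold, the
 * op has to fire right away, otherwise it would be lost. */
static void toy_ct_arm(toy_ct_t *ct, size_t threshold,
                       void (*op)(void *), void *arg) {
    assert(op != NULL);
    ct->threshold = threshold;
    ct->op = op;
    ct->arg = arg;
    ct->armed = true;
    toy_ct_fire_if_due(ct);
}

/* Stand-in for the send to Thread 1: just counts how often it ran. */
static void count_fire(void *arg) {
    ++*(int *)arg;
}

/* Ordering A: the op is set up first, the message arrives later. */
int toy_arm_then_inc(void) {
    toy_ct_t ct = {0};
    int fired = 0;
    toy_ct_arm(&ct, 1, count_fire, &fired);
    toy_ct_inc(&ct);
    return fired;
}

/* Ordering B: the message arrives first, the op is set up later. */
int toy_inc_then_arm(void) {
    toy_ct_t ct = {0};
    int fired = 0;
    toy_ct_inc(&ct);
    toy_ct_arm(&ct, 1, count_fire, &fired);
    return fired;
}
```

Both toy_arm_then_inc() and toy_inc_then_arm() are meant to return 1. In a real multi-threaded implementation, the arm step and the increment step would additionally have to be atomic with respect to each other; if the check in toy_ct_arm and the check in toy_ct_inc can interleave, neither side may see the op as due, which would look like the hang described above. Whether that is what actually happens in the library is only a guess on my part.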

Original issue reported on code.google.com by nvukice...@gmail.com on 19 Jul 2012 at 11:13

Attachments:

GoogleCodeExporter commented 9 years ago

Original comment by fzs...@gmail.com on 26 Jul 2012 at 3:29