Portals is a low-level network API for high-performance networking on high-performance computing systems, developed by Sandia National Laboratories, Intel Corporation, and the University of New Mexico. The Portals 4 Reference Implementation is a complete implementation of Portals 4, with transport over InfiniBand VERBS and UDP. Shared memory transport is available as an optimization, including Linux KNEM support. The Portals 4 reference implementation is supported on both modern 64-bit Linux and 64-bit Mac OS X. The reference implementation has been developed by Sandia National Laboratories, Intel Corporation, and System Fabric Works. For more information on the Portals 4 standard, please see the Portals 4 page.
While running some tests I saw an occasional hang in a test that does
a lot of broadcasts. I went back to our library tests and created a small
example that reproduces the problem.
The example is rather simple and involves only two threads and one triggered
operation (the enclosed code is able to run multiple threads):
Thread 0 - sets up a triggered op that sends a value to Thread 1 once Thread 0
has received a message saying that Thread 1 is ready.
Thread 1 - sends a message to Thread 0 and waits to receive the broadcast value
back from Thread 0.
The enclosed Makefile builds the "test_broadcast" executable, which does not
use triggered ops and works as expected. "test_broadcast-t" uses triggered ops
and hangs after running for a while. If I introduce a 1us delay in any of the
threads, the test with triggered ops passes (the triggered op is then called
before or after the message from Thread 1 arrives). Run as
"yod -n 2 test_broadcast-t". You might need to run the program a few times.
I ran the test with the threads on the same node and on different nodes.
When the system hangs I see that Thread 0 has received the message from Thread 1,
but somehow the triggered op never took place: Thread 0 waits for the MD ack,
while Thread 1 waits to receive the value from Thread 0. The message from T1 to
T0 did arrive (this is confirmed by the CTWait and the actual value in the buffer).
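To make the expected behavior concrete, here is a minimal single-threaded sketch of the semantics the test relies on: the triggered op should fire whether it is set up before or after the counter reaches its threshold. This is a model only and not Portals code. Every name in it (model_ct, model_ct_inc, model_triggered_op, model_send_value) is hypothetical and is not part of the Portals 4 API or of the enclosed test.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a counting event with one triggered operation
 * attached. It is NOT the Portals 4 API; it only illustrates the ordering
 * the test depends on. */
typedef struct {
    unsigned long success;    /* events counted so far */
    unsigned long threshold;  /* count at which the pending op must fire */
    bool armed;               /* true while a triggered op is pending */
    void (*op)(void *);       /* stand-in for the triggered send */
    void *arg;                /* argument passed to op */
} model_ct;

/* Fire the pending op once, as soon as the count has reached the threshold. */
static void model_ct_fire_if_ready(model_ct *ct)
{
    if (ct->armed && ct->success >= ct->threshold) {
        ct->armed = false;
        ct->op(ct->arg);
    }
}

/* Reset the counter and drop any pending op. */
void model_ct_init(model_ct *ct)
{
    ct->success = 0;
    ct->threshold = 0;
    ct->armed = false;
    ct->op = NULL;
    ct->arg = NULL;
}

/* Models the arrival of the "ready" message from Thread 1. */
void model_ct_inc(model_ct *ct)
{
    ct->success++;
    model_ct_fire_if_ready(ct);
}

/* Models Thread 0 setting up the triggered op. The current count is checked
 * here as well, so an op set up after the message arrived still fires. */
void model_triggered_op(model_ct *ct, unsigned long threshold,
                        void (*op)(void *), void *arg)
{
    ct->threshold = threshold;
    ct->op = op;
    ct->arg = arg;
    ct->armed = true;
    model_ct_fire_if_ready(ct);
}

/* Stand-in for sending the broadcast value: counts how often it ran. */
void model_send_value(void *arg)
{
    *(int *)arg += 1;
}
```

In this model both orderings (set up the op and then count the message, or count the message and then set up the op) end with exactly one send. The hang described above looks like a case where the two steps overlap and neither path notices that the threshold has been reached, but that is only a guess from the symptoms.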
Original issue reported on code.google.com by nvukice...@gmail.com on 19 Jul 2012 at 11:13