Open ompiteam opened 10 years ago
Imported from trac issue 2981. Created by jsquyres on 2012-01-26T17:42:14, last modified: 2014-05-20T17:59:11
Trac comment by jsquyres on 2012-01-26 17:43:12:
Oops -- this is a DDT issue, and I meant to assign it to George. :-)
Trac comment by jsquyres on 2012-04-17 11:18:51:
George -- can you have a look?
Trac comment by jsquyres on 2012-04-24 14:05:39:
No fix provided yet -- pushing to 1.6.1.
Trac comment by bosilca on 2014-05-20 17:59:11:
This is a more general issue we have in Open MPI with the tuned collectives. If the send and the receive datatypes and counts are not identical, the message splitting decision is wrong (as it splits in repetitions of the entire datatype), leading to truncation in the best case and to wrong messages in the worst one. Without going through a packed version, there is no easy fix.
@bosilca can this be closed?
This isn't fixed and is not going to be. The simplest solution for applications requiring collectives with different type signatures (but the same typemap) is to disable all pipelining for MPI collectives.
@bosilca Is there a way to just disable the pipelining for MPI collectives? I think the big hammer is disabling the entire tuned collective component, but perhaps there's a better approach? I see you can force a non-pipelined algorithm for both bcast and reduce algorithms, but is there a better approach?
First, all pipeline algorithms suffer from this issue, not only those in the tuned collectives. Second, disabling tuned or more generally disabling pipelining will have a drastic performance impact on most applications (and not only for DL). Last, tuned is the only collective component that supports MPI_T as a means to configure the collective decision per communicator (and there are several examples on our mailing lists on how to achieve this for the tuned module).
related (but not same): #199 #1763
Per http://www.open-mpi.org/community/lists/devel/2012/01/10215.php, MPI_GATHER using coll:tuned, linear_sync can be truncated improperly.
I slightly modified the program that was originally sent and attached it here. It shows the problem for me on trunk and v1.5 (I assume it's also a problem on v1.4).
Many thanks for the bug report from Fujitsu.