pmodels / mpich

Official MPICH Repository
http://www.mpich.org

question about mpich2 implementation #418

Closed mpichbot closed 8 years ago

mpichbot commented 8 years ago

Originally by michael michael.bane@manchester.ac.uk on 2009-02-15 13:32:34 -0600



I believe the attached code should work correctly but I find that using
the mpich2 implementation on one particular box it hangs for odd numbers
of processors (not every time but frequently), whereas this code runs
fine on another box I've tried (albeit with OpenMPI)

Details are below and I'd welcome suggestions as to the cause of the
problem. Note that if I add WRITE statements or use the debugger the
problem appears to go away. Adding a FLUSH and BARRIER immediately after
the WRITE stmt makes no difference.

To confuse myself further, if I replace the MPI_Send() by MPI_Ssend(),
i.e. synchronous, sometimes the code completes whereas other times it
appears to hang (see sync.out example at end).

Thanks, Michael

~/RCS/myCourses/Intro_to_MPI/MPI_Intro_exercises$ cat mkb_ring_solution_send_and_recv_portable.f90
        PROGRAM ring
! this program will work on all MPI implementations
        USE MPI
        IMPLICIT NONE

! since we're only sending a single message between any src/dest pair we
! can use a single tag
        INTEGER, PARAMETER :: myTag=101

        INTEGER :: ierror, inputRank, myRank, size
        INTEGER :: sendTo, recvFrom
        INTEGER :: recv_status(MPI_STATUS_SIZE)

! initialise MPI
        CALL MPI_INIT(ierror)

! determine my rank and total size
        CALL MPI_COMM_RANK(MPI_COMM_WORLD, myRank, ierror)
        CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)

! set up which process rank is to my right (ie clockwise) for sending
        sendTo = myRank + 1
        IF (sendTo == size) sendTo = 0
        recvFrom = myRank - 1
        IF (recvFrom == -1) recvFrom = size-1

! send my rank clockwise (from recvFrom to sendTo)

! to ensure not everybody is sending (and possibly waiting) at the
! same time, we split into even (send then recv)
! and odd (recv then send)

        if (mod(myRank,2)==0) then
           call mpi_send(myRank,1,MPI_INTEGER,sendTo,  myTag, &
                MPI_COMM_WORLD,ierror)
           call mpi_recv(inputRank,1,MPI_INTEGER,recvFrom,myTag, &
                MPI_COMM_WORLD,recv_status,ierror)
        else
           call mpi_recv(inputRank,1,MPI_INTEGER,recvFrom,myTag, &
                MPI_COMM_WORLD,recv_status,ierror)
           call mpi_send(myRank,1,MPI_INTEGER,sendTo,  myTag, &
                MPI_COMM_WORLD,ierror)
        endif

        write(*,*) 'i am #',myRank,' and I received a new rank=',inputRank

        CALL MPI_FINALIZE(ierror)

        END PROGRAM ring

~/RCS/myCourses/Intro_to_MPI/MPI_Intro_exercises$ mpif90 -O0 mkb_ring_solution_send_and_recv_portable.f90;mpif90 -show
ifort: Command line warning: overriding '-O3' with '-O0'
ifort -O3 -I/opt/mpi/mpibull2-0.9.7-2.t_RC4v4.3/include
-L/home/horace/mccssmb2/.mpibull2/lib
-L/opt/mpi/mpibull2-0.9.7-2.t_RC4v4.3/lib -lmpidev -lmpi -lrt -ldl
-lelan -lelanctrl -lcpuset

~/RCS/myCourses/Intro_to_MPI/MPI_Intro_exercises$ date;prun -O -n 5 -p login ./a.out
Sun Feb 15 19:06:56 GMT 2009
Using qxelan driver, build for MPIBull2 0.9.7-t (Ishtar) 20060726-1607
 i am #           1  and I received a new rank=           0
 i am #           2  and I received a new rank=           1
 i am #           3  and I received a new rank=           2
 i am #           4  and I received a new rank=           3

(CNTL-C)
~/RCS/myCourses/Intro_to_MPI/MPI_Intro_exercises$ date
Sun Feb 15 19:23:47 GMT 2009

~/RCS/myCourses/Intro_to_MPI/MPI_Intro_exercises$ prun -I -n 5 -p login ./sync.out
Using qxelan driver, build for MPIBull2 0.9.7-t (Ishtar) 20060726-1607
 i am #           1  and I received a new rank=           0
 i am #           2  and I received a new rank=           1

[prun -O means over-commit if required, -I means no over-commit but fail
without running if insufficient resources]

mpichbot commented 8 years ago

Originally by Rajeev Thakur on 2009-02-15 18:37:37 -0600



I believe you are using Bull MPI, not stock MPICH2. At least if you use the
MPICH2 from Argonne, your program should work. You may want to contact the
support folks for Bull MPI to see why it doesn't work with that MPI.

Rajeev

> -----Original Message-----
> From: mpich2-bugs-bounces@mcs.anl.gov
> [mailto:mpich2-bugs-bounces@mcs.anl.gov] On Behalf Of mpich2
> Sent: Sunday, February 15, 2009 1:33 PM
> To: undisclosed-recipients:
> Subject: [mpich2-maint] #418: question about mpich2 implementation
>
>  Reporter:  michael <michael.bane@manchester.ac.uk>
>      Type:  bug
>    Status:  new
>  Priority:  major
> Milestone:
> Component:  mpich2
>
> --
> Ticket URL: <https://trac.mcs.anl.gov/projects/mpich2/ticket/418>

mpichbot commented 8 years ago

Originally by michael bane on 2009-02-15 18:58:52 -0600



Yes, it is a Bull supplied MPI (I thought just their mpich2 build but
can check in the morning).

As much as anything I wanted to double check that the code *should*
work...

ta, M
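
For anyone who would rather sidestep the send/recv ordering question altogether, here is a minimal sketch of the same ring exchange using MPI_SENDRECV, which lets the library pair the send with the receive so no even/odd split is needed. This is not the code from the report: the program name ring_sendrecv and the MOD-based neighbour calculation are my own choices for illustration, and it has not been run against the Bull MPI build discussed here, so it says nothing about why the original hangs on that box.

```fortran
        PROGRAM ring_sendrecv
! sketch only: same ring exchange as the original program, but with a
! combined send/receive so each rank makes a single blocking call
        USE MPI
        IMPLICIT NONE

        INTEGER, PARAMETER :: myTag=101

        INTEGER :: ierror, inputRank, myRank, size
        INTEGER :: sendTo, recvFrom
        INTEGER :: recv_status(MPI_STATUS_SIZE)

        CALL MPI_INIT(ierror)
        CALL MPI_COMM_RANK(MPI_COMM_WORLD, myRank, ierror)
        CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)

! neighbours on the ring, wrapping around at both ends
! (assumption: MOD replaces the two IF tests of the original)
        sendTo   = MOD(myRank + 1, size)
        recvFrom = MOD(myRank - 1 + size, size)

! send my rank clockwise and receive my anticlockwise neighbour's rank
        CALL MPI_SENDRECV(myRank,    1, MPI_INTEGER, sendTo,   myTag, &
                          inputRank, 1, MPI_INTEGER, recvFrom, myTag, &
                          MPI_COMM_WORLD, recv_status, ierror)

        write(*,*) 'i am #',myRank,' and I received a new rank=',inputRank

        CALL MPI_FINALIZE(ierror)
        END PROGRAM ring_sendrecv
```

It builds and launches the same way as the original (mpif90, then the usual launcher). With 5 processes each rank should report the rank of its anticlockwise neighbour, in whatever order the output happens to arrive.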
