open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

A small array of a derived data type (in Fortran) can be sent by MPI_Isend and MPI_Irecv, but it ran into errors when I enlarged the array #12595

Bellahra opened this issue 2 months ago

Bellahra commented 2 months ago

Please submit all the information below so that we can understand the working environment that is the context for your question.

Background information

I want to exchange some data of a derived data type between several ranks. When the sent data is a small array, the data are sent and received successfully. But when I changed the array from e(2,2) to e(200,200) and sent e(1,1:100), it showed errors. I did not revise any other part of the code, only the dimensions of the array, which is strange. I also tested whether this problem occurs when the data type is double precision and found that it does not.

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v4.0.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

source

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running


Details of the problem

The derived data type is efield and MPI_EFIELD is the corresponding MPI datatype. I use MPI_Isend and MPI_Irecv to exchange the derived-type data e between rank 0 and rank 1. It works well when I send a small array such as e(2,2). However, when I handled a larger array, e(200,200), and sent e(1,1:100), it ran into errors and it seemed that the data were not exchanged. The first listing is the example code for the small array, i.e., e(2,2), followed by its output:

program main
  use mpi
  implicit none

  type efield
    double precision :: r1, r2, i1, i2
  end type efield
  integer :: status(MPI_STATUS_SIZE)
  integer :: rank, n_ranks, request, ierr, status0,neighbour
  type(efield), dimension(:,:), allocatable :: e
  type(efield) :: etype
  integer :: MPI_EFIELD
  integer, dimension(1:4) :: types = MPI_DOUBLE_PRECISION
  integer(MPI_ADDRESS_KIND) :: base_address, addr_r1, addr_r2, addr_i1, addr_i2
  integer(MPI_ADDRESS_KIND), dimension(4) :: displacements
  integer, dimension(4) :: block_lengths = 1
  ! Initialize MPI
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, n_ranks, ierr)
  ! Create MPI data type for efield
  call MPI_Get_address(etype%r1, addr_r1, ierr)
  call MPI_Get_address(etype%r2, addr_r2, ierr)
  call MPI_Get_address(etype%i1, addr_i1, ierr)
  call MPI_Get_address(etype%i2, addr_i2, ierr)
  call MPI_Get_address(etype, base_address, ierr)
  displacements(1) = addr_r1 - base_address
  displacements(2) = addr_r2 - base_address
  displacements(3) = addr_i1 - base_address
  displacements(4) = addr_i2 - base_address
  call MPI_Type_Create_Struct(4, block_lengths, displacements, types, MPI_EFIELD, ierr)
  call MPI_Type_Commit(MPI_EFIELD, ierr)
  print*,'MPI Create Struct: MPI_EFIELD',ierr
  allocate(e(2,2), STAT=status0)
  if (status0 /= 0) then
    print *, 'Allocation error on rank', rank, 'status', status0
    call MPI_Abort(MPI_COMM_WORLD, status0, ierr)
  else
    print *, rank, 'allocates e successfully', status0
  end if
  if (rank == 0) then
    e%r1 = 0.0
    e%r2 = 0.0
    e%i1 = 0.0
    e%i2 = 0.0 
    neighbour=1
  else if (rank == 1) then
    e%r1 = 1.0
    e%r2 = 1.0
    e%i1 = 1.0
    e%i2 = 1.0
    neighbour=0  
  end if
  call MPI_Isend(e(1,1), 1, MPI_EFIELD, neighbour, 0, MPI_COMM_WORLD, request, ierr)
  call MPI_Irecv(e(1,1), 1, MPI_EFIELD, neighbour, 0, MPI_COMM_WORLD, request, ierr)
  print *, 'before MPI_BARRIER', rank
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  print *, 'after MPI_BARRIER'
  print*,rank,e(1,1)%r1,e(1,2)%r1,'rank, after, before'
  ! Cleanup
  deallocate(e, STAT=status0)
  if (status0 /= 0) then
    print *, 'Deallocation error on rank', rank, 'status', status0
  end if
  call MPI_Finalize(ierr)
end program main

output:

 MPI Create Struct: MPI_EFIELD           0
           0 allocates e successfully           0
 before MPI_BARRIER           0
 MPI Create Struct: MPI_EFIELD           0
           1 allocates e successfully           0
 before MPI_BARRIER           1
 after MPI_BARRIER
 after MPI_BARRIER
           1   0.0000000000000000        1.0000000000000000      rank, after, before
           0   1.0000000000000000        0.0000000000000000      rank, after, before

This is the second code, where I only changed the dimensions of e and the count of sent data; it is followed by its output:

program main
  use mpi
  implicit none

  type efield
    double precision :: r1, r2, i1, i2
  end type efield
  integer :: status(MPI_STATUS_SIZE)
  integer :: rank, n_ranks, request, ierr, status0,neighbour
  type(efield), dimension(:,:), allocatable :: e
  type(efield) :: etype
  integer :: MPI_EFIELD
  integer, dimension(1:4) :: types = MPI_DOUBLE_PRECISION
  integer(MPI_ADDRESS_KIND) :: base_address, addr_r1, addr_r2, addr_i1, addr_i2
  integer(MPI_ADDRESS_KIND), dimension(4) :: displacements
  integer, dimension(4) :: block_lengths = 1
  ! Initialize MPI
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, n_ranks, ierr)
  ! Create MPI data type for efield
  call MPI_Get_address(etype%r1, addr_r1, ierr)
  call MPI_Get_address(etype%r2, addr_r2, ierr)
  call MPI_Get_address(etype%i1, addr_i1, ierr)
  call MPI_Get_address(etype%i2, addr_i2, ierr)
  call MPI_Get_address(etype, base_address, ierr)
  displacements(1) = addr_r1 - base_address
  displacements(2) = addr_r2 - base_address
  displacements(3) = addr_i1 - base_address
  displacements(4) = addr_i2 - base_address
  call MPI_Type_Create_Struct(4, block_lengths, displacements, types, MPI_EFIELD, ierr)
  call MPI_Type_Commit(MPI_EFIELD, ierr)
  print*,'MPI Create Struct: MPI_EFIELD',ierr
  allocate(e(200,200), STAT=status0)
  if (status0 /= 0) then
    print *, 'Allocation error on rank', rank, 'status', status0
    call MPI_Abort(MPI_COMM_WORLD, status0, ierr)
  else
    print *, rank, 'allocates e successfully', status0
  end if
  if (rank == 0) then
    e%r1 = 0.0
    e%r2 = 0.0
    e%i1 = 0.0
    e%i2 = 0.0 
    neighbour=1
  else if (rank == 1) then
    e%r1 = 1.0
    e%r2 = 1.0
    e%i1 = 1.0
    e%i2 = 1.0
    neighbour=0  
  end if
  call MPI_Isend(e(1,1:100), 100, MPI_EFIELD, neighbour, 0, MPI_COMM_WORLD, request, ierr)
  call MPI_Irecv(e(1,1:100), 100, MPI_EFIELD, neighbour, 0, MPI_COMM_WORLD, request, ierr)
  print *, 'before MPI_BARRIER', rank
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  print *, 'after MPI_BARRIER'
  print*,rank,e(1,1)%r1,e(1,199)%r1,'rank, after, before'
  ! Cleanup
  deallocate(e, STAT=status0)
  if (status0 /= 0) then
    print *, 'Deallocation error on rank', rank, 'status', status0
  end if
  call MPI_Finalize(ierr)
end program main

output:

 MPI Create Struct: MPI_EFIELD           0
           1 allocates e successfully           0
 MPI Create Struct: MPI_EFIELD           0
           0 allocates e successfully           0
 before MPI_BARRIER           1
 before MPI_BARRIER           0
 after MPI_BARRIER
 after MPI_BARRIER
           0   0.0000000000000000        0.0000000000000000      rank, after, before
           1   1.0000000000000000        1.0000000000000000      rank, after, before

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7f702bcbbd11 in ???
#1  0x7f702bcbaee5 in ???
#2  0x7f702baec08f in ???
    at /build/glibc-LcI20x/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
#3  0x7f702bb40f84 in _int_malloc
    at /build/glibc-LcI20x/glibc-2.31/malloc/malloc.c:3742
#4  0x7f702bb43298 in __GI___libc_malloc
    at /build/glibc-LcI20x/glibc-2.31/malloc/malloc.c:3066
#5  0x7f702acd2ff9 in ???
#6  0x7f702b9d2caa in ???
#7  0x7f702bfaec8c in ???
#8  0x55626630efa2 in ???
#9  0x55626630efe2 in ???
#10  0x7f702bacd082 in __libc_start_main
    at ../csu/libc-start.c:308
#11  0x55626630e1cd in ???
#12  0xffffffffffffffff in ???
#0  0x7f99e1df8d11 in ???
#1  0x7f99e1df7ee5 in ???
#2  0x7f99e1c2908f in ???
    at /build/glibc-LcI20x/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
#3  0x7f99e1c7df84 in _int_malloc
    at /build/glibc-LcI20x/glibc-2.31/malloc/malloc.c:3742
#4  0x7f99e1c80298 in __GI___libc_malloc
    at /build/glibc-LcI20x/glibc-2.31/malloc/malloc.c:3066
#5  0x7f99e0e0fff9 in ???
#6  0x7f99e1b0fcaa in ???
#7  0x7f99e20ebc8c in ???
#8  0x55c17b023fa2 in ???
#9  0x55c17b023fe2 in ???
#10  0x7f99e1c0a082 in __libc_start_main
    at ../csu/libc-start.c:308
#11  0x55c17b0231cd in ???
#12  0xffffffffffffffff in ???
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node bellpc exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I also tested other cases where the data type is double precision, and they worked well. So I wonder what the reason for this is and how I could solve the problem.

bosilca commented 2 months ago

There are some major issues with this code; let me highlight two:

  1. You posted your intent to communicate (MPI_Isend and MPI_Irecv), but you never checked whether the communications completed (MPI_Wait* or MPI_Test*). Until they are completed, you are not supposed to use (for the receiver) or alter (for the sender) the buffers involved in nonblocking communications.
  2. You are sending and receiving from the exact same buffer. What exactly do you expect to find inside, the original or the new values? If you really want to replace the data in place, you should use MPI_Sendrecv_replace.

Additional suggestions for improving this code:

  1. Open MPI does not have support for Fortran array descriptors (a.k.a. CFI_cdesc_t). This means the Fortran compiler has to flatten each subarray you are sending into a packed temporary array. Your datatype will still work, but keep in mind the performance implications of this additional internal data management.
  2. I would suggest you change your communication pattern to MPI_Irecv followed by MPI_Isend, to make sure all communications are expected on the receiver side; a sketch combining these fixes is shown below.
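
A minimal sketch (not from the original report) of how these points could be applied to the communication part of the first program, reusing its declarations of e, MPI_EFIELD, neighbour and ierr; the receive buffer e_recv and the requests/statuses arrays are introduced here only for illustration:

  type(efield) :: e_recv
  integer :: requests(2)
  integer :: statuses(MPI_STATUS_SIZE, 2)

  ! Post the receive first so it is already expected when the message arrives,
  ! and receive into a separate buffer rather than into the buffer being sent.
  call MPI_Irecv(e_recv, 1, MPI_EFIELD, neighbour, 0, MPI_COMM_WORLD, requests(1), ierr)
  call MPI_Isend(e(1,1), 1, MPI_EFIELD, neighbour, 0, MPI_COMM_WORLD, requests(2), ierr)
  ! Neither buffer may be read (e_recv) or modified (e(1,1)) until both
  ! nonblocking operations have completed.
  call MPI_Waitall(2, requests, statuses, ierr)
  e(1,1) = e_recv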

Bellahra commented 2 months ago

@bosilca Thank you very much for your kind and helpful reply. The original code works well after adding the MPI_Wait operations. I still have a question about the difference between the combination of MPI_Isend and MPI_Irecv and MPI_Sendrecv_replace. In the code above, since different ranks own different values of the same variable, the replacement is done by sending from a buffer and then overwriting that same buffer with MPI_Irecv. I am not clear on whether there is any risk in doing so, and on how this differs from using MPI_Sendrecv_replace. Looking forward to your reply.
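
For reference, this is a minimal sketch (not part of the thread) of what the MPI_Sendrecv_replace variant could look like for a single element, reusing e, MPI_EFIELD, neighbour, status and ierr from the program above. The call performs the send and the in-place overwrite internally and only returns once both are complete, so there is no window in which the same buffer is simultaneously owned by an unfinished send and an unfinished receive:

  ! Blocking in-place exchange: send e(1,1) to neighbour and replace it
  ! with the value received from neighbour.
  call MPI_Sendrecv_replace(e(1,1), 1, MPI_EFIELD, neighbour, 0, &
                            neighbour, 0, MPI_COMM_WORLD, status, ierr)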

ggouaillardet commented 2 months ago

Even if the issues reported by @bosilca are addressed, I do not think this can work: since subarrays are passed to MPI_Isend() and MPI_Irecv(), temporary flattened arrays are allocated by the Fortran runtime and deallocated when these subroutines return, which typically happens before the data is actually sent or received. Hence the undefined behavior, which can manifest as a crash.

Bottom line: subarrays should not be used with non-blocking communications for now.

Note that the MPI standard defines the MPI_SUBARRAYS_SUPPORTED and MPI_ASYNC_PROTECTS_NONBLOCKING "macros" (logical constants on the Fortran side), and they are currently both .false. under Open MPI.
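
One way to work around the missing subarray support, sketched here under the assumption that the strided section e(1,1:100) from the second program really has to be exchanged, is to do the packing and unpacking explicitly in user-managed contiguous buffers whose lifetime the program controls, releasing or reusing them only after the operations have completed:

  type(efield), allocatable :: sendbuf(:), recvbuf(:)
  integer :: requests(2)
  integer :: statuses(MPI_STATUS_SIZE, 2)

  allocate(sendbuf(100), recvbuf(100))
  sendbuf = e(1, 1:100)                  ! explicit copy of the strided section
  call MPI_Irecv(recvbuf, 100, MPI_EFIELD, neighbour, 0, MPI_COMM_WORLD, requests(1), ierr)
  call MPI_Isend(sendbuf, 100, MPI_EFIELD, neighbour, 0, MPI_COMM_WORLD, requests(2), ierr)
  call MPI_Waitall(2, requests, statuses, ierr)
  e(1, 1:100) = recvbuf                  ! unpack only after completion
  deallocate(sendbuf, recvbuf)

Alternatively, restructuring the data so that contiguous columns such as e(1:100,1) are exchanged typically avoids the copies altogether, since most compilers pass a contiguous section directly without a temporary.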

Bellahra commented 2 months ago

@ggouaillardet Thank you for your suggestion. It really helped me understand the issue better.