pmodels / mpich

Official MPICH Repository
http://www.mpich.org
Other
560 stars 279 forks source link

romio: corruption with a simple test using ADIOS2 BP5 engine #6182

Closed dqwu closed 2 years ago

dqwu commented 2 years ago

This issue is only reproducible with MPICH 4.x (not reproducible with MPICH 3.x or OpenMPI). Below are detailed steps to reproduce it.

[Build and install MPICH 4.0.2]

wget https://www.mpich.org/static/downloads/4.0.2/mpich-4.0.2.tar.gz
tar zxf mpich-4.0.2.tar.gz
cd mpich-4.0.2
CFLAGS="-g" ./configure --prefix=/path/to/mpich/installation
make -j4
make install

[Add installed MPICH to $PATH] export PATH=/path/to/mpich/installation/bin:$PATH

[Build and install ADIOS2 2.8.3]

wget https://github.com/ornladios/ADIOS2/archive/refs/tags/v2.8.3.tar.gz
tar zxf v2.8.3.tar.gz
cd ADIOS2-2.8.3
mkdir build
cd build
CC=mpicc CXX=mpicxx FC=mpif90 \
CFLAGS="-O2 -g" CPPFLAGS="-O2 -g" CXXFLAGS="-O2 -g" FCFLAGS="-O2 -g" \
cmake -DCMAKE_INSTALL_PREFIX=/path/to/adios2/installation \
-DBUILD_SHARED_LIBS=OFF -DADIOS2_BUILD_EXAMPLES=OFF -DBUILD_TESTING=OFF \
-DADIOS2_USE_Blosc=OFF \
-DADIOS2_USE_BZip2=OFF \
-DADIOS2_USE_ZFP=OFF \
-DADIOS2_USE_SZ=OFF \
-DADIOS2_USE_MGARD=OFF \
-DADIOS2_USE_PNG=OFF \
-DADIOS2_USE_DataMan=OFF \
-DADIOS2_USE_DataSpaces=OFF \
-DADIOS2_USE_MHS=OFF \
-DADIOS2_USE_ZeroMQ=OFF \
-DADIOS2_USE_HDF5=OFF \
-DADIOS2_USE_Python=OFF \
-DADIOS2_USE_Fortran=OFF \
-DADIOS2_USE_Profiling=OFF \
..
make -j4
make install

[Build and run the test program]

mpicxx test_adios_mpiio.c -O0 -g -Wall \
-DADIOS2_USE_MPI \
-I/path/to/adios2/installation/include \
`/path/to/adios2/installation/bin/adios2-config --c-libs`

mpiexec -n 4 ./a.out

[Example output]

[rank = 3], read unexpected value at index 0, expected: 30, actual: 3
[rank = 3], read unexpected value at index 1, expected: 30, actual: 0
[rank = 3], read unexpected value at index 2, expected: 30, actual: 0
[rank = 3], read unexpected value at index 3, expected: 30, actual: 0
[rank = 3], read unexpected value at index 4, expected: 30, actual: 0

Content of the test program (test_adios_mpiio.c)

#include <stdio.h>
#include <mpi.h>
#include <adios2_c.h>

int main(int argc, char* argv[])
{
  int rank;
  const int element_per_pe = 5;
  int write_buffer[element_per_pe];
  int read_buffer[element_per_pe];
  int i;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  for (i = 0; i < element_per_pe; i++) {
    write_buffer[i] = rank * 10;
    read_buffer[i] = -1;
  }

  MPI_Comm dummy_comm1;
  MPI_Comm_dup(MPI_COMM_WORLD, &dummy_comm1);

  adios2_adios *adiosH = adios2_init_mpi(MPI_COMM_WORLD);
  adios2_io *ioH = adios2_declare_io(adiosH, "test_adios.nc.bp0");
  adios2_set_engine(ioH, "BP5"); /* This issue is not reproducible if we change BP5 to BP4 */
  adios2_engine *engineH = adios2_open(ioH, "test_adios.nc.bp", adios2_mode_write);
  adios2_close(engineH);

  MPI_Comm_free(&dummy_comm1); /* This issue is not reproducible if we comment out this line */

  MPI_Comm dummy_comm2;
  MPI_Comm_dup(MPI_COMM_WORLD, &dummy_comm2);

  MPI_Comm io_comm1;
  MPI_Comm_dup(MPI_COMM_WORLD, &io_comm1);
  MPI_File fh1;
  MPI_File_open(io_comm1, "test_mpi_file1", MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh1);

  MPI_Comm io_comm2;
  MPI_Comm_dup(MPI_COMM_WORLD, &io_comm2);
  MPI_File fh2;
  MPI_Info info;
  MPI_Info_create(&info);
  MPI_Info_set(info, "romio_cb_write", "enable");
  MPI_File_open(io_comm2, "test_mpi_file2", MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh2);

  /* This issue is only reproducible with MPICH 4.x (MPICH 3.x or OpenMPI works fine)
     In src/mpi/romio/adio/common/ad_write_coll.c, fd->comm (duplicated from io_comm2)
     seems to be corrupted for MPI_Isend/MPI_Irecv, such that fd->io_buf on rank 0 fails
     to receive expected data from the last rank. If we replace fd->comm with MPI_COMM_WORLD,
     all results are correct.
   */
  MPI_File_write_at_all(fh2, rank * element_per_pe * sizeof(int), write_buffer, element_per_pe, MPI_INT, MPI_STATUS_IGNORE);
  MPI_File_close(&fh2);
  MPI_Info_free(&info);
  MPI_Comm_free(&io_comm2);

  MPI_File_close(&fh1);
  MPI_Comm_free(&io_comm1);

  MPI_Comm_free(&dummy_comm2);

  MPI_File fh3;
  MPI_File_open(MPI_COMM_WORLD, "test_mpi_file2", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh3);
  MPI_File_read_at_all(fh3, rank * element_per_pe * sizeof(int), read_buffer, element_per_pe, MPI_INT, MPI_STATUS_IGNORE);
  for (i = 0; i < element_per_pe; i++) {
    if (read_buffer[i] != write_buffer[i])
      printf("[rank = %d], read unexpected value at index %d, expected: %d, actual: %d\n", rank, i, write_buffer[i], read_buffer[i]);
  }
  MPI_File_close(&fh3);

  adios2_finalize(adiosH); /* This issue is not reproducible if we call adios2_finalize() immediately after adios2_close() */

  MPI_Finalize();

  return 0;
}
dqwu commented 2 years ago

@roblatham00 This is the simplest reproducer I have so far, which still requires linking to ADIOS2 lib. I will try to replace the ADIOS2 code in the test program with all-MPI code if I can.

dqwu commented 2 years ago

Valgrind output on ANL GCE nodes (Ubuntu 20, GCC 9.4.0)

==662465== Syscall param pwrite64(buf) points to uninitialised byte(s)
==662465==    at 0x489CD1F: __libc_pwrite64 (pwrite64.c:29)
==662465==    by 0x489CD1F: pwrite (pwrite64.c:27)
==662465==    by 0x6DA550B: ADIOI_GEN_WriteContig (ad_write.c:75)
==662465==    by 0x6DA6C2F: ADIOI_Exch_and_write (ad_write_coll.c:477)
==662465==    by 0x6DA5D07: ADIOI_GEN_WriteStridedColl (ad_write_coll.c:185)
==662465==    by 0x6D7CD38: MPIOI_File_write_all (write_all.c:168)
==662465==    by 0x6D7D5F4: PMPI_File_write_at_all (write_atall.c:71)
==662465==    by 0x1733FF: main (test_adios_mpiio.c:55)
==662465==  Address 0x12135080 is 64 bytes inside a block of size 16,777,216 alloc'd
==662465==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==662465==    by 0x6DB4286: MPL_malloc (mpl_trmem.h:280)
==662465==    by 0x6DB4327: ADIOI_Malloc_fn (malloc.c:41)
==662465==    by 0x6D9CB5A: ADIO_Open (ad_open.c:135)
==662465==    by 0x6D774ED: PMPI_File_open (open.c:143)
==662465==    by 0x1733C8: main (test_adios_mpiio.c:47)
hzhou commented 2 years ago

@dqwu Could you try reproducing this with 4.0.x romio + 3.4.x mpich or vice versa? I think you can do this by manually swapping the src/mpi/romio folder in the source tree. This should narrow it down to whether ROMIO or MPICH contains the breaking changes.

Is the creating of fh1 critical in the reproducer?

dqwu commented 2 years ago

@dqwu Could you try reproducing this with 4.0.x romio + 3.4.x mpich or vice versa? I think you can do this by manually swapping the src/mpi/romio folder in the source tree. This should narrow it down to whether ROMIO or MPICH contains the breaking changes.

Is the creating of fh1 critical in the reproducer?

Thanks for the suggestion, and I will test it later. Yes, the empty file created with fh1 is required to reproduce this issue.

dqwu commented 2 years ago

@hzhou I tried to build MPICH 4.0.2 with ROMIO copied from MPICH 3.4.3, but there are some build errors:

...
make[3]: Entering directory '/scratch/wuda/source_install/mpich-4.0.2/src/mpi/romio'
  CC       mpi-io/close.lo
  CC       mpi-io/delete.lo
  CC       mpi-io/file_c2f.lo
  CC       mpi-io/file_f2c.lo
In file included from ./adio/include/adio.h:61:0,
                 from mpi-io/mpioimpl.h:13,
                 from mpi-io/delete.c:6:
./adio/include/romioconf.h:10:10: fatal error: mplconfig.h: No such file or directory
 #include <mplconfig.h>
          ^~~~~~~~~~~~~
compilation terminated.

mplconfig.h can be found in the src directory, though.

find . -name 'mplconfig.h'
./src/mpl/include/mplconfig.h
hzhou commented 2 years ago

Let's try some hacking --

$ cd src/mpi/romio 
$ rm -rf mpl && cp -a ../../mpl .
$ make CPPFLAGS="-Impl/include -DMPL_FALLTHROUGH="
$ cd ../../..
$ make
dqwu commented 2 years ago

Let's try some hacking --

$ cd src/mpi/romio 
$ rm -rf mpl && cp -a ../../mpl .
$ make CPPFLAGS="-Impl/include -DMPL_FALLTHROUGH="
$ cd ../../..
$ make

This workaround works, but there are some linking errors:

lib/.libs/libmpi.so: undefined reference to `MPI_File_iwrite_c'
lib/.libs/libmpi.so: undefined reference to `MPI_File_iwrite_at_all_c'
...
lib/.libs/libmpi.so: undefined reference to `MPI_File_write_shared_c'
lib/.libs/libmpi.so: undefined reference toMakefile:14110: recipe for target 'src/env/mpichversion' failed
 `make[2]: *** [src/env/mpichversion] Error 1
hzhou commented 2 years ago

More hacking:

$ pushd src/glue/romio
$  ./all_romio_symbols ../../mpi/romio/include/mpio.h.in
$ popd
$ make
dqwu commented 2 years ago

$ pushd src/glue/romio $ ./all_romio_symbols ../../mpi/romio/include/mpio.h.in $ popd $ make

Thanks, I was able to build MPICH 4.0.2 with ROMIO of MPICH 3.4.3, and this issue is not reproducible.

I will try MPICH 3.4.3 with ROMIO of MPICH 4.0.2 later.

hzhou commented 2 years ago

Thanks, I was able to build MPICH 4.0.2 with ROMIO of MPICH 3.4.3, and this issue is not reproducible.

I guess that points the fault to the changes in ROMIO

dqwu commented 2 years ago

Thanks, I was able to build MPICH 4.0.2 with ROMIO of MPICH 3.4.3, and this issue is not reproducible.

I guess that points the fault to the changes in ROMIO

I tried to build MPICH 3.4.3 with ROMIO of MPICH 4.0.2, the build errors are:

make[3]: Entering directory '/scratch/wuda/source_install/mpich-3.4.3/src/mpi/romio'
  CC       mpi-io/delete.lo
  CC       mpi-io/close.lo
  CC       mpi-io/file_c2f.lo
  CC       mpi-io/file_f2c.lo
In file included from /scratch/wuda/source_install/mpich-3.4.3/src/include/mpi.h:2225:0,
                 from ./adio/include/adio.h:65,
                 from mpi-io/mpioimpl.h:13,
                 from mpi-io/delete.c:6:
/scratch/wuda/source_install/mpich-3.4.3/src/include/mpio.h:340:49: error: unknown type name ‘MPI_Datarep_conversion_function_c’; did you mean ‘MPI_Datarep_conversion_function’?
 int MPI_Register_datarep_c(const char *datarep, MPI_Datarep_conversion_function_c *read_conversion_fn,
                                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                 MPI_Datarep_conversion_function
...

Any hacking for this build issue?

hzhou commented 2 years ago

Yeah, paste in src/include/mpi.h after the line #define MPI_CONVERSION_FN_NULL ((MPI_Datarep_conversion_function *)0)

typedef int (MPI_Datarep_conversion_function_c)(void *, MPI_Datatype, MPI_Count,
             void *, MPI_Offset, void *);
#define MPI_CONVERSION_FN_NULL_C ((MPI_Datarep_conversion_function_c *)0)
dqwu commented 2 years ago

Yeah, paste in src/include/mpi.h after the line #define MPI_CONVERSION_FN_NULL ((MPI_Datarep_conversion_function *)0)

typedef int (MPI_Datarep_conversion_function_c)(void *, MPI_Datatype, MPI_Count,
             void *, MPI_Offset, void *);
#define MPI_CONVERSION_FN_NULL_C ((MPI_Datarep_conversion_function_c *)0)

Thanks, this works to build MPICH 3.4.3 with ROMIO of MPICH 4.0.2.

Unfortunately, MPICH 3.4.3 with ROMIO of MPICH 4.0.2 cannot reproduce this issue.

In summary (tested on my laptop with Ubuntu 18 and GCC 7.4.0): MPICH 4.0.2 + default ROMIO: reproducible MPICH 4.0.2 + ROMIO of MPICH 3.4.3: not reproducible MPICH 3.4.3 + default ROMIO: not reproducible MPICH 3.4.3 + ROMIO of MPICH 4.0.2: not reproducible

dqwu commented 2 years ago

@hzhou @roblatham00 Actually MPICH 4.0b1 cannot reproduce this issue but MPICH 4.0rc1 can.

MPICH 4.0b1 + default ROMIO: not reproducible MPICH 4.0rc1 + default ROMIO: reproducible MPICH 4.0b1 + ROMIO of MPICH 4.0rc1: reproducible MPICH 4.0rc1 + ROMIO of MPICH 4.0b1: not reproducible

This points the fault to the changes in ROMIO (between 4.0b1 and 4.0rc1, number of different files: 16).

It seems that this issue is caused by PR #5660. For MPICH 4.0.2, if I revert ADIOI_COLL_TAG macros (constant tag) back to myrank + i + 100 iter or myrank + p + 100 iter in ad_write_coll.c, this issue is not reproducible.

dqwu commented 2 years ago

@hzhou @roblatham00 Finally I was able to find a potential bug inside ADIOS2 BP5 engine code

void MPIShmChain::HandshakeLinks_Start(helper::Comm &comm, HandshakeStruct &hs)
{
    ...
    int rank = comm.Rank();
    hs.sendToken = rank;
    ...
    if (comm.Rank() > 0) // receive from previous
    {
        hs.recvRequest = comm.Irecv(
            &hs.recvToken, 1, rank - 1, 0,
            "Irecv handshake with neighbor, MPIChain aggregator, at Open");
    }
    else // rank 0 receives from last
    {
        hs.recvRequest = comm.Irecv(
            &hs.recvToken, 1, rank - 1, 0,
            "Irecv handshake with neighbor, MPIChain aggregator, at Open");
    }
    ...
}

For the 2nd comm.Irecv call, the argument "rank - 1" should be "comm.Size() - 1" (the source parameter of MPI_Irecv).

Due to this bug (maybe a typo), rank 0 will not receive the integer sendToken from rank 3 (the value is set to 3 on rank 3), and this pending data will be received later inside the ROMIO write function (on rank 3, the first int value written is always fixed as 3, which is the previous sendToken on rank 3 lost inside ADIOS2 BP5 engine code).

dqwu commented 2 years ago

@hzhou @roblatham00 I have a simple all-MPI test program to reproduce this issue now (only reproducible with MPICH 4.0rc1 or later when a constant tag 0 is used for collective write, since ADIOS2 BP5 engine code or this test program also uses tag 0 for MPI_Irecv/MPI_Irecv).

I think this issue can be closed later, but it will be helpful that newer versions of MPICH can detect an invalid source argument (e.g. -1) in MPI_Irecv and maybe give users a warning.

#include <stdio.h>
#include <mpi.h>

void simulate_ADIOS2_BP5_engine_code()
{
  int rank;
  int size;
  MPI_Request send_request;
  MPI_Request recv_request;
  MPI_Status status1, status2;

  MPI_Comm dupcomm1;
  MPI_Comm_dup(MPI_COMM_WORLD, &dupcomm1);

  MPI_Comm dupcomm2;
  MPI_Comm_dup(MPI_COMM_WORLD, &dupcomm2);

  MPI_Comm dupcomm3;
  MPI_Comm_dup(MPI_COMM_WORLD, &dupcomm3);

  MPI_Comm dupcomm4;
  MPI_Comm_dup(MPI_COMM_WORLD, &dupcomm4);

  MPI_Comm_rank(dupcomm4, &rank);
  MPI_Comm_size(dupcomm4, &size);

  /* Simulate MPIShmChain::HandshakeLinks_Start() */
  int sendToken = rank * 1000; /* 3000 on rank 3 */
  int recvToken = -1;

  if (rank < size - 1) /* send to next */
    MPI_Isend(&sendToken, 1, MPI_INT, rank + 1, 0, dupcomm4, &send_request);
  else /* send to 0 to close the loop */
    MPI_Isend(&sendToken, 1, MPI_INT, 0, 0, dupcomm4, &send_request);

  if (rank > 0) /* receive from previous */
    MPI_Irecv(&recvToken, 1, MPI_INT, rank - 1, 0, dupcomm4, &recv_request);
  else /* rank 0 receives from last */
    /* Here is the bug, "rank - 1" should be "size - 1", maybe a typo?
       Since rank is 0, "rank - 1" points to an invalid source -1, such that
       sendToken from last rank will not be received in this function, which
       will be received inside MPI_File_write_at_all() below instead:
       [rank = 3], read unexpected value at index 0, expected: 30, actual: 3000
     */
    MPI_Irecv(&recvToken, 1, MPI_INT, rank - 1, 0, dupcomm4, &recv_request);

  /* Simulate MPIShmChain::HandshakeLinks_Complete() */
  MPI_Waitall(1, &send_request, &status1);
  MPI_Waitall(1, &recv_request, &status2);

  MPI_Comm_free(&dupcomm1);
  MPI_Comm_free(&dupcomm2);
  MPI_Comm_free(&dupcomm3);
  MPI_Comm_free(&dupcomm4);
}

int main(int argc, char* argv[])
{
  int rank;
  const int element_per_pe = 5;
  int write_buffer[element_per_pe];
  int read_buffer[element_per_pe];
  int i;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  for (i = 0; i < element_per_pe; i++) {
    write_buffer[i] = rank * 10;
    read_buffer[i] = -1;
  }

  MPI_Comm dummy_comm1;
  MPI_Comm_dup(MPI_COMM_WORLD, &dummy_comm1);

  /* Simulation of ADIOS2 BP5 engine code start */
  simulate_ADIOS2_BP5_engine_code();
  /* Simulation of ADIOS2 BP5 engine code end */

  MPI_Comm_free(&dummy_comm1); /* This issue is not reproducible if we comment out this line */

  MPI_Comm dummy_comm2;
  MPI_Comm_dup(MPI_COMM_WORLD, &dummy_comm2);

  MPI_Comm io_comm1;
  MPI_Comm_dup(MPI_COMM_WORLD, &io_comm1);
  MPI_File fh1;
  MPI_File_open(io_comm1, "test_mpi_file1", MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh1);

  MPI_Comm io_comm2;
  MPI_Comm_dup(MPI_COMM_WORLD, &io_comm2);
  MPI_File fh2;
  MPI_Info info;
  MPI_Info_create(&info);
  MPI_Info_set(info, "romio_cb_write", "enable");
  MPI_File_open(io_comm2, "test_mpi_file2", MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh2);

  MPI_File_write_at_all(fh2, rank * element_per_pe * sizeof(int), write_buffer, element_per_pe, MPI_INT, MPI_STATUS_IGNORE);
  MPI_File_close(&fh2);
  MPI_Info_free(&info);
  MPI_Comm_free(&io_comm2);

  MPI_File_close(&fh1);
  MPI_Comm_free(&io_comm1);

  MPI_Comm_free(&dummy_comm2);

  MPI_File fh3;
  MPI_File_open(MPI_COMM_WORLD, "test_mpi_file2", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh3);
  MPI_File_read_at_all(fh3, rank * element_per_pe * sizeof(int), read_buffer, element_per_pe, MPI_INT, MPI_STATUS_IGNORE);
  for (i = 0; i < element_per_pe; i++) {
    if (read_buffer[i] != write_buffer[i])
      printf("[rank = %d], read unexpected value at index %d, expected: %d, actual: %d\n", rank, i, write_buffer[i], read_buffer[i]);
  }
  MPI_File_close(&fh3);

  MPI_Finalize();

  return 0;
}
hzhou commented 2 years ago

@dqwu Good job locating the bug!

but it will be helpful that newer versions of MPICH can detect an invalid source argument (e.g. -1) in MPI_Irecv and maybe give users a warning.

Unfortunately -1 is MPI_PROC_NULL in MPICH, and MPI_PROC_NULL is valid use by MPI standard.

dqwu commented 2 years ago

@dqwu Good job locating the bug!

but it will be helpful that newer versions of MPICH can detect an invalid source argument (e.g. -1) in MPI_Irecv and maybe give users a warning.

Unfortunately -1 is MPI_PROC_NULL in MPICH, and MPI_PROC_NULL is valid use by MPI standard.

Thanks for the information. As you can see, this bug is not easy to reproduce with MPICH 4.x. Actually, if we delete dupcomm1, dupcomm2, or dupcomm3 (any of them) in the latest test program, this bug cannot be reproduced (the ADIOS2 BP5 engine code creates a lot of MPI comms).

dqwu commented 2 years ago

@hzhou @roblatham00 This issue can be closed since it turns out to be a bug of ADIOS2 BP5 engine code. PS, it has been fixed on ADIOS2 master branch, with added error checks in Comm send/recv wrapper functions to throw an error in case of bad source/dest, see https://github.com/ornladios/ADIOS2/pull/3345

hzhou commented 2 years ago

@dqwu Thanks for the all-MPI reproducer. It did reveal an issue in mpich. What happens is -- there is a pending send message in dupcomm4 from simulate_ADIOS2_BP5_engine_code. It lingers because there is currently no mechanism for canceling sent messages. The later io_comm2 reuses the same internal context_id since dupcomm4 was freed already. The extra communicators in the reproducer are to make that context_id collision happen.

I am not sure how to fix this yet, but ideally traffic from different communicators should never interfere.

hzhou commented 2 years ago

Close this issue. The lingering send message issue is tracked in new issue #6184