Closed dqwu closed 2 years ago
@roblatham00 This is the simplest reproducer I have so far, which still requires linking to ADIOS2 lib. I will try to replace the ADIOS2 code in the test program with all-MPI code if I can.
Valgrind output on ANL GCE nodes (Ubuntu 20, GCC 9.4.0)
==662465== Syscall param pwrite64(buf) points to uninitialised byte(s)
==662465== at 0x489CD1F: __libc_pwrite64 (pwrite64.c:29)
==662465== by 0x489CD1F: pwrite (pwrite64.c:27)
==662465== by 0x6DA550B: ADIOI_GEN_WriteContig (ad_write.c:75)
==662465== by 0x6DA6C2F: ADIOI_Exch_and_write (ad_write_coll.c:477)
==662465== by 0x6DA5D07: ADIOI_GEN_WriteStridedColl (ad_write_coll.c:185)
==662465== by 0x6D7CD38: MPIOI_File_write_all (write_all.c:168)
==662465== by 0x6D7D5F4: PMPI_File_write_at_all (write_atall.c:71)
==662465== by 0x1733FF: main (test_adios_mpiio.c:55)
==662465== Address 0x12135080 is 64 bytes inside a block of size 16,777,216 alloc'd
==662465== at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==662465== by 0x6DB4286: MPL_malloc (mpl_trmem.h:280)
==662465== by 0x6DB4327: ADIOI_Malloc_fn (malloc.c:41)
==662465== by 0x6D9CB5A: ADIO_Open (ad_open.c:135)
==662465== by 0x6D774ED: PMPI_File_open (open.c:143)
==662465== by 0x1733C8: main (test_adios_mpiio.c:47)
@dqwu Could you try reproducing this with 4.0.x romio + 3.4.x mpich or vice versa? I think you can do this by manually swapping the src/mpi/romio folder in the source tree. This should narrow it down to whether ROMIO or MPICH contains the breaking changes.
Is the creation of fh1 critical in the reproducer?
Thanks for the suggestion, and I will test it later. Yes, the empty file created with fh1 is required to reproduce this issue.
@hzhou I tried to build MPICH 4.0.2 with ROMIO copied from MPICH 3.4.3, but there are some build errors:
...
make[3]: Entering directory '/scratch/wuda/source_install/mpich-4.0.2/src/mpi/romio'
CC mpi-io/close.lo
CC mpi-io/delete.lo
CC mpi-io/file_c2f.lo
CC mpi-io/file_f2c.lo
In file included from ./adio/include/adio.h:61:0,
from mpi-io/mpioimpl.h:13,
from mpi-io/delete.c:6:
./adio/include/romioconf.h:10:10: fatal error: mplconfig.h: No such file or directory
#include <mplconfig.h>
^~~~~~~~~~~~~
compilation terminated.
mplconfig.h can be found in the src directory, though.
find . -name 'mplconfig.h'
./src/mpl/include/mplconfig.h
Let's try some hacking --
$ cd src/mpi/romio
$ rm -rf mpl && cp -a ../../mpl .
$ make CPPFLAGS="-Impl/include -DMPL_FALLTHROUGH="
$ cd ../../..
$ make
This workaround works, but there are some linking errors:
lib/.libs/libmpi.so: undefined reference to `MPI_File_iwrite_c'
lib/.libs/libmpi.so: undefined reference to `MPI_File_iwrite_at_all_c'
...
lib/.libs/libmpi.so: undefined reference to `MPI_File_write_shared_c'
lib/.libs/libmpi.so: undefined reference toMakefile:14110: recipe for target 'src/env/mpichversion' failed
`make[2]: *** [src/env/mpichversion] Error 1
More hacking:
$ pushd src/glue/romio
$ ./all_romio_symbols ../../mpi/romio/include/mpio.h.in
$ popd
$ make
Thanks, I was able to build MPICH 4.0.2 with ROMIO of MPICH 3.4.3, and this issue is not reproducible.
I will try MPICH 3.4.3 with ROMIO of MPICH 4.0.2 later.
I guess that points the fault to the changes in ROMIO
I tried to build MPICH 3.4.3 with ROMIO of MPICH 4.0.2. The build errors are:
make[3]: Entering directory '/scratch/wuda/source_install/mpich-3.4.3/src/mpi/romio'
CC mpi-io/delete.lo
CC mpi-io/close.lo
CC mpi-io/file_c2f.lo
CC mpi-io/file_f2c.lo
In file included from /scratch/wuda/source_install/mpich-3.4.3/src/include/mpi.h:2225:0,
from ./adio/include/adio.h:65,
from mpi-io/mpioimpl.h:13,
from mpi-io/delete.c:6:
/scratch/wuda/source_install/mpich-3.4.3/src/include/mpio.h:340:49: error: unknown type name ‘MPI_Datarep_conversion_function_c’; did you mean ‘MPI_Datarep_conversion_function’?
int MPI_Register_datarep_c(const char *datarep, MPI_Datarep_conversion_function_c *read_conversion_fn,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
MPI_Datarep_conversion_function
...
Any hacking for this build issue?
Yeah, paste the following in src/include/mpi.h after the line #define MPI_CONVERSION_FN_NULL ((MPI_Datarep_conversion_function *)0):
typedef int (MPI_Datarep_conversion_function_c)(void *, MPI_Datatype, MPI_Count,
                                                void *, MPI_Offset, void *);
#define MPI_CONVERSION_FN_NULL_C ((MPI_Datarep_conversion_function_c *)0)
Thanks, this works to build MPICH 3.4.3 with ROMIO of MPICH 4.0.2.
Unfortunately, MPICH 3.4.3 with ROMIO of MPICH 4.0.2 cannot reproduce this issue.
In summary (tested on my laptop with Ubuntu 18 and GCC 7.4.0):
MPICH 4.0.2 + default ROMIO: reproducible
MPICH 4.0.2 + ROMIO of MPICH 3.4.3: not reproducible
MPICH 3.4.3 + default ROMIO: not reproducible
MPICH 3.4.3 + ROMIO of MPICH 4.0.2: not reproducible
@hzhou @roblatham00 Actually MPICH 4.0b1 cannot reproduce this issue but MPICH 4.0rc1 can.
MPICH 4.0b1 + default ROMIO: not reproducible
MPICH 4.0rc1 + default ROMIO: reproducible
MPICH 4.0b1 + ROMIO of MPICH 4.0rc1: reproducible
MPICH 4.0rc1 + ROMIO of MPICH 4.0b1: not reproducible
This points the fault to the changes in ROMIO (between 4.0b1 and 4.0rc1, number of different files: 16).
It seems that this issue is caused by PR #5660. For MPICH 4.0.2, if I revert the ADIOI_COLL_TAG macros (constant tag) back to myrank + i + 100 * iter or myrank + p + 100 * iter in ad_write_coll.c, this issue is not reproducible.
@hzhou @roblatham00 Finally I was able to find a potential bug inside ADIOS2 BP5 engine code:
void MPIShmChain::HandshakeLinks_Start(helper::Comm &comm, HandshakeStruct &hs)
{
    ...
    int rank = comm.Rank();
    hs.sendToken = rank;
    ...
    if (comm.Rank() > 0) // receive from previous
    {
        hs.recvRequest = comm.Irecv(
            &hs.recvToken, 1, rank - 1, 0,
            "Irecv handshake with neighbor, MPIChain aggregator, at Open");
    }
    else // rank 0 receives from last
    {
        hs.recvRequest = comm.Irecv(
            &hs.recvToken, 1, rank - 1, 0,
            "Irecv handshake with neighbor, MPIChain aggregator, at Open");
    }
    ...
}
For the 2nd comm.Irecv call, the argument "rank - 1" should be "comm.Size() - 1" (the source parameter of MPI_Irecv).
Because of this bug (maybe a typo), rank 0 never receives the integer sendToken from rank 3 (the value is set to 3 on rank 3). The message stays pending and is received later inside the ROMIO write function: for rank 3, the first int value written is always 3, which is the earlier sendToken from rank 3 that was lost inside the ADIOS2 BP5 engine code.
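One way to avoid the separate rank 0 branch altogether is to compute both ring neighbors with modular arithmetic. The sketch below is only an illustration: ring_prev and ring_next are hypothetical helper names, not ADIOS2 or MPI functions, and the actual ADIOS2 fix may look different.

```c
#include <assert.h>

/* Hypothetical helpers (not ADIOS2 or MPI API): neighbors in a closed ring. */

/* Rank that sends to `rank`; wraps around so rank 0 receives from size - 1. */
int ring_prev(int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    return (rank + size - 1) % size;
}

/* Rank that `rank` sends to; wraps around so the last rank sends to 0. */
int ring_next(int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    return (rank + 1) % size;
}
```

With 4 ranks, ring_prev(0, 4) is 3, whereas the buggy expression rank - 1 evaluates to -1 on rank 0.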
@hzhou @roblatham00 I now have a simple all-MPI test program that reproduces this issue. It is only reproducible with MPICH 4.0rc1 or later, where a constant tag 0 is used for collective write, because the ADIOS2 BP5 engine code and this test program also use tag 0 for MPI_Isend/MPI_Irecv.
I think this issue can be closed later, but it will be helpful that newer versions of MPICH can detect an invalid source argument (e.g. -1) in MPI_Irecv and maybe give users a warning.
#include <stdio.h>
#include <mpi.h>

void simulate_ADIOS2_BP5_engine_code()
{
    int rank;
    int size;
    MPI_Request send_request;
    MPI_Request recv_request;
    MPI_Status status1, status2;

    MPI_Comm dupcomm1;
    MPI_Comm_dup(MPI_COMM_WORLD, &dupcomm1);
    MPI_Comm dupcomm2;
    MPI_Comm_dup(MPI_COMM_WORLD, &dupcomm2);
    MPI_Comm dupcomm3;
    MPI_Comm_dup(MPI_COMM_WORLD, &dupcomm3);
    MPI_Comm dupcomm4;
    MPI_Comm_dup(MPI_COMM_WORLD, &dupcomm4);

    MPI_Comm_rank(dupcomm4, &rank);
    MPI_Comm_size(dupcomm4, &size);

    /* Simulate MPIShmChain::HandshakeLinks_Start() */
    int sendToken = rank * 1000; /* 3000 on rank 3 */
    int recvToken = -1;

    if (rank < size - 1) /* send to next */
        MPI_Isend(&sendToken, 1, MPI_INT, rank + 1, 0, dupcomm4, &send_request);
    else /* send to 0 to close the loop */
        MPI_Isend(&sendToken, 1, MPI_INT, 0, 0, dupcomm4, &send_request);

    if (rank > 0) /* receive from previous */
        MPI_Irecv(&recvToken, 1, MPI_INT, rank - 1, 0, dupcomm4, &recv_request);
    else /* rank 0 receives from last */
        /* Here is the bug, "rank - 1" should be "size - 1", maybe a typo?
           Since rank is 0, "rank - 1" points to an invalid source -1, such that
           sendToken from last rank will not be received in this function, which
           will be received inside MPI_File_write_at_all() below instead:
           [rank = 3], read unexpected value at index 0, expected: 30, actual: 3000
         */
        MPI_Irecv(&recvToken, 1, MPI_INT, rank - 1, 0, dupcomm4, &recv_request);

    /* Simulate MPIShmChain::HandshakeLinks_Complete() */
    MPI_Waitall(1, &send_request, &status1);
    MPI_Waitall(1, &recv_request, &status2);

    MPI_Comm_free(&dupcomm1);
    MPI_Comm_free(&dupcomm2);
    MPI_Comm_free(&dupcomm3);
    MPI_Comm_free(&dupcomm4);
}

int main(int argc, char* argv[])
{
    int rank;
    const int element_per_pe = 5;
    int write_buffer[element_per_pe];
    int read_buffer[element_per_pe];
    int i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < element_per_pe; i++) {
        write_buffer[i] = rank * 10;
        read_buffer[i] = -1;
    }

    MPI_Comm dummy_comm1;
    MPI_Comm_dup(MPI_COMM_WORLD, &dummy_comm1);

    /* Simulation of ADIOS2 BP5 engine code start */
    simulate_ADIOS2_BP5_engine_code();
    /* Simulation of ADIOS2 BP5 engine code end */

    MPI_Comm_free(&dummy_comm1); /* This issue is not reproducible if we comment out this line */

    MPI_Comm dummy_comm2;
    MPI_Comm_dup(MPI_COMM_WORLD, &dummy_comm2);

    MPI_Comm io_comm1;
    MPI_Comm_dup(MPI_COMM_WORLD, &io_comm1);
    MPI_File fh1;
    MPI_File_open(io_comm1, "test_mpi_file1", MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh1);

    MPI_Comm io_comm2;
    MPI_Comm_dup(MPI_COMM_WORLD, &io_comm2);
    MPI_File fh2;
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");
    MPI_File_open(io_comm2, "test_mpi_file2", MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh2);
    MPI_File_write_at_all(fh2, rank * element_per_pe * sizeof(int), write_buffer, element_per_pe, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh2);
    MPI_Info_free(&info);
    MPI_Comm_free(&io_comm2);

    MPI_File_close(&fh1);
    MPI_Comm_free(&io_comm1);
    MPI_Comm_free(&dummy_comm2);

    MPI_File fh3;
    MPI_File_open(MPI_COMM_WORLD, "test_mpi_file2", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh3);
    MPI_File_read_at_all(fh3, rank * element_per_pe * sizeof(int), read_buffer, element_per_pe, MPI_INT, MPI_STATUS_IGNORE);
    for (i = 0; i < element_per_pe; i++) {
        if (read_buffer[i] != write_buffer[i])
            printf("[rank = %d], read unexpected value at index %d, expected: %d, actual: %d\n", rank, i, write_buffer[i], read_buffer[i]);
    }
    MPI_File_close(&fh3);

    MPI_Finalize();
    return 0;
}
@dqwu Good job locating the bug!
but it will be helpful that newer versions of MPICH can detect an invalid source argument (e.g. -1) in MPI_Irecv and maybe give users a warning.
Unfortunately, -1 is MPI_PROC_NULL in MPICH, and MPI_PROC_NULL is a valid argument according to the MPI standard.
Thanks for the information. As you can see, this bug is not easy to reproduce with MPICH 4.x. Actually, if we delete dupcomm1, dupcomm2, or dupcomm3 (any of them) in the latest test program, this bug cannot be reproduced (the ADIOS2 BP5 engine code creates a lot of MPI comms).
@hzhou @roblatham00 This issue can be closed since it turns out to be a bug of ADIOS2 BP5 engine code. PS, it has been fixed on ADIOS2 master branch, with added error checks in Comm send/recv wrapper functions to throw an error in case of bad source/dest, see https://github.com/ornladios/ADIOS2/pull/3345
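Since MPI itself accepts -1 as MPI_PROC_NULL in MPICH, a check of this kind has to live in the application's own wrapper. The following is a rough sketch of the idea only, not the code from the ADIOS2 pull request; is_explicit_peer and require_explicit_peer are hypothetical names. It accepts only ranks in [0, comm_size), so it deliberately rejects every negative sentinel and only suits call sites that never pass a null or wildcard peer on purpose.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical check: true only for an explicit rank inside the communicator. */
bool is_explicit_peer(int peer, int comm_size)
{
    return peer >= 0 && peer < comm_size;
}

/* Hypothetical guard: abort early (in builds with assertions enabled)
 * instead of posting a send or receive to a peer that is out of range. */
int require_explicit_peer(int peer, int comm_size)
{
    assert(is_explicit_peer(peer, comm_size));
    return peer;
}
```

A wrapper around a receive call could pass its source argument through require_explicit_peer before handing it to MPI, which would have flagged the rank - 1 value on rank 0.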
@dqwu Thanks for the all-MPI reproducer. It did reveal an issue in MPICH. What happens is this: there is a pending send message in dupcomm4 from simulate_ADIOS2_BP5_engine_code. It lingers because there is currently no mechanism for canceling sent messages. The later io_comm2 reuses the same internal context_id since dupcomm4 was freed already. The extra communicators in the reproducer are to make that context_id collision happen.
I am not sure how to fix this yet, but ideally traffic from different communicators should never interfere.
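The collision described above can be illustrated with a toy model. This is a sketch under stated assumptions, not MPICH's real allocator or matching code: the pool hands out the lowest free id (an assumption), messages match on (context_id, source, tag) only, and all names here (id_pool, envelope, old_coll_tag) are made up for the example. The old tag formula is the one quoted earlier in this thread.

```c
#include <assert.h>
#include <stdbool.h>

#define POOL_SIZE 8

/* Toy pool of context ids; assumes the lowest free id is always reused. */
typedef struct { bool in_use[POOL_SIZE]; } id_pool;

int pool_alloc(id_pool *pool)
{
    for (int id = 0; id < POOL_SIZE; id++) {
        if (!pool->in_use[id]) {
            pool->in_use[id] = true;
            return id;
        }
    }
    return -1; /* pool exhausted */
}

void pool_free(id_pool *pool, int id)
{
    assert(id >= 0 && id < POOL_SIZE);
    pool->in_use[id] = false;
}

/* Toy message envelope: the fields a receive is matched on in this model. */
typedef struct { int context_id; int source; int tag; } envelope;

bool envelope_matches(envelope posted_recv, envelope pending_send)
{
    return posted_recv.context_id == pending_send.context_id
        && posted_recv.source == pending_send.source
        && posted_recv.tag == pending_send.tag;
}

/* Rank-dependent tag quoted in this thread for the older collective write. */
int old_coll_tag(int myrank, int peer, int iter)
{
    return myrank + peer + 100 * iter;
}
```

In this model, if the handshake communicator's id is freed and then handed to the I/O communicator, a lingering send from rank 3 with tag 0 matches a receive posted on rank 0 for source 3 with constant tag 0. With the older formula the receive would carry tag old_coll_tag(0, 3, 0), which is 3, so the stale message would not match, consistent with the observation that reverting the tags hides the problem.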
Close this issue. The lingering send message issue is tracked in new issue #6184
This issue is only reproducible with MPICH 4.x (not reproducible with MPICH 3.x or OpenMPI). Below are detailed steps to reproduce it.
[Build and install MPICH 4.0.2]
[Add installed MPICH to $PATH]
export PATH=/path/to/mpich/installation/bin:$PATH
[Build and install ADIOS2 2.8.3]
[Build and run the test program]
[Example output]
Content of the test program (test_adios_mpiio.c)