pmodels / mpich

Official MPICH Repository
http://www.mpich.org

bug: occasional assertion error in thread/comm/idup_nb #6854

Open nikhilnanal opened 11 months ago

nikhilnanal commented 11 months ago

I occasionally see the following error while running the idup_nb test (roughly 1 in 5 runs):

Assertion failed in file src/mpi/comm/contextid.c at line 239: mask[idx] & (1U << bitpos)
../middlewares/mpich_mpichtest/lib/libmpi.so.12(+0x541a76) [0x7f956021da76]
../middlewares/mpich_mpichtest/lib/libmpi.so.12(+0x44f884) [0x7f956012b884]
../middlewares/mpich_mpichtest/lib/libmpi.so.12(+0x3c98f0) [0x7f95600a58f0]
../middlewares/mpich_mpichtest/lib/libmpi.so.12(+0x3c9a04) [0x7f95600a5a04]
../middlewares/mpich_mpichtest/lib/libmpi.so.12(+0x52da8c) [0x7f9560209a8c]
../middlewares/mpich_mpichtest/lib/libmpi.so.12(+0x530dee) [0x7f956020cdee]
../middlewares/mpich_mpichtest/lib/libmpi.so.12(+0x3cb747) [0x7f95600a7747]
../middlewares/mpich_mpichtest/lib/libmpi.so.12(+0x3b4b02) [0x7f9560090b02]
../mpich_mpichtest/lib/libmpi.so.12(MPI_Comm_idup+0x212) [0x7f955fd7f822]
./idup_nb() [0x40264b]
/lib64/libpthread.so.0(+0x81cf) [0x7f95628071cf]
/lib64/libc.so.6(clone+0x43) [0x7f955f94fe73]
Abort(1) on node 1: Internal error

raffenet commented 11 months ago

What is the configuration? Single node or multinode?

nikhilnanal commented 11 months ago

It's a single-node configuration (mpiexec -n 4 ./idup_nb).

raffenet commented 11 months ago

What about the MPICH build configuration?

nikhilnanal commented 11 months ago

MPICH build configuration: ./configure --prefix="path to mpich installation" --with-libfabric="path to libfabric build" --disable-fortran --with-device=ch4:ofi --without-ze

nikhilnanal commented 11 months ago

https://github.com/pmodels/mpich/issues/3794, which looks like a similar issue, mentions setting --enable-posix-mutex=ticketlock. Is this an MPICH configure option? I don't see this in the configure help.

nikhilnanal commented 10 months ago

Any suggestions on this issue?

nikhilnanal commented 10 months ago

Sometimes I get the assertion mentioned above; other times I get this error instead:

Abort(1) on node 1: In MPIR_Free_contextid, the context id is not in use (Internal MPI error!)
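
For readers following along, the two failures look like mirror-image consistency checks on a bitmask of context IDs. The sketch below is only an illustration, not MPICH's code: it assumes that a set bit marks a free ID and a cleared bit marks an ID in use, which the thread does not state. All names (`sketch_mask_init`, `sketch_alloc_bit`, `sketch_free_bit`, the `SKETCH_*` constants) and the mask size are hypothetical. Only the condition `mask[idx] & (1U << bitpos)` is taken from the assertion message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sizes and names for illustration; not MPICH's definitions. */
#define SKETCH_INT_BITS 32
#define SKETCH_MASK_WORDS 4
#define SKETCH_MAX_IDS (SKETCH_INT_BITS * SKETCH_MASK_WORDS)

enum sketch_status {
    SKETCH_OK = 0,
    SKETCH_ALREADY_ALLOCATED = 1,   /* analogous to the failed assertion */
    SKETCH_NOT_IN_USE = 2           /* analogous to "context id is not in use" */
};

/* Assumption: a set bit means "free", a cleared bit means "in use". */
static void sketch_mask_init(uint32_t mask[SKETCH_MASK_WORDS])
{
    for (int i = 0; i < SKETCH_MASK_WORDS; i++)
        mask[i] = 0xFFFFFFFFu;
}

/* Mark an ID as in use; report an error if its bit is already cleared. */
static enum sketch_status sketch_alloc_bit(uint32_t mask[SKETCH_MASK_WORDS], int id)
{
    assert(id >= 0 && id < SKETCH_MAX_IDS);
    int idx = id / SKETCH_INT_BITS;
    int bitpos = id % SKETCH_INT_BITS;

    /* Same condition as in the assertion message: the bit must still be set. */
    if (!(mask[idx] & (1U << bitpos)))
        return SKETCH_ALREADY_ALLOCATED;

    mask[idx] &= ~(1U << bitpos);
    return SKETCH_OK;
}

/* Mark an ID as free again; report an error if its bit is already set. */
static enum sketch_status sketch_free_bit(uint32_t mask[SKETCH_MASK_WORDS], int id)
{
    assert(id >= 0 && id < SKETCH_MAX_IDS);
    int idx = id / SKETCH_INT_BITS;
    int bitpos = id % SKETCH_INT_BITS;

    if (mask[idx] & (1U << bitpos))
        return SKETCH_NOT_IN_USE;

    mask[idx] |= (1U << bitpos);
    return SKETCH_OK;
}
```

Under these assumptions, allocating the same ID twice returns `SKETCH_ALREADY_ALLOCATED`, and freeing it twice returns `SKETCH_NOT_IN_USE`. If two threads ended up claiming or releasing the same ID, either symptom could show up depending on timing, but whether that is what actually happens in idup_nb is not established here.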

raffenet commented 10 months ago

Sorry for the lack of update. What you are seeing looks to be a bug in the thread-safe idup implementation. We will look into it.

nikhilnanal commented 1 month ago

Hi, is there any update on this issue?

raffenet commented 1 month ago

> Hi, is there any update on this issue?

Sorry, no update yet.

> --enable-posix-mutex=ticketlock. Is this an MPICH configure option? I don't see this in the configure help.

This option gets passed to an internal convenience library that provides thread safety features for MPICH. You can try adding it to your build configuration and see if it makes a difference.