pmodels / mpich

Official MPICH Repository
http://www.mpich.org

Context ID exhaustion bug #1768

Closed mpichbot closed 7 years ago

mpichbot commented 7 years ago

Originally by dinan on 2012-12-14 11:24:17 -0600


Reported by Bob Cernohous @ IBM:

I've had a hang reported on BG/Q after about 2K MPI_Comm_create calls.

It hangs on the latest two releases (MPICH2 v1.5.x and v1.4.x) on BG/Q.

It also hangs on 64-bit Linux with the MPICH2 library (MPI over PAMI).

On an older MPICH 1.? (BG/P), it failed with 'too many communicators' and didn't hang, which is what they expected.

It seems to be stuck in the while (*context_id == 0) loop in commutil.c, repeatedly calling allreduce and never settling on a context ID. I didn't do much debugging, but it looks like this is in vanilla MPICH code, not something we modified.
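For readers unfamiliar with that loop, here is a rough sketch of the general idea as I understand it, not the actual commutil.c code: each rank holds a bitmask of free context IDs, the masks are combined with a bitwise-AND allreduce, and the lowest bit still set everywhere becomes the new ID. The sketch simulates the ranks in one process, so `band_allreduce` is a stand-in for `MPI_Allreduce` with `MPI_BAND`. All names here (`rank_state`, `alloc_context_id`, `count_allocations`) are made up for illustration, and the mask size of 2048 IDs is an assumption based on the "about 2K" in the report. The point is the early return: if the combined mask is empty, retrying cannot help, so the allocator reports exhaustion instead of looping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MASK_WORDS 64   /* assumption: 64 x 32 bits = 2048 context IDs */
#define MAX_RANKS  8    /* upper bound on simulated ranks */

/* Per-rank bitmask: a set bit means that context ID is free locally. */
typedef struct { uint32_t mask[MASK_WORDS]; } rank_state;

/* Mark every ID free, except ID 0, which is reserved to mean "none". */
void rank_init(rank_state *r) {
    for (size_t w = 0; w < MASK_WORDS; w++) r->mask[w] = UINT32_MAX;
    r->mask[0] &= ~UINT32_C(1);
}

/* Stand-in for MPI_Allreduce(..., MPI_BAND, ...): AND all ranks' masks. */
void band_allreduce(const rank_state *ranks, size_t n, uint32_t *out) {
    for (size_t w = 0; w < MASK_WORDS; w++) {
        uint32_t acc = UINT32_MAX;
        for (size_t r = 0; r < n; r++) acc &= ranks[r].mask[w];
        out[w] = acc;
    }
}

/* Lowest ID whose bit is set in the combined mask, or 0 if none is set. */
int lowest_free_id(const uint32_t *combined) {
    for (size_t w = 0; w < MASK_WORDS; w++) {
        if (combined[w] == 0) continue;
        for (int b = 0; b < 32; b++)
            if (combined[w] & (UINT32_C(1) << b)) return (int)(w * 32) + b;
    }
    return 0;
}

/* Allocate one ID on all ranks. Returns 0 when the IDs are exhausted,
 * so the caller can raise an error rather than call allreduce again. */
int alloc_context_id(rank_state *ranks, size_t n) {
    uint32_t combined[MASK_WORDS];
    band_allreduce(ranks, n, combined);
    int id = lowest_free_id(combined);
    if (id == 0) return 0;  /* empty mask: another round would also be empty */
    for (size_t r = 0; r < n; r++)
        ranks[r].mask[id / 32] &= ~(UINT32_C(1) << (id % 32));
    return id;
}

/* Allocate until exhaustion and return how many IDs were handed out. */
int count_allocations(size_t n_ranks) {
    rank_state ranks[MAX_RANKS];
    assert(n_ranks >= 1 && n_ranks <= MAX_RANKS);
    for (size_t r = 0; r < n_ranks; r++) rank_init(&ranks[r]);
    int count = 0;
    while (alloc_context_id(ranks, n_ranks) != 0) count++;
    return count;
}
```

Calling `count_allocations(2)` terminates once every bit has been cleared, which is the behavior the reporter expected (an error after roughly 2K communicators) rather than a hang. A real implementation also has to handle rounds where a rank contributes an empty mask for other reasons, which this sketch leaves out.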

ftmain.f90 fails if you run it on >2k ranks (creates one comm per rank). This was the original customer testcase.

ftmain2.f90 creates the communicators in a loop, so it fails on fewer ranks.

I just noticed that with --np 1, I get the 'too many communicators' error from ftmain2. But --np 2 and up hang.

stdout[0]:  check_newcomm do-start           0 , repeat         2045 , total        2046
stderr[0]: Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error in PMPI_Comm_create: Other MPI error, error stack:
stderr[0]: PMPI_Comm_create(609).........: MPI_Comm_create(MPI_COMM_WORLD, group=0xc80700f6, new_comm=0x1dbfffb520) failed
stderr[0]: PMPI_Comm_create(590).........:
stderr[0]: MPIR_Comm_create_intra(250)...:
stderr[0]: MPIR_Get_contextid(521).......:
stderr[0]: MPIR_Get_contextid_sparse(752): Too many communicators

mpichbot commented 7 years ago

Originally by dinan on 2012-12-14 11:24:30 -0600


Attachment added: ftmain.f90 (3.7 KiB) Test case #1

mpichbot commented 7 years ago

Originally by dinan on 2012-12-14 11:24:41 -0600


Attachment added: ftmain2.f90 (3.8 KiB) Test case #2

mpichbot commented 7 years ago

Originally by dinan on 2012-12-17 14:03:19 -0600


Resolved in [3c720d0887e5aab1a21ada789717d29795ccbd46].