Open frizwi opened 3 years ago
I just tried this again and it seems to be fixed in v4.1.x, but it still errors/hangs in v4.0.x and v3.1.4. I'll go ahead and update my application to the latest version, so I'm happy for this to be closed now.
Sorry, should've done a better job of version testing first before putting the issue in!
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v3.1.4 but have tried v4.1.0 as well - same result
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Built from source via tarball release
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
Please describe the system on which you are running
Ubuntu 18.04 LTS Single node
Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block like below:
Trying a very simple client/server mechanism exactly as per the MPI documentation.
The server calls MPI_Open_port(..), prints the port name to stdout, then waits in MPI_Comm_accept.
First launch the ompi-server, then launch the server as "mpirun -np 1 --ompi-server file:ompi.txt ./server beer"
In a separate shell, launch the client: "mpirun -np 1 --ompi-server file:ompi.txt ./client"
The issue is that MPI_Comm_connect hangs:
Here is the stack from gdb:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f4f02574ad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55ceec04e110) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
88 ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) bt
#0 0x00007f4f02574ad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55ceec04e110) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0x55ceec04e0a8, cond=0x55ceec04e0e8) at pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=0x55ceec04e0e8, mutex=0x55ceec04e0a8) at pthread_cond_wait.c:655
#3 0x00007f4f0066eb72 in OPAL_MCA_PMIX2X_PMIx_Connect (procs=0x55ceec04e7b0, nprocs=2, info=0x0, ninfo=0) at client/pmix_client_connect.c:102
#4 0x00007f4f006231c2 in pmix2x_connect (procs=0x7fff61344d40) at pmix2x_client.c:1346
#5 0x00007f4f037c5793 in ompi_dpm_connect_accept (comm=0x55ceeb4b2520, root=0, port_string=0x7fff61346157 "619642881.0:622259260",
#6 0x00007f4f0380cca3 in PMPI_Comm_connect (port_name=0x7fff61346157 "619642881.0:622259260", info=0x55ceeb4b2220, root=0,
#7 0x000055ceeb2b0b6d in main (argc=2, argv=0x7fff613459f8) at client.c:25
(gdb) up
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0x55ceec04e0a8, cond=0x55ceec04e0e8) at pthread_cond_wait.c:502
502 pthread_cond_wait.c: No such file or directory.
(gdb)
#2 __pthread_cond_wait (cond=0x55ceec04e0e8, mutex=0x55ceec04e0a8) at pthread_cond_wait.c:655
655 in pthread_cond_wait.c
(gdb)
#3 0x00007f4f0066eb72 in OPAL_MCA_PMIX2X_PMIx_Connect (procs=0x55ceec04e7b0, nprocs=2, info=0x0, ninfo=0) at client/pmix_client_connect.c:102
102 PMIX_WAIT_THREAD(&cb->lock);
(gdb)
#4 0x00007f4f006231c2 in pmix2x_connect (procs=0x7fff61344d40) at pmix2x_client.c:1346
1346 ret = PMIx_Connect(p, nprocs, NULL, 0);
The exact same code works fine in Open MPI v1.8.8 but hangs in v2.x, 3.x and 4.x. Has something changed? One thing I've noticed is that, on the same machine, the port name string looks very different: 1.8.8 produces a much longer string containing an IP address, a port number and "tcp", whereas 3.x produces something like "0000.0:00000". Is there some configure flag I'm missing?
Here is the exact code.
server.c:
And client.c: