open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

client/server mechanism broken? #9396

Open: frizwi opened this issue 3 years ago

frizwi commented 3 years ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v3.1.4 but have tried v4.1.0 as well - same result

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from source via tarball release

Please describe the system on which you are running

Ubuntu 18.04 LTS Single node


Details of the problem

I am trying a very simple client/server mechanism, exactly as described in the MPI documentation.

The server calls MPI_Open_port(), prints the port name to stdout, and then waits in MPI_Comm_accept.

First launch ompi-server, then launch the server as "mpirun -np 1 --ompi-server file:ompi.txt ./server beer".

In a separate shell, launch the client, passing it the port name printed by the server: "mpirun -np 1 --ompi-server file:ompi.txt ./client <port name>"
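For completeness, the full build-and-launch sequence looks like the following. This is a sketch assembled from the commands above: the ompi.txt URI file is assumed to be produced by ompi-server's --report-uri option, and the port name passed to the client is the one the server prints.

```shell
# Build the two test programs with Open MPI's compiler wrapper
shell$ mpicc -o server server.c
shell$ mpicc -o client client.c

# Terminal 1: start the standalone ompi-server, writing its URI to ompi.txt
shell$ ompi-server --report-uri ompi.txt

# Terminal 2: launch the server; it prints the port name from MPI_Open_port
shell$ mpirun -np 1 --ompi-server file:ompi.txt ./server beer

# Terminal 3: launch the client with the port name printed by the server
shell$ mpirun -np 1 --ompi-server file:ompi.txt ./client "<port name>"
```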

The issue is that MPI_Comm_connect hangs:

Here is the stack from gdb:

```
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f4f02574ad3 in futex_wait_cancelable (private=, expected=0, futex_word=0x55ceec04e110)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
88      ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) bt
#0  0x00007f4f02574ad3 in futex_wait_cancelable (private=, expected=0, futex_word=0x55ceec04e110)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x55ceec04e0a8, cond=0x55ceec04e0e8)
    at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x55ceec04e0e8, mutex=0x55ceec04e0a8) at pthread_cond_wait.c:655
#3  0x00007f4f0066eb72 in OPAL_MCA_PMIX2X_PMIx_Connect (procs=0x55ceec04e7b0, nprocs=2, info=0x0, ninfo=0)
    at client/pmix_client_connect.c:102
#4  0x00007f4f006231c2 in pmix2x_connect (procs=0x7fff61344d40) at pmix2x_client.c:1346
#5  0x00007f4f037c5793 in ompi_dpm_connect_accept (comm=0x55ceeb4b2520 <ompi_mpi_comm_self>, root=0,
    port_string=0x7fff61346157 "619642881.0:622259260", send_first=true, newcomm=0x7fff613450b0)
    at dpm/dpm.c:398
#6  0x00007f4f0380cca3 in PMPI_Comm_connect (port_name=0x7fff61346157 "619642881.0:622259260",
    info=0x55ceeb4b2220, root=0, comm=0x55ceeb4b2520 <ompi_mpi_comm_self>, newcomm=0x7fff613450f0)
    at pcomm_connect.c:109
#7  0x000055ceeb2b0b6d in main (argc=2, argv=0x7fff613459f8) at client.c:25
(gdb) up
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x55ceec04e0a8, cond=0x55ceec04e0e8)
    at pthread_cond_wait.c:502
502     pthread_cond_wait.c: No such file or directory.
(gdb) up
#2  __pthread_cond_wait (cond=0x55ceec04e0e8, mutex=0x55ceec04e0a8) at pthread_cond_wait.c:655
655     in pthread_cond_wait.c
(gdb) up
#3  0x00007f4f0066eb72 in OPAL_MCA_PMIX2X_PMIx_Connect (procs=0x55ceec04e7b0, nprocs=2, info=0x0, ninfo=0)
    at client/pmix_client_connect.c:102
102         PMIX_WAIT_THREAD(&cb->lock);
(gdb) up
#4  0x00007f4f006231c2 in pmix2x_connect (procs=0x7fff61344d40) at pmix2x_client.c:1346
1346        ret = PMIx_Connect(p, nprocs, NULL, 0);
```

The exact same code works fine in Open MPI v1.8.8 but hangs in v2.x, 3.x, and 4.x, so has something changed? One thing I've noticed is that, on the same machine, the port-name string looks very different: 1.8.8 produces a much longer string containing an IP address, a port number, and "tcp", whereas 3.x produces something like "0000.0:00000". Is there a configure flag I'm missing?

Here is the exact code.

server.c:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
  char myport[MPI_MAX_PORT_NAME];
  char *msg = NULL;
  MPI_Comm intercomm;
  int len = 0;

  /* Get message if given */
  if (argc > 1) {
    msg = argv[1];
    len = strlen(msg);
  }

  MPI_Init(&argc, &argv);

  MPI_Open_port(MPI_INFO_NULL, myport);

  printf("port name is: %s\n", myport);

  //printf("Publishing ...");
  //MPI_Publish_name("test", MPI_INFO_NULL, myport);
  //  printf("done!\n");
  MPI_Comm_accept(myport, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);

  /* do something with intercomm */
  printf("Sending msg ...");
  MPI_Send(&len, 1,   MPI_INT, 0, 0, intercomm);
  MPI_Send(msg,  len, MPI_CHAR, 0, 0, intercomm);
  printf(" ... done\n");

  MPI_Finalize();

  return 0;
}
```

And client.c:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Global message
int ARR[3];

int main(int argc, char *argv[])
{
  MPI_Comm intercomm;
  char *name = argv[1];
  char msg[1024];
  int msglen = 0;
  char myport[MPI_MAX_PORT_NAME];

  msg[0] = '\0';

  MPI_Init(&argc, &argv);

  strcpy(myport, name);
  // MPI_Lookup_name("test", MPI_INFO_NULL, myport);
  printf("Got port name as %s\n", myport);

  MPI_Comm_connect(name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
  printf("Waiting to recv ... \n");

  MPI_Recv(&msglen, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, intercomm, MPI_STATUS_IGNORE);
  printf("Got length = %d\n", msglen);

  MPI_Recv(msg, msglen, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG, intercomm, MPI_STATUS_IGNORE);
  msg[msglen] = '\0';

  printf("\nMessage:\n");
  printf("  %s\n", msg);
  printf("\n");

  MPI_Finalize();

  return 0;
}
```
frizwi commented 3 years ago

I just tried this again, and it seems to be fixed in v4.1.x, but it still errors/hangs in v4.0.x and v3.1.4. I'll go ahead and update my application to the latest version, so I'm happy for this to be closed now.

Sorry, I should've done a better job of version testing before filing the issue!