open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Segfault in MPI_Barrier #5478

Status: Open. elliottslaughter opened 5 years ago

elliottslaughter commented 5 years ago

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v3.0.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From the module system on Sherlock 2.0: module load openmpi/3.0.1. If you tell me what to ask for, I can request more details from support.

Please describe the system on which you are running


Details of the problem

Reproducer here: https://github.com/elliottslaughter/mpi-crash-reproducer/blob/master/test.cc

git clone git@github.com:elliottslaughter/mpi-crash-reproducer.git
cd mpi-crash-reproducer
./build.sh
sbatch run.sh
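
For context, the test itself is essentially an MPI_Barrier stress loop; a minimal sketch of that shape is below, shown only for illustration (the actual test.cc in the repository above is the authoritative version and may differ in its details):

// Minimal sketch of an MPI_Barrier stress test; for illustration only --
// see test.cc in the linked repository for the real code.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // Repeatedly synchronize all ranks; the job is expected to run until the
    // batch time limit rather than exit on its own.
    for (long iter = 0; ; ++iter) {
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0 && iter % 1000 == 0) {
            std::printf("barrier iteration %ld\n", iter);
            std::fflush(stdout);
        }
    }
    MPI_Finalize();  // not reached; the job ends at the SLURM time limit
    return 0;
}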

I expect this script to run indefinitely (until the job times out), but it crashes non-deterministically (roughly once in every 20 iterations) with errors like:

srun: error: sh-105-25: task 0: Segmentation fault
[sh-114-04.int:16904] too many retries sending message to 0x0140:0x0003fd7c, giving up

and:

[sh-105-25][[46211,0],8][../../../../../opal/mca/btl/openib/connect/btl_openib_connect_udcm.c:1521:udcm_find_endpoint] could not find endpoint with port: 1, lid: 334, msg_type: 100
[sh-105-25][[46211,0],8][../../../../../opal/mca/btl/openib/connect/btl_openib_connect_udcm.c:2009:udcm_process_messages] could not find associated endpoint.
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[46211,0],8]) is on host: sh-105-25
  Process 2 ([[46211,0],24]) is on host: unknown!
  BTLs attempted: self tcp openib smcuda vader

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[sh-105-25:124053] *** An error occurred in MPI_Barrier
[sh-105-25:124053] *** reported by process [3028484096,8]
[sh-105-25:124053] *** on communicator MPI_COMM_WORLD
[sh-105-25:124053] *** MPI_ERR_INTERN: internal error
[sh-105-25:124053] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[sh-105-25:124053] ***    and potentially your MPI job)
gpaulsen commented 5 years ago

Could you please ask support if they can send the output of their Open MPI v3.0.1 configure command? Sometimes this can be found in a file called config.log.
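
If config.log is not available, the output of ompi_info should also record the configure line; for example, something like the following (exact output format varies by version):

# ompi_info prints build information, including the configure command line:
ompi_info --all | grep -i configure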

Also it appears that you're launching with SLURM's srun. Can you please report the version of SLURM running on the cluster?

Thanks

elliottslaughter commented 5 years ago

SLURM is on version 17.11.8. I will ask support about the Open MPI build logs.

elliottslaughter commented 5 years ago

Also, the bug does not appear to reproduce with Open MPI 2.1.3 on the same system. (Again, using a copy of Open MPI taken from the module system. I will ask for the configure log for that too.)

elliottslaughter commented 5 years ago

Unfortunately, they didn't keep the build directory, but I was able to get the output of the ompi_info command, in case it's helpful:

v3.0.1: https://github.com/elliottslaughter/mpi-crash-reproducer/blob/master/ompi_info_3.0.1.txt

It does at least contain the configure line.

elliottslaughter commented 5 years ago

Actually, the bug does occur with Open MPI 2.1.1 as well. The reason it didn't happen with 2.1.3 is that that build was misconfigured and didn't have SLURM integration enabled.

Configuration for 2.1.1: https://github.com/elliottslaughter/mpi-crash-reproducer/blob/master/ompi_info_2.1.1.txt

With 2.1.1, I'm getting this about once in every 30 runs or so:

[sh-105-27][[49718,0],52][../../../../../opal/mca/btl/openib/btl_openib_proc.c:330:mca_btl_openib_proc_get_locked] 52: error exit from mca_btl_openib_proc_create
[sh-105-27][[49718,0],52][../../../../../opal/mca/btl/openib/connect/btl_openib_connect_udcm.c:1521:udcm_find_endpoint] could not find endpoint with port: 1, lid: 319, 
msg_type: 100
[sh-105-27][[49718,0],52][../../../../../opal/mca/btl/openib/connect/btl_openib_connect_udcm.c:2009:udcm_process_messages] could not find associated endpoint.
[sh-105-27][[49718,0],52][../../../../../opal/mca/btl/openib/btl_openib_proc.c:330:mca_btl_openib_proc_get_locked] 52: error exit from mca_btl_openib_proc_create
[sh-105-27][[49718,0],52][../../../../../opal/mca/btl/openib/btl_openib_proc.c:330:mca_btl_openib_proc_get_locked] 52: error exit from mca_btl_openib_proc_create
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[49718,0],52]) is on host: sh-105-27
  Process 2 ([[49718,0],68]) is on host: unknown!
  BTLs attempted: self tcp openib sm vader

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[sh-105-27:50008] *** An error occurred in MPI_Barrier
[sh-105-27:50008] *** reported by process [3258318848,52]
[sh-105-27:50008] *** on communicator MPI_COMM_WORLD
[sh-105-27:50008] *** MPI_ERR_INTERN: internal error
[sh-105-27:50008] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[sh-105-27:50008] ***    and potentially your MPI job)
slurmstepd: error: *** STEP 22725174.29 ON sh-105-25 CANCELLED AT 2018-07-26T13:30:37 ***
jsquyres commented 5 years ago

The first error message you cited, "too many retries sending message to 0x0140:0x0003fd7c, giving up" is here: https://github.com/open-mpi/ompi/blob/master/opal/mca/btl/openib/connect/btl_openib_connect_udcm.c#L2313.

When this is emitted, it means that Open MPI tried (by default) 25 times to send a message and finally gave up. This suggests that your network fabric is either extraordinarily congested or is experiencing errors. Granted, Open MPI was trying to send an unreliable datagram, so there's no guarantee that it will arrive. But if it tried and failed 25 times... that's a bit concerning.

Yes, Open MPI shouldn't be segv'ing, but it's likely that this is an error path that is not well tested (i.e., propagating the "retry failed / giving up" error back up the stack is causing the segv somewhere).

There are a few things you can try:

  1. Check the fabric for layer 0 and layer 1 errors, i.e., run IB diagnostics and see how healthy the fabric is (I assume you're running over InfiniBand, right?).

  2. Be aware that we're on an eventual path to delete the openib BTL (probably in Open MPI v5.0 -- sometime in 2019 -- see https://github.com/open-mpi/ompi/wiki/5.0.x-FeatureList). If you are, indeed, running InfiniBand, Mellanox recommends using UCX these days. If you can run Open MPI with UCX support, that is Mellanox's supported configuration.

  3. If you need to stay with the openib BTL for now, you can try increasing the connection retry count: mpirun --mca btl_openib_connect_udcm_max_retry 1000 ..., which increases the retry count from 25 to 1000 (YMMV on the exact value).

    • Note that this retry value is only used for the initial connection between MPI processes. It does not affect regular MPI communications. Meaning: it's potentially not a large penalty to increase this value.
  4. You could also tell the openib BTL to not use UD for IB connections: mpirun --mca btl_openib_cpc_include rdmacm, but this presumes that you have the RDMA connection manager infrastructure running (which you might, just by OFED/MOFED defaults...? It's been a long time since I've worked with IB, so I don't know offhand). Example command lines for these suggestions are sketched after this list.
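
For concreteness, here are some hypothetical command lines for the suggestions above; tool names, parameter values, and availability depend on your OFED/SLURM installation, so treat them as a sketch rather than a recipe:

# 1. Basic InfiniBand fabric health checks (tools depend on the OFED install):
ibstat        # local HCA port state and error counters
ibdiagnet     # fabric-wide diagnostics, if installed

# 3./4. When launching with srun instead of mpirun, MCA parameters can be
# passed through the environment as OMPI_MCA_<param>:
export OMPI_MCA_btl_openib_connect_udcm_max_retry=1000
# or, to use the RDMA connection manager instead of UD for connection setup:
export OMPI_MCA_btl_openib_cpc_include=rdmacm
srun -n 80 ./test

# 2. If (and only if) your Open MPI build has UCX support, the UCX PML is the
# Mellanox-recommended path:
mpirun --mca pml ucx -n 80 ./test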

elliottslaughter commented 5 years ago

Strangely, the problem appears to go away if I change from using srun to:

mpirun -n 80 -map-by core -bind-to core ./test

Maybe this indicates a SLURM bug? Or at least a bug in how srun is configuring MPI?

Can you suggest any debugging tips I could use to diagnose the issue further?

jsquyres commented 5 years ago

@rhc54 and I talked about this. Your issue sounds familiar to us, but we can't remember any details (i.e., launching with srun causes intermittent errors on IB, but launching with mpirun seems ok). The environment delivered to the MPI processes is slightly different between srun and mpirun, but I don't think we ever fully tracked down this issue -- it's darn weird.

Ralph did notice that when the problem occurs (for srun), we don't know the name of the peer host. Which is... odd. It could be a timing issue in the exchange of information during startup...? We're guessing/assuming you're using PMI-1 from SLURM; it might be a racy collective in that plugin...?

You might want to try the PMI-2 plugin, or even better, the PMIx plugin. I don't know the exact option -- check the srun(1) man page -- but perhaps it's something like --mpi=pmi2 and/or --mpi=pmix...?
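
For example (these are sketches; the plugin names actually available depend on how SLURM was built):

# Show which MPI/PMI plugin types this SLURM build supports:
srun --mpi=list
# Then select one explicitly for the job step, e.g.:
srun --mpi=pmi2 -n 80 ./test
srun --mpi=pmix -n 80 ./test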

elliottslaughter commented 5 years ago

With Open MPI 3.0.1 and --mpi=pmi2 I still see errors like the one below. I can't seem to run with --mpi=pmix; maybe SLURM wasn't built with the appropriate configure flag. Without root access on the machine I'm not sure that's something I can change.

Would it help to compare the environment under srun with the environment under mpirun? (A sketch of how I would capture the two is at the end of this comment, after the error output.)

[sh-114-04][[45240,0],61][../../../../../opal/mca/btl/openib/connect/btl_openib_connect_udcm.c:1521:udcm_find_endpoint] could not find endpoint with port: 1, lid: 322, msg_type: 100
[sh-114-04][[45240,0],61][../../../../../opal/mca/btl/openib/connect/btl_openib_connect_udcm.c:2009:udcm_process_messages] could not find associated endpoint.
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[45240,0],61]) is on host: sh-114-04
  Process 2 ([[45240,0],13]) is on host: unknown!
  BTLs attempted: self tcp openib smcuda vader

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[sh-114-04:66005] *** An error occurred in MPI_Barrier
[sh-114-04:66005] *** reported by process [2964848640,61]
[sh-114-04:66005] *** on communicator MPI_COMM_WORLD
[sh-114-04:66005] *** MPI_ERR_INTERN: internal error
[sh-114-04:66005] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[sh-114-04:66005] ***    and potentially your MPI job)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 23310520.98 ON sh-105-27 CANCELLED AT 2018-08-08T10:34:56 ***
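
In case it's useful, here is roughly how I would capture the two environments for comparison (file names are just placeholders):

# Capture the environment seen by a rank under each launcher, then diff them:
srun -n 1 env | sort > env-srun.txt
mpirun -n 1 env | sort > env-mpirun.txt
diff env-srun.txt env-mpirun.txt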