open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

An error occurred in MPI_Init (seems the network initialization has some bugs) #10601

JiajunHuang1999 opened this issue 2 years ago

JiajunHuang1999 commented 2 years ago

There seem to be some bugs in Open MPI 4.1.4 when it is used with Slurm 17.11.7 and Intel Omni-Path. My CPUs are Intel Xeon E5-2695v4 (Broadwell nodes with 15 GB of /scratch). When I run srun --mpi=pmi2 -n 37 -p bdw ./my_program, the following errors appear:

[bdw-0165:11363] *** An error occurred in MPI_Init
[bdw-0165:11363] *** reported by process [1958739981,3]
[bdw-0165:11363] *** on a NULL communicator
[bdw-0165:11363] *** Unknown error
[bdw-0165:11363] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[bdw-0165:11363] ***    and potentially your MPI job)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[bdw-0165:11364] *** An error occurred in MPI_Init
[bdw-0165:11364] *** reported by process [18446744071373324301,4]
[bdw-0165:11364] *** on a NULL communicator
[bdw-0165:11364] *** Unknown error
[bdw-0165:11364] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[bdw-0165:11364] ***    and potentially your MPI job)
[bdw-0165:11366] *** An error occurred in MPI_Init
[bdw-0165:11366] *** reported by process [18446744071373324301,6]
[bdw-0165:11366] *** on a NULL communicator
[bdw-0165:11366] *** Unknown error
[bdw-0165:11366] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[bdw-0165:11366] ***    and potentially your MPI job)
[bdw-0165:11368] *** An error occurred in MPI_Init
[bdw-0165:11368] *** reported by process [18446744071373324301,8]
[bdw-0165:11368] *** on a NULL communicator
[bdw-0165:11368] *** Unknown error
[bdw-0165:11368] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[bdw-0165:11368] ***    and potentially your MPI job)
bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy
bdw-0165.11364hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (1/3)
bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy
bdw-0165.11364hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (2/3)
bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy
bdw-0165.11364hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (3/3)
bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy
slurmstepd: error: *** STEP 2520256.13 ON bdw-0165 CANCELLED AT 2022-07-23T18:15:04 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: bdw-0165: tasks 0-4,6,8: Exited with exit code 1
srun: error: bdw-0165: tasks 5,7,9-18: Killed
srun: error: bdw-0175: tasks 19-36: Killed

I searched the internet and found that someone has already reported a similar problem when using Broadwell nodes, InfiniBand, Slurm 18.08.3, and Open MPI. Here is the link: https://bugs.schedmd.com/show_bug.cgi?id=5956. The last message in that ticket says they were contacting the Open MPI team about it, but I doubt the bug has been fixed. The job does work with 36 processes; I think that is because up to 36 processes all fit on a single node, since each dual-socket Intel Xeon E5-2695v4 node has 36 cores.
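
Since the failure happens inside MPI_Init itself, a do-nothing MPI program should be enough to reproduce it; something along these lines (a sketch only: the file name is arbitrary, and it assumes mpicc and the same srun invocation as above):

cat > mpi_init_only.c <<'EOF'
#include <mpi.h>

/* Do-nothing MPI program: if this also fails at 37 ranks (i.e. once the job
   spans two nodes), the problem is in MPI/network initialization rather than
   in the application itself. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;
}
EOF
mpicc mpi_init_only.c -o mpi_init_only
srun --mpi=pmi2 -n 37 -p bdw ./mpi_init_only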

jsquyres commented 2 years ago

This seems to be the relevant part:

bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy
bdw-0165.11364hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (1/3)
bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy
bdw-0165.11364hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (2/3)
bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy
bdw-0165.11364hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (3/3)
bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy

@BrendanCunningham Is Omni-Path your purview?

BrendanCunningham commented 2 years ago

@BrendanCunningham Is Omni-Path your purview?

Yes, this is mine. Sorry I didn't notice it before. Will take a look.

BrendanCunningham commented 2 years ago

@JiajunHuang1999 if your job does not require the Open MPI OFI BTL or internode one-sided communication, can you try your job with -mca btl self,vader and report whether that works?
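
For example (a sketch; the OMPI_MCA_* environment-variable form is the standard way to pass MCA parameters when launching directly with srun rather than mpirun):

# Equivalent of "-mca btl self,vader" for an srun launch:
export OMPI_MCA_btl=self,vader
srun --mpi=pmi2 -n 37 -p bdw ./my_program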

bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy
bdw-0165.11364hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (1/3)
bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy

This message typically occurs when the job requires more HFI contexts per node than are available. By default, there is one HFI context per physical CPU core. Multiple PSM2 processes are able to share one context, allowing more ranks than there are physical CPU cores, but PSM2 context sharing limits each process to one context.

Open MPI v4.1.x built with OFI libfabric support uses the OFI BTL by default. The OFI BTL opens an additional HFI context per process. This disables PSM2 context sharing and can cause jobs to fail like this.
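
One thing that may also be worth trying (a sketch, not a confirmed fix): keep the cm PML with the PSM2 MTL for Omni-Path traffic and exclude the OFI BTL so it does not open the extra context per process. This assumes your Open MPI build includes the psm2 MTL (you can check with ompi_info | grep psm2):

# Exclude the OFI BTL (so it does not open an extra HFI context per rank)
# and use the cm PML + PSM2 MTL for Omni-Path traffic:
mpirun -n 37 -mca pml cm -mca mtl psm2 -mca btl ^ofi ./my_program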

JiajunHuang1999 commented 2 years ago

@BrendanCunningham I see, I will try this during the week and give you feedback! Thanks for your quick help!

JiajunHuang1999 commented 2 years ago

@BrendanCunningham Hi, when I run the job I sometimes still receive the following error message, so I suppose the problem is not fully resolved.

mpirun -n 64 -mca btl self,vader ./collective_OMPI 16777216
[1]    17538 killed     mpirun -n 64 -mca btl self,vader ./collective_OMPI 16777216

Thanks.

BrendanCunningham commented 2 years ago

@JiajunHuang1999 is there any other output indicating what killed the job? Also, what does the 16777216 argument to the job do? Does the job run with 64 ranks if you change the specific job command? Thanks.

JiajunHuang1999 commented 2 years ago

@BrendanCunningham No, it just gets killed. 16777216 is the MPI_FLOAT input array size. I just run the job with 64 ranks and no special configuration. Thanks.

BrendanCunningham commented 2 years ago

@JiajunHuang1999 any updates on this? Any core files produced or output in dmesg or other system logs showing events around the time the job fails?

I ask about dmesg because in the past, when I've seen jobs die like this without any error output from MPI or the ranks, it has happened when a rank has been killed by oom-killer, the out-of-memory handler.
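
For reference, 16777216 MPI_FLOAT elements is 64 MiB per buffer (16777216 x 4 bytes), so with 64 ranks each holding at least one such buffer, plus whatever the collective allocates internally, memory use can add up quickly. A quick way to check for oom-killer activity (the exact grep pattern is only a suggestion, since kernel message wording varies):

# Run on the node(s) where the job ran, shortly after the failure:
dmesg -T | grep -i -E "out of memory|oom-killer|killed process"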

JiajunHuang1999 commented 2 years ago

@BrendanCunningham Still not solved. I will do the experiments again and send the files to you if there are any.

JiajunHuang1999 commented 2 years ago

@BrendanCunningham It seems the same problem appeared this time. The behavior is a little random: some tests succeeded on the first run but failed on the second. The command was:

mpirun -n 64 -N 32 -mca btl self,vader ./collective_OMPI 32768

bdw-0214.30734PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
PSM2 was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.

  Error: Failure in initializing endpoint
--------------------------------------------------------------------------
bdw-0214.30740PSM2 can't open hfi unit: -1 (err=23)
bdw-0213.13967PSM2 can't open hfi unit: -1 (err=23)
bdw-0213.13966PSM2 can't open hfi unit: -1 (err=23)
bdw-0213.13972PSM2 can't open hfi unit: -1 (err=23)
bdw-0213.13965PSM2 can't open hfi unit: -1 (err=23)
bdw-0214.30735PSM2 can't open hfi unit: -1 (err=23)
......
bdw-0213.14030PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[bdw-0213:13994] *** An error occurred in MPI_Init
[bdw-0213:13994] *** reported by process [2870149121,23]
[bdw-0213:13994] *** on a NULL communicator
[bdw-0213:13994] *** Unknown error
[bdw-0213:13994] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[bdw-0213:13994] ***    and potentially your MPI job)
[beboplogin3:12682] 111 more processes have sent help message help-mtl-psm2.txt / unable to open endpoint
[beboplogin3:12682] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[beboplogin3:12682] 63 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[beboplogin3:12682] 34 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle

BrendanCunningham commented 2 years ago

@JiajunHuang1999

I think we're going to need more information to help you with this problem, specifically an opacapture, which will gather the fabric configuration and software versions.

To facilitate data requests and file sharing, we can use our support system. To do that, open a support case by sending an email to support@cornelisnetworks.com with the subject "An error occurred in MPI_Init (seems the network initialization has some bugs) #10601". Support will provide instructions on what data they need.

The customer-support channel will only be used for that data; I'll continue to work with you on this issue through GitHub.

BrendanCunningham commented 2 years ago

@JiajunHuang1999 any update?

JiajunHuang1999 commented 2 years ago

@BrendanCunningham I have been busy recently and have not been able to send that email. I will send one in a couple of days. Thanks for following up!

JiajunHuang1999 commented 2 years ago

@BrendanCunningham I have sent the email. Hope to hear from you soon. Thanks!!!

nikosT commented 2 months ago

Is there any update on that?