An error occurred in MPI_Init (seems the network initialization has some bugs) #10601
Open · JiajunHuang1999 opened this issue 2 years ago
This seems to be the relevant part:
bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy
bdw-0165.11364hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (1/3)
bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy
bdw-0165.11364hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (2/3)
bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy
bdw-0165.11364hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (3/3)
bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy
@BrendanCunningham Is OmniPath your purview?
Yes, this is mine. Sorry I didn't notice it before. Will take a look.
@JiajunHuang1999 if your job does not require the Open MPI OFI BTL (used for internode one-sided communication), can you try your job with -mca btl self,vader and report whether that works?
bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy
bdw-0165.11364hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (1/3)
bdw-0165.11364hfi_userinit_internal: assign_context command failed: Device or resource busy
This message typically occurs when the job requires more HFI contexts per node than are available. By default, there is one HFI context per physical CPU core. Multiple PSM2 processes are able to share one context, allowing more ranks than there are physical CPU cores, but PSM2 context sharing limits each process to one context.
Open MPI v4.1.x built with OFI libfabric support uses the OFI BTL by default. The OFI BTL opens an additional HFI context per process. This disables PSM2 context sharing and can cause jobs to fail like this.
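A quick way to check whether the OFI BTL is in play on a given build, and to rule it out explicitly, is sketched below; the commands are an added illustration, and the hfi1 sysfs paths in particular may differ between driver versions:
# List the BTL components this Open MPI build provides (look for "ofi")
ompi_info | grep "MCA btl"
# Run without the OFI BTL so PSM2 context sharing remains available
mpirun -n 64 -mca btl self,vader ./collective_OMPI 16777216
# Optionally inspect the total and free HFI contexts on a node (hfi1 driver sysfs)
cat /sys/class/infiniband/hfi1_0/nctxts /sys/class/infiniband/hfi1_0/nfreectxts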
@BrendanCunningham I see, I will try this during the week and give you feedback! Thanks for your quick help!
@BrendanCunningham Hi, when I run the job I sometimes still get the following error, so it seems the problem is not fully resolved.
mpirun -n 64 -mca btl self,vader ./collective_OMPI 16777216
[1] 17538 killed mpirun -n 64 -mca btl self,vader ./collective_OMPI 16777216
Thanks.
@JiajunHuang1999 is there any other output indicating what killed the job? Also, what does the 16777216 argument to the job do? Does the job run with 64 ranks if you change that argument? Thanks.
@BrendanCunningham No, it just gets killed. 16777216 is the size of the MPI_FLOAT input array. I just run the job with 64 ranks and no special configuration. Thanks.
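As a rough memory estimate (an added aside, not from the thread; it assumes 16777216 is the per-rank element count, 4-byte floats, and a collective that aggregates contributions from all 64 ranks):
# 16777216 floats at 4 bytes each is the per-rank buffer size
echo $(( 16777216 * 4 / 1048576 )) MiB
# If a collective such as MPI_Allgather collects that buffer from all 64 ranks,
# each rank's receive buffer alone would be
echo $(( 16777216 * 4 * 64 / 1073741824 )) GiB
A footprint in that range across many ranks on a single node would fit the out-of-memory scenario raised in the next reply.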
@JiajunHuang1999 any updates on this? Any core files produced, or output in dmesg or other system logs showing events around the time the job fails?
I ask about dmesg because in the past, when I've seen jobs die like this without any error output from MPI or the ranks, it has happened when a rank has been killed by oom-killer, the out-of-memory handler.
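A minimal way to check for that (an added sketch; it requires read access to the kernel log, and the exact message wording varies by kernel version):
# Look for out-of-memory killer activity around the time the ranks died
dmesg -T | grep -i -E "out of memory|oom-killer|killed process"
# On systemd-based nodes, the kernel journal covers the same events
journalctl -k --since "1 hour ago" | grep -i oom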
@BrendanCunningham Still not solved. I will do the experiments again and send the files to you if there are any.
@BrendanCunningham It seems the same problem appeared again this time. It is somewhat random: some tests succeeded on the first run but failed on the second.
mpirun -n 64 -N 32 -mca btl self,vader ./collective_OMPI 32768
bdw-0214.30734PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
PSM2 was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.
Error: Failure in initializing endpoint
--------------------------------------------------------------------------
bdw-0214.30740PSM2 can't open hfi unit: -1 (err=23)
bdw-0213.13967PSM2 can't open hfi unit: -1 (err=23)
bdw-0213.13966PSM2 can't open hfi unit: -1 (err=23)
bdw-0213.13972PSM2 can't open hfi unit: -1 (err=23)
bdw-0213.13965PSM2 can't open hfi unit: -1 (err=23)
bdw-0214.30735PSM2 can't open hfi unit: -1 (err=23)
......
bdw-0213.14030PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[bdw-0213:13994] *** An error occurred in MPI_Init
[bdw-0213:13994] *** reported by process [2870149121,23]
[bdw-0213:13994] *** on a NULL communicator
[bdw-0213:13994] *** Unknown error
[bdw-0213:13994] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[bdw-0213:13994] *** and potentially your MPI job)
[beboplogin3:12682] 111 more processes have sent help message help-mtl-psm2.txt / unable to open endpoint
[beboplogin3:12682] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[beboplogin3:12682] 63 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[beboplogin3:12682] 34 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
@JiajunHuang1999
I think we're going to need more information to help you with this problem, specifically an opcapture that will get the fabric configuration and software versions.
To facilitate data requests and file sharing, we can use our support system. To do that, you can open a support case by sending an email to support@cornelisnetworks.com with the subject "An error occurred in MPI_Init (seems the network initialization has some bugs) #10601". Support will provide instructions on what data they need.
The customer-support channel will just be for collecting that data; I'll continue to work with you on this issue through GitHub.
@JiajunHuang1999 any update?
@BrendanCunningham I have been busy recently and have not been able to send that email. I will send one in a couple of days. Thanks for following up!
@BrendanCunningham I have sent the email. Hope to hear from you soon. Thanks!!!
Is there any update on that?
There seem to be some bugs in openmpi-4.1.4 when it is used with Slurm 17.11.7 and Intel Omni-Path. My CPUs are Intel Xeon E5-2695v4 (Broadwell nodes with 15 GB /scratch). When I use
srun --mpi=pmi2 -n 37 -p bdw ./my_program
the following errors appear (the hfi_userinit / PSM2 messages quoted in the first comment above). I searched the internet and found that someone has already hit a similar problem using Broadwell nodes, InfiniBand, Slurm 18.08.3, and Open MPI. Here is the link: https://bugs.schedmd.com/show_bug.cgi?id=5956. The last message in that ticket says they were contacting the Open MPI team about the problem, but I doubt the bug has been fixed. The job does seem to work with 36 processes. I think that may be because, with 36 or fewer processes, all ranks fit on a single node, since each Broadwell node (Intel Xeon E5-2695v4) has 36 cores.
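One way to test the single-node hypothesis (an added sketch; the partition name and MPI settings come from the command above, the rest is illustrative) is to see which nodes the ranks actually land on:
# Compare where 36 vs. 37 ranks are placed
srun --mpi=pmi2 -n 36 -p bdw hostname | sort | uniq -c
srun --mpi=pmi2 -n 37 -p bdw hostname | sort | uniq -c
# If only the 37-rank run spans a second node and only it hits the PSM2 errors,
# the failure points at internode (HFI) initialization rather than at the program itself.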