open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

An internal error has occurred in ORTE #4416

Closed: teonnik closed this issue 6 years ago

teonnik commented 6 years ago

Background information

Version

Open MPI 3.0.0 with CUDA support. I don't know exactly how Open MPI was installed; I am not the system administrator, but I will let you know as soon as I find out.

System

Linux juron1-adm 3.10.0-514.26.2.el7.ppc64le #1 SMP Mon Jul 10 02:18:17 GMT 2017 ppc64le ppc64le ppc64le GNU/Linux

18 IBM S822LC ("Minsky") servers.

All nodes are connected to a single Mellanox InfiniBand EDR switch.


Details of the problem

I have a C++14 code using CUDA Thrust (the CUDA part is C++11). When I tried to run it on multiple nodes, I received the following error:

--------------------------------------------------------------------------
mpirun: Forwarding signal 12 to job
[juronc06.juron.dns.zone:70129] [[1582,0],0] grpcomm:direct:send_relay proc [[1582,0],1] not running - cannot relay: NOT ALIVE 
--------------------------------------------------------------------------
An internal error has occurred in ORTE:

[[1582,0],0] FORCE-TERMINATE AT Unreachable:-12 - error ../../../../../orte/mca/grpcomm/direct/grpcomm_direct.c(548)

This is something that should be reported to the developers.
--------------------------------------------------------------------------

I was trying to execute a test from a library I wrote. The code can be found here.

The cluster uses LSF; I ran with the following command (a more verbose diagnostic variant is sketched after the LSF output below):

bsub -J gvec -n 4 -R "span[ptile=1]" -R "rusage[ngpus_shared=1]" -W 00:01 -q normal -e gvec.err -o gvec.out "mpirun /homeb/padc/padc013/asynchronator/juron/test/gvec"

Output from LSF:

Sender: LSF System <lsfadmin@juronc06.juron.dns.zone>
Subject: Job 7650: <gvec> in cluster <juron> Exited

Job <gvec> was submitted from host <juron1-adm> by user <padc013> in cluster <juron>.
Job was executed on host(s) <1*juronc06>, in queue <normal>, as user <padc013> in cluster <juron>.
                            <1*juronc04>
                            <1*juronc07>
                            <1*juronc03>
</gpfs/homeb/padc/padc013> was used as the home directory.
</gpfs/work/padc/padc013/alss> was used as the working directory.
Started at
Results reported on
Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun /homeb/padc/padc013/asynchronator/juron/test/gvec 
------------------------------------------------------------

TERM_RUNLIMIT: job killed after reaching LSF run time limit.
Exited with exit code 244.

Resource usage summary:

    CPU time :                                   0.47 sec.
    Max Memory :                                 25 MB
    Average Memory :                             25.00 MB
    Total Requested Memory :                     -
    Delta Memory :                               -
    Max Swap :                                   -
    Max Processes :                              3
    Max Threads :                                9
    Run time :                                   62 sec.
    Turnaround time :                            64 sec.

The output (if any) is above this job summary.

PS:

Read file <gvec.err> for stderr output of this job.
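
The "not running - cannot relay: NOT ALIVE" message usually indicates that the orted daemon on one of the remote nodes never started or died before mpirun could relay to it. A more verbose run along the following lines can show where the daemon launch fails; this is only a sketch, reusing the submission parameters above, and the gvec-dbg job name and dbg.err/dbg.out file names are placeholders:

# Diagnostic sketch: --debug-daemons keeps the remote daemon output attached to the job,
# and plm_base_verbose traces how mpirun launches the daemons through the LSF integration.
bsub -J gvec-dbg -n 4 -R "span[ptile=1]" -R "rusage[ngpus_shared=1]" -W 00:05 -q normal -e dbg.err -o dbg.out "mpirun --debug-daemons --mca plm_base_verbose 5 --mca orte_base_help_aggregate 0 /homeb/padc/padc013/asynchronator/juron/test/gvec"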
teonnik commented 6 years ago

The issue does not pertain to Open MPI; it is due to an incorrect cluster configuration.
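
A simple way to isolate this kind of problem is to take the application out of the picture and check whether mpirun can launch a trivial command on every allocated node; if that already fails, the fault is in the cluster or launcher setup rather than in the MPI program. A sketch, reusing the submission parameters from above (the sanity job name and output file names are placeholders):

# Sanity-check sketch: each of the four allocated hosts should print its hostname.
bsub -J sanity -n 4 -R "span[ptile=1]" -W 00:01 -q normal -e sanity.err -o sanity.out "mpirun hostname"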

lisalenorelowe commented 4 years ago

We are getting this same error message. Can you please tell me how you determined what was wrong with the cluster configuration? We are also using LSF.