open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

mpirun spawn - task blocked & process rank X exits ( signal 11 ) #2664

Closed puneet336 closed 6 years ago

puneet336 commented 7 years ago

Hi all, I am having trouble with the lu-mz (class E) benchmark, which I compiled with Open MPI 2.0.1 (gcc 4.4.7) on an Intel(R) Xeon(R) CPU E5-2680 v3 (RHEL 6.5).

[puneet@login ]$ size my-lu-mz.E.8
text       data     bss     dec     hex filename
 125245    1124 24710532016 24710658385 5c0deb951   lu-mz.E.8

So this process is going to use at least 25GB per rank (I have 64GB of RAM per node), and I am running it with 8 processes, 1 process per node (PPN). I observed that the application crashes once memory utilization crosses 41%. Here is the standard output:

 Number of zones:   4 x   4
 Iterations: 300    dt:   0.500000
 Number of active processes:     8

 Use the default load factors with threads
 Total number of threads:    192  ( 24.0 threads/process)

 Calculated speedup =    192.00

--------------------------------------------------------------------------
mpirun noticed that process rank 4 with PID 8669 on node host07 exited on signal 11 (Segmentation fault).
------------------
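The per-rank memory estimate above can be checked directly from the `size` output. The following is a minimal sketch, not part of the original report: it parses the Class E row and expresses the statically allocated data (data + bss) in GB, GiB, and as a share of node memory. `NODE_RAM_BYTES` is an assumption (64 GiB); the memory actually usable on the nodes may differ.

```python
# Sketch: convert the static data size reported by `size` into GB/GiB
# and into a share of node RAM.
# Assumption: "64GB of RAM per node" is taken to mean 64 GiB.

SIZE_LINE = " 125245    1124 24710532016 24710658385 5c0deb951   lu-mz.E.8"
NODE_RAM_BYTES = 64 * 2**30  # assumed node memory, not stated exactly in the thread


def parse_size_line(line):
    """Split one data row of `size` output into named fields."""
    text, data, bss, dec, hex_total, filename = line.split()
    return {
        "text": int(text),
        "data": int(data),
        "bss": int(bss),
        "dec": int(dec),
        "hex": hex_total,
        "filename": filename,
    }


def static_footprint(fields):
    """Bytes of statically allocated data (data + bss) per process."""
    return fields["data"] + fields["bss"]


fields = parse_size_line(SIZE_LINE)
static_bytes = static_footprint(fields)
share = 100 * static_bytes / NODE_RAM_BYTES

print(
    f"{fields['filename']}: {static_bytes / 1e9:.2f} GB "
    f"({static_bytes / 2**30:.2f} GiB), {share:.1f}% of assumed node RAM"
)
```

Under the 64 GiB assumption, the static data alone comes to roughly 36% of node memory, which is below the 41% at which the crash was observed. The rest would have to come from elsewhere (heap, thread stacks, libraries); the thread does not break this down.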

I have attached the /var/log/messages log, where I can see messages like:


host03: Jan  5 19:54:13 host03 kernel: my-lu-mz.E.8  D 0000000000000011     0 27981  27958 0x00000080
host03: Jan  5 19:54:13 host03 kernel: ffff8806de6bbc98 0000000000000086 0000000000000000 ffff8806de6bbca0
host03: Jan  5 19:54:13 host03 kernel: ffffc90026e5b1c0 ffff880871381540 ffff8806de6bbc68 ffffffff810b2d90
host03: Jan  5 19:54:13 host03 kernel: ffffc90026e5b1c4 0000000000000000 ffff880871381af8 ffff8806de6bbfd8
host03: Jan  5 19:54:13 host03 kernel: Call Trace:
host03: Jan  5 19:54:13 host03 kernel: [<ffffffff810b2d90>] ? exit_robust_list+0x90/0x160
host03: Jan  5 19:54:13 host03 kernel: [<ffffffff81079aa5>] exit_mm+0x95/0x180
host03: Jan  5 19:54:13 host03 kernel: [<ffffffff81079eef>] do_exit+0x15f/0x870
host03: Jan  5 19:54:13 host03 kernel: [<ffffffff810b235c>] ? wake_futex+0x3c/0x60
host03: Jan  5 19:54:13 host03 kernel: [<ffffffff8107a658>] do_group_exit+0x58/0xd0
host03: Jan  5 19:54:13 host03 kernel: [<ffffffff81090306>] get_signal_to_deliver+0x1f6/0x460

When I compiled the class E benchmark with the Intel compilers, the simulation ran fine (45% memory utilization). The class D simulation also ran fine when compiled with the Open MPI compiler (1GB per rank).

[puneet@login ]$ size bin/lu-mz.D.8 
   text    data     bss     dec     hex filename
 124765    1124 1426183792  1426309681  5503c231    bin/lu-mz.D.8

Am I missing some compiler flag setting here (I have already tried with and without -mcmodel=medium), or is this an Open MPI-specific bug?
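As background for the -mcmodel question: on x86-64, GCC's default small code model expects the program and its symbols to be linked in the lower 2 GB of the address space, while -mcmodel=medium allows large data objects to be placed above that. The sketch below is illustrative only and does not establish the cause of the crash. It compares the bss sizes quoted in this report against that limit; the helper name `exceeds_small_model` is hypothetical.

```python
# Sketch: compare the bss sizes from the two `size` outputs in this report
# against the 2 GiB limit of the x86-64 small code model.
# This is only a size comparison; it says nothing about how the binaries
# were actually linked.

SMALL_MODEL_LIMIT = 2**31  # bytes (2 GiB)

BSS_BYTES = {
    "lu-mz.D.8": 1426183792,   # Class D, from the `size` output
    "lu-mz.E.8": 24710532016,  # Class E, from the `size` output
}


def exceeds_small_model(bss_bytes, limit=SMALL_MODEL_LIMIT):
    """Return True if the static data cannot fit below the small-model limit."""
    return bss_bytes >= limit


for name, bss in BSS_BYTES.items():
    verdict = "exceeds" if exceeds_small_model(bss) else "fits within"
    print(f"{name}: bss = {bss / 2**30:.2f} GiB, {verdict} the 2 GiB limit")
```

Class D stays under the limit and Class E is far above it, which is consistent with only the Class E build being sensitive to the code model. Whether that explains the segmentation fault is not confirmed anywhere in this thread.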

log1.txt

puneet336 commented 7 years ago

Hi all, this issue is specific to the Open MPI compilers. I compiled the application with various compilers; here are my observations:

COMPILER ; VERSION ; FLAG ; COMMENT
openmpi ; 1.10.0 ; -mcmodel=large ; compilation error
openmpi ; 1.10.0 ; without mcmodel/-mcmodel=medium ; compilation OK, simulation crashed - got corefiles
openmpi ; 2.0.1 ; -mcmodel=large ; compilation error
openmpi ; 1.10.0 ; without mcmodel/-mcmodel=medium ; compilation OK, simulation crashed - got corefiles
mpich ; 3.2 ; without/with mcmodel ; simulation ran fine
intel ; 2015 ; without/with mcmodel ; simulation ran fine

The compilation error is as follows:

mpifort -c  -O2 -fPIC -mcmodel=large -fopenmp erhs.f
erhs.f: In function ‘erhs_.omp_fn.0’:
erhs.f:43: error: unrecognizable insn:
(call_insn/u 3652 3651 3653 48 erhs.f:272 (parallel [
            (set (reg:DI 0 ax)
                (call:DI (mem:QI (symbol_ref:DI ("__tls_get_addr")) [0 S1 A8])
                    (const_int 0 [0x0])))
            (unspec:DI [
                    (symbol_ref:DI ("work_1d_") [flags 0x10] <var_decl 0x2b9598e33e60 work_1d>)
                ] 21)
        ]) -1 (expr_list:REG_EH_REGION (const_int -1 [0xffffffffffffffff])
        (nil))
    (nil))
erhs.f:43: internal compiler error: in extract_insn, at recog.c:2078
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://bugzilla.redhat.com/bugzilla> for instructions.
make[2]: *** [erhs.o] Error 1

Eagerly awaiting your replies.

rhc54 commented 7 years ago

Sorry for the delay - can you try this with OMPI master (soon to be released as v3.0) and see if the problem continues?

rhc54 commented 6 years ago

Please reopen if the problem continues.