open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Open MPI not running with signal codes *** End of error message *** #12435

Closed Minyoung-sss closed 5 months ago

Minyoung-sss commented 7 months ago

Hello

I installed Open MPI version 4.1.2 and ran MAKER version 3.1.2, but it stops immediately with the error below. (I ran it inside an anaconda3 environment named 'MAKER'.)

*** end of error message ***
sigterm received
sigterm thread
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x5a4
[ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f26eb242520]
[ 1] /home/kucmb/anaconda3/envs/MAKER/bin/../lib/perl5/5.32/core_perl/CORE/libperl.so(Perl_csighandler3+0x38) [0x7f26eb6ff698]
[2 ] /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f26eb242520]
[3 ] /lib/x86_64-linux-gnu/libc.so.6(_poll+0x4f) [0x726eb318bcf]
[4 ] /lib//x86_64-linux-gnu/libevent_core-2.1.so.7(+0x24309) [0x7f26eb0ed309]
[5 ] /lib//x86_64-linux-gnu/libevent_core-2.1.so.7(event_base_loop+0x2a1) [0x7f26eb0e8921]
[6 ] /usr/local/lib/libopen-pal.so.40(+0x37e46) [0x7f26eb4bfe46]
[7 ] /lib//x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f26eb294ac3]
[8 ] /lib//x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f26eb326850]
*** End of error message ***

-------------------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------------------------------
Perl exited with active threads:
             1 running and unjoined
             0 finished and unjoined
             0 running and detached

Screenshot from 2024-03-26 17-07-40

In addition, when I used the option '--mca btl ^openlib', this error came out: Screenshot from 2024-03-26 17-00-48

What does this mean? I can't find out what kind of error this is or what causes it.

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v4.1.2

I first ran MAKER with mpirun from Open MPI v4.1.6, but the run stopped immediately with the same error. So I checked whether a different MPI version was already installed on my computer, and I found the Ubuntu packages 'openmpi-bin' and 'openmpi-common' at version 4.1.2. I thought this was the cause, so I downgraded Open MPI to version 4.1.2.

Is that right? I don't know Ubuntu and MPI well, because I started studying bioinformatics only one month ago.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

sudo ./configure --prefix=/usr/local --enable-mpirun-prefix-by-default
sudo make
sudo make install
![Screenshot from 2024-03-26 16-26-30](https://github.com/open-mpi/ompi/assets/153480806/9bc786d3-77a3-4428-8b94-b6722ad6d5c3)

https://chat.stackoverflow.com/rooms/153365/discussion-between-imworsethanyou-and-gilles-gouaillardet

vi ~/.bashrc
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so
source ~/.bashrc
export LD_LIBRARY_PATH=/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

MAKER install
perl Build.PL
  /bin/mpicc (location of mpicc)
  /usr/local/include (location of mpi.h)
install

Details of the problem

shell$ mpirun -np 12 maker maker_opts.ctl maker_exe.ctl maker_bopts.ctl

I don't know why the same error appears and the MPI run stops. Please help me.

Best Regards

Thank you for reading

lrbison commented 7 months ago

The error message suggests you add RDMAV_FORK_SAFE=1. Have you tried adding the following to your mpiexec line:

mpiexec -x RDMAV_FORK_SAFE=1 -np 12 ...

Additionally, your mpiexec output indicates you are using EFA via libfabric, but your config.log output indicates it will not be built with libfabric support. Are you sure you are running the mpi version you think you are?
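
One way to answer that question is to list every launcher the shell can find and the version each one reports. The following is a minimal sketch, assuming a POSIX shell; the function name `list_mpi_launchers` is made up for illustration and is not part of Open MPI.

```shell
#!/bin/sh
# Hypothetical helper: print every mpirun/mpiexec found on PATH together
# with the first line of its --version output.
list_mpi_launchers() (
    # Runs in a subshell, so the IFS and globbing changes stay local.
    set -f
    IFS=:
    for dir in $PATH; do
        # An empty PATH entry means the current directory.
        [ -n "$dir" ] || dir=.
        for name in mpirun mpiexec; do
            if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
                version=$("$dir/$name" --version 2>/dev/null | head -n 1)
                printf '%s\t%s\n' "$dir/$name" "$version"
            fi
        done
    done
)

list_mpi_launchers
```

If more than one installation shows up (for example one under /usr/local/bin and one under /usr/bin), then for each name the entry printed first is the one a bare `mpirun` or `mpiexec` will run.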

Minyoung-sss commented 7 months ago

Thank you for your kind answer.

I will try this command again: mpiexec -x RDMAV_FORK_SAFE=1 -np 12 ...

But I have a question about the option '-x RDMAV_FORK_SAFE=1': is it related to the RDMA environment, and is that what causes the signal 11 error? I searched Google for this error and found that it is related to computer memory and that the default is '0' (off).

Also, I did not know before your answer that my mpiexec output showed EFA via libfabric being used. LOL. I did not configure anything for EFA or libfabric, so it is correct that it was not built with libfabric support.

So I checked the running Open MPI version again, and I also checked the website of MAKER, which I want to run with Open MPI.

$ mpirun --version
mpirun (Open MPI) 4.1.2
$ which mpirun
/usr/local/bin/mpirun

The MAKER program can be used with any version of Open MPI or MPICH.

Do you think I should change my MPI version, or should I build with libfabric support?

If I need to re-install a different MPI version, how can I completely remove the old MPI version? Or, if I should build with libfabric support, how can I do that?

Thank you for helping a rookie who still has a lot to learn.

Regards.

Minyoung-sss commented 7 months ago

In addition, the version of my Ubuntu openmpi packages is 4.1.2: Screenshot from 2024-03-27 10-32-09

If I need to re-install a different MPI version, should I remove these as well?

I followed this website when I first installed Open MPI, so I think these packages are needed to install MPI.

ggouaillardet commented 7 months ago

Why don't you try the workaround first?

Note you have to use mpirun from the library that was used to build your application.

Minyoung-sss commented 7 months ago

I tried this command: mpiexec -x RDMAV_FORK_SAFE=1 -np 12 ...

However, the same error came out.

(MAKER) kucmb@kucmb-System-Product-Name:~/maker$ mpiexec -x RDMAV_FORK_SAFE=1 -np 12 maker maker_exe.ctl maker_opts.ctl maker_bopts.ctl
STATUS: Parsing control files...
STATUS: Processing and indexing input FASTA files...
[kucmb-System-Product-Name:401470] *** Process received signal ***
[kucmb-System-Product-Name:401470] Signal: Segmentation fault (11)
[kucmb-System-Product-Name:401470] Signal code: Address not mapped (1)
[kucmb-System-Product-Name:401470] Failing at address: 0x5a4
[kucmb-System-Product-Name:401470] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f1e33a42520]
[kucmb-System-Product-Name:401470] [ 1] /home/kucmb/anaconda3/envs/MAKER/bin/../lib/perl5/5.32/core_perl/CORE/libperl.so(Perl_csighandler3+0x38)[0x7f1e33eff698]
[kucmb-System-Product-Name:401470] [ 2] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f1e33a42520]
[kucmb-System-Product-Name:401470] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__poll+0x4f)[0x7f1e33b18bcf]
[kucmb-System-Product-Name:401470] [ 4] /lib/x86_64-linux-gnu/libevent_core-2.1.so.7(+0x24309)[0x7f1e339d3309]
[kucmb-System-Product-Name:401470] [ 5] /lib/x86_64-linux-gnu/libevent_core-2.1.so.7(event_base_loop+0x2a1)[0x7f1e339ce921]
[kucmb-System-Product-Name:401470] [ 6] /lib/x86_64-linux-gnu/libopen-pal.so.40(+0x2d646)[0x7f1e33d7a646]
[kucmb-System-Product-Name:401470] [ 7] /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f1e33a94ac3]
[kucmb-System-Product-Name:401470] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x7f1e33b26850]
[kucmb-System-Product-Name:401470] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
SIGTERM received
SIGTERM received
SIGTERM received
SIGTERM received
SIGTERM received
SIGTERM received
SIGTERM received
SIGTERM received
SIGTERM received
SIGTERM received
SIGTERM received
--------------------------------------------------------------------------
mpiexec noticed that process rank 11 with PID 0 on node kucmb-System-Product-Name exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

But then I ran the previous command again by mistake (mpiexec -np 12 maker...), and it ran.

I don't know why this command works now. I haven't changed anything. I'll see whether things keep going well. I have a feeling that a new problem may come up during this run.

Thank you so much.
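
The backtraces posted so far can also be checked for which Open MPI libraries were actually loaded. The following is a minimal sketch, assuming the crash output has been saved to a text file and that grep supports `-o` and `-E` (GNU and BSD grep do); the function name `mpi_libs_in_backtrace` and the file name crash.log are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read a backtrace on stdin and print the distinct
# paths of the Open MPI shared libraries (libmpi, libopen-pal,
# libopen-rte) that appear in it.
mpi_libs_in_backtrace() {
    # A path starts at "/" and ends before a space or parenthesis.
    grep -oE '/[^ ()]*lib(mpi|open-pal|open-rte)[^ ()]*\.so[^ ()]*' |
        LC_ALL=C sort -u
}

# Usage (crash.log is a placeholder for the saved output):
#   mpi_libs_in_backtrace < crash.log
```

In this thread the first report lists /usr/local/lib/libopen-pal.so.40, while the mpiexec run above lists /lib/x86_64-linux-gnu/libopen-pal.so.40. Seeing more than one directory here may indicate that two different Open MPI installations are being picked up.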

Minyoung-sss commented 7 months ago

Hello, everyone.

My computer ran this command fine for 3 days, but it suddenly stopped this morning.

#-------------------------------#
SIGTERM thread
SIGTERM received
deleted:130 hits
collecting blastn reports
SIGTERM thread
[kucmb-System-Product-Name:402259] *** Process received signal ***
[kucmb-System-Product-Name:402259] Signal: Segmentation fault (11)
[kucmb-System-Product-Name:402259] Signal code: Address not mapped (1)
[kucmb-System-Product-Name:402259] Failing at address: 0x5a4
[kucmb-System-Product-Name:402259] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f7159a42520]
[kucmb-System-Product-Name:402259] [ 1] /home/kucmb/anaconda3/envs/MAKER/bin/../lib/perl5/5.32/core_perl/CORE/libperl.so(Perl_csighandler3+0x38)[0x7f7159eff698]
[kucmb-System-Product-Name:402259] [ 2] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f7159a42520]
[kucmb-System-Product-Name:402259] [ 3] /home/kucmb/anaconda3/envs/MAKER/bin/../lib/perl5/5.32/core_perl/CORE/libperl.so(Perl_csighandler+0x0)[0x7f7159eff710]
[kucmb-System-Product-Name:402259] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f7159a42520]
[kucmb-System-Product-Name:402259] [ 5] /lib/x86_64-linux-gnu/libc.so.6(__poll+0x4f)[0x7f7159b18bcf]
[kucmb-System-Product-Name:402259] [ 6] /lib/x86_64-linux-gnu/libevent_core-2.1.so.7(+0x24309)[0x7f7159c76309]
[kucmb-System-Product-Name:402259] [ 7] /lib/x86_64-linux-gnu/libevent_core-2.1.so.7(event_base_loop+0x2a1)[0x7f7159c71921]
[kucmb-System-Product-Name:402259] [ 8] /lib/x86_64-linux-gnu/libopen-pal.so.40(+0x2d646)[0x7f715a1fc646]
[kucmb-System-Product-Name:402259] [ 9] /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f7159a94ac3]
[kucmb-System-Product-Name:402259] [10] /lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x7f7159b26850]
[kucmb-System-Product-Name:402259] *** End of error message ***
running  blast search.

#-------------------------------#
deleted:90 hits
SIGTERM thread
SIGTERM received
--------------------------------------------------------------------------
mpiexec noticed that process rank 8 with PID 0 on node kucmb-System-Product-Name exited on signal 11 (Segmentation fault).

The same error again. Is it a disk capacity problem? This morning I got a notification that disk space was insufficient.

Please give me any help

Thank you
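
Whether the disk is actually full can be checked directly before restarting the run. The following is a minimal sketch, assuming POSIX `df` and `awk`; the function name `free_kib` is made up for illustration. It only confirms or rules out the low-space notification and does not show whether a full disk caused the segmentation fault.

```shell
#!/bin/sh
# Hypothetical helper: print the space still available (in KiB) on the
# filesystem that holds the given directory (default: current directory).
free_kib() {
    # -P forces the portable one-line-per-filesystem layout, -k reports
    # sizes in 1024-byte blocks; column 4 of that layout is "Available".
    df -Pk "${1:-.}" | awk 'NR == 2 { print $4 }'
}

free_kib .
```

Running it from the directory where MAKER writes its output (~/maker in the prompt shown earlier) reports the space left on that filesystem.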

lrbison commented 6 months ago

There is not enough information here to help debug the problem. I suspect you are still mixing installation and runtime versions.

I suggest you do the following:

github-actions[bot] commented 5 months ago

It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.

github-actions[bot] commented 5 months ago

Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned.

I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!