Open NekoLemon opened 1 year ago
Open MPI v3.1.4 is pretty ancient. Is there any chance you can upgrade to Open MPI v4.1.4? That's the latest release.
Well, I just ran sudo apt-get install openmpi-bin
and it installed version v4.1.2.
When I run /usr/bin/mpirun -np 1 vasp_std
the output looks like this:
--------------------------------------------------------------------------
Sorry! You were supposed to get help about:
create-qp-failed
But I couldn't open the help file:
/proj/nv/libraries/Linux_x86_64/22.9/openmpi/217647-rel-2/share/openmpi/help-oob-ud.txt: No such file or directory. Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry! You were supposed to get help about:
no-ports-usable
But I couldn't open the help file:
/proj/nv/libraries/Linux_x86_64/22.9/openmpi/217647-rel-2/share/openmpi/help-oob-ud.txt: No such file or directory. Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry! You were supposed to get help about:
ini file:file not found
But I couldn't open the help file:
/proj/nv/libraries/Linux_x86_64/22.9/openmpi/217647-rel-2/share/openmpi/help-mpi-btl-openib.txt: No such file or directory. Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry! You were supposed to get help about:
no device params found
But I couldn't open the help file:
/proj/nv/libraries/Linux_x86_64/22.9/openmpi/217647-rel-2/share/openmpi/help-mpi-btl-openib.txt: No such file or directory. Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry! You were supposed to get help about:
no device params found
But I couldn't open the help file:
/proj/nv/libraries/Linux_x86_64/22.9/openmpi/217647-rel-2/share/openmpi/help-mpi-btl-openib.txt: No such file or directory. Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry! You were supposed to get help about:
no cpcs for port
But I couldn't open the help file:
/proj/nv/libraries/Linux_x86_64/22.9/openmpi/217647-rel-2/share/openmpi/help-mpi-btl-openib-cpc-base.txt: No such file or directory. Sorry!
--------------------------------------------------------------------------
running on 1 total cores
distrk: each k-point on 1 cores, 1 groups
distr: one band on 1 cores, 1 groups
OpenACC runtime initialized ... 1 GPUs detected
[catlemonx11dai-n:111879] *** Process received signal ***
[catlemonx11dai-n:111879] Signal: Segmentation fault (11)
[catlemonx11dai-n:111879] Signal code: Address not mapped (1)
[catlemonx11dai-n:111879] Failing at address: (nil)
[catlemonx11dai-n:111879] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x153981a1a520]
[catlemonx11dai-n:111879] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node catlemonx11dai-n exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
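Every missing help file in the output above sits under the same installation prefix, which hints at which Open MPI installation the process was actually using. The following is a minimal Python sketch for pulling that prefix out of a saved copy of the output. The function name `help_file_prefixes` and the `sample` text are illustrative only and are not part of Open MPI.

```python
import re
from collections import Counter

# Matches the path line that follows "But I couldn't open the help file:".
# The optional whitespace after "help" tolerates copies of the log where the
# file name was split by a stray space.
HELP_PATH = re.compile(
    r"^(?P<prefix>/\S+)/share/openmpi/help\s?-\S+\.txt:", re.MULTILINE
)


def help_file_prefixes(log_text):
    """Count the installation prefixes seen in missing-help-file paths."""
    return Counter(m.group("prefix") for m in HELP_PATH.finditer(log_text))


# Two path lines taken from the output in this thread.
sample = """\
But I couldn't open the help file:
/proj/nv/libraries/Linux_x86_64/22.9/openmpi/217647-rel-2/share/openmpi/help-oob-ud.txt: No such file or directory. Sorry!
But I couldn't open the help file:
/proj/nv/libraries/Linux_x86_64/22.9/openmpi/217647-rel-2/share/openmpi/help-mpi-btl-openib.txt: No such file or directory. Sorry!
"""

prefixes = help_file_prefixes(sample)
for prefix, count in prefixes.items():
    print(count, prefix)
```

A single prefix under /proj/nv/... that does not exist on the local machine would be consistent with vasp_std still loading the Open MPI libraries bundled with NVHPC 22.9 rather than the apt-installed v4.1.2. That is an inference from the paths, not something the log states directly.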
I found a past build that runs normally, although it still shows a create-qp-failed error that never occurred before. After looking through some other issues, it looks like Open MPI thinks I have an InfiniBand connection, which doesn't exist at all? I also wonder whether I need the latest Open MPI to run a program compiled with NVHPC. Would there be any incompatibility? Thank you very much for your assistance; you can close this issue.
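On the InfiniBand question: in Open MPI v3.x and v4.x the openib BTL component can be excluded at launch with `--mca btl ^openib`. The Python sketch below only builds and prints such a command line. The helper name `mpirun_without_openib` is hypothetical, and whether this silences the messages on this particular machine is an assumption. The oob-ud messages come from a different component and may remain, and this does not address the segmentation fault.

```python
import shutil


def mpirun_without_openib(program, nprocs=1, mpirun="mpirun"):
    """Build an mpirun argument list that excludes the openib BTL component."""
    return [mpirun, "--mca", "btl", "^openib", "-np", str(nprocs), program]


cmd = mpirun_without_openib("vasp_std")
print(" ".join(cmd))

# Show which mpirun would be picked up from PATH (None if none is installed).
# With both an NVHPC and an apt-installed Open MPI present, this tells you
# which of the two a bare "mpirun" actually refers to.
print(shutil.which("mpirun"))
```

The printed command can be pasted into a shell as-is. Passing an absolute path as `mpirun`, for example /usr/bin/mpirun, makes it explicit which installation is being used.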
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v3.1.5
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Exists in NVHPC.
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
Please describe the system on which you are running
Details of the problem
I mistakenly upgraded my system from Ubuntu 20.04 to Ubuntu 22.04, which may have updated a lot of software. I had previously used NVHPC to compile VASP, and it worked well, but after the upgrade it outputs a lot of errors. Because I'm a newbie to high-performance computing, I can't determine which part is wrong. CUDA and cuDNN seem to be working correctly, since I can run TensorFlow normally. I've tried recompiling VASP, but nothing changed. I encounter this error just by running
mpirun -np 1 vasp_std
And it outputs this:
I'm very sorry if I raised this issue in the wrong place, but could you help me find out which part is wrong so that I can ask for help?