open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Failed to create a queue pair (QP) #10923

Open NekoLemon opened 1 year ago

NekoLemon commented 1 year ago

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v3.1.5

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

It came bundled with NVHPC.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running


Details of the problem

I mistakenly upgraded my system from Ubuntu 20.04 to Ubuntu 22.04, which probably updated a lot of software. I had previously used NVHPC to compile VASP, and it worked well, but after the upgrade it prints a lot of errors. Because I'm new to high-performance computing, I can't determine which part is wrong. CUDA and cuDNN seem to be working correctly, since I can run TensorFlow normally. I've tried recompiling VASP, but nothing changed. I get this error just by running mpirun -np 1 vasp_std, and the output is:

--------------------------------------------------------------------------
Failed to create a queue pair (QP):

Hostname: catlemonx11dai-n
Requested max number of outstanding WRs in the SQ:                1
Requested max number of outstanding WRs in the RQ:                2
Requested max number of SGEs in a WR in the SQ:                   2048
Requested max number of SGEs in a WR in the RQ:                   1024
Requested max number of data that can be posted inline to the SQ: 0
Error:    Operation not supported

Check requested attributes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly.  This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: catlemonx11dai-n
--------------------------------------------------------------------------
[catlemonx11dai-n:101473] [[628,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../orte/util/show_help.c at line 507
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            catlemonx11dai-n
  Device name:           irdma0
  Device vendor ID:      0x8086
  Device vendor part ID: 14289

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           catlemonx11dai-n
  Local device:         irdma0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
 running on    1 total cores
 distrk:  each k-point on    1 cores,    1 groups
 distr:  one band on    1 cores,    1 groups
 OpenACC runtime initialized ...    1 GPUs detected
[catlemonx11dai-n:101477] *** Process received signal ***
[catlemonx11dai-n:101477] Signal: Segmentation fault (11)
[catlemonx11dai-n:101477] Signal code: Address not mapped (1)
[catlemonx11dai-n:101477] Failing at address: (nil)
[catlemonx11dai-n:101477] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x150e2261a520]
[catlemonx11dai-n:101477] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node catlemonx11dai-n exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[catlemonx11dai-n:101473] 1 more process has sent help message help-oob-ud.txt / create-qp-failed
[catlemonx11dai-n:101473] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[catlemonx11dai-n:101473] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found

I'm very sorry if I raised this issue in the wrong place, but could you help me find which part is wrong so I can ask for help?

jsquyres commented 1 year ago

Open MPI v3.1.4 is pretty ancient. Is there any chance you can upgrade to Open MPI v4.1.4? That's the latest release.

NekoLemon commented 1 year ago

Well, I just ran sudo apt-get install openmpi-bin, and it installed version v4.1.2. When I run /usr/bin/mpirun -np 1 vasp_std, the output looks like this:

--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    create-qp-failed
But I couldn't open the help file:
    /proj/nv/libraries/Linux_x86_64/22.9/openmpi/217647-rel-2/share/openmpi/help-oob-ud.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    no-ports-usable
But I couldn't open the help file:
    /proj/nv/libraries/Linux_x86_64/22.9/openmpi/217647-rel-2/share/openmpi/help-oob-ud.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    ini file:file not found
But I couldn't open the help file:
    /proj/nv/libraries/Linux_x86_64/22.9/openmpi/217647-rel-2/share/openmpi/help-mpi-btl-openib.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    no device params found
But I couldn't open the help file:
    /proj/nv/libraries/Linux_x86_64/22.9/openmpi/217647-rel-2/share/openmpi/help-mpi-btl-openib.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    no device params found
But I couldn't open the help file:
    /proj/nv/libraries/Linux_x86_64/22.9/openmpi/217647-rel-2/share/openmpi/help-mpi-btl-openib.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    no cpcs for port
But I couldn't open the help file:
    /proj/nv/libraries/Linux_x86_64/22.9/openmpi/217647-rel-2/share/openmpi/help-mpi-btl-openib-cpc-base.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
 running on    1 total cores
 distrk:  each k-point on    1 cores,    1 groups
 distr:  one band on    1 cores,    1 groups
 OpenACC runtime initialized ...    1 GPUs detected
[catlemonx11dai-n:111879] *** Process received signal ***
[catlemonx11dai-n:111879] Signal: Segmentation fault (11)
[catlemonx11dai-n:111879] Signal code: Address not mapped (1)
[catlemonx11dai-n:111879] Failing at address: (nil)
[catlemonx11dai-n:111879] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x153981a1a520]
[catlemonx11dai-n:111879] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node catlemonx11dai-n exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
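
The /proj/nv/... help-file paths in the messages above look like an install prefix compiled into the Open MPI that ships with NVHPC. That hints (an assumption, not something confirmed in this thread) that vasp_std still loads the NVHPC Open MPI libraries even when it is started with /usr/bin/mpirun. A minimal shell sketch for checking which launcher and which MPI libraries are actually picked up; the helper name mpi_libs_of is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: list the MPI shared libraries a binary resolves to.
mpi_libs_of() {
    # Resolve the binary on PATH; report and stop if it is not there.
    bin=$(command -v "$1" 2>/dev/null) || { echo "$1: not found on PATH"; return 0; }
    # ldd prints the shared objects the dynamic loader would pick up;
    # keep only the lines that mention Open MPI libraries.
    ldd "$bin" 2>/dev/null | grep -i -E 'libmpi|libopen-rte|libopen-pal' \
        || echo "$1: no MPI libraries listed by ldd"
}

# Which mpirun is first on PATH, and which Open MPI version does it report?
if command -v mpirun >/dev/null 2>&1; then
    command -v mpirun
    mpirun --version
else
    echo "mpirun: not found on PATH"
fi

# Which MPI libraries does the VASP binary link against?
mpi_libs_of vasp_std
```

If the library paths printed for vasp_std point into the NVHPC tree while mpirun comes from /usr/bin, the launcher and the application are using two different Open MPI installations, which is generally not a supported combination.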

I found a past build that runs normally, but even with it there is now a create-qp-failed error that never occurred before. After looking through some other issues, it seems that Open MPI thinks I have an InfiniBand connection, which doesn't exist at all? I also wonder whether I need the latest Open MPI to run a program compiled with NVHPC. Would there be any incompatibility? Thank you very much for your assistance; you can close this issue.
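
Since the machine has no InfiniBand fabric, one thing that could be tried (a sketch, not a confirmed fix for the segmentation fault) is telling Open MPI to skip the openib BTL entirely. The warning-related parameter name is taken from the log output earlier in this thread. Setting OMPI_MCA_btl in the environment is equivalent to passing --mca btl ^openib on the mpirun command line:

```shell
#!/bin/sh
# Exclude the openib BTL so Open MPI does not try to use the irdma0 device.
export OMPI_MCA_btl='^openib'

# Silence the "no preset parameters" warning, as suggested by the log's NOTE.
export OMPI_MCA_btl_openib_warn_no_device_params_found=0

# Launch only if both the launcher and the VASP binary are on PATH.
if command -v mpirun >/dev/null 2>&1 && command -v vasp_std >/dev/null 2>&1; then
    mpirun -np 1 vasp_std
else
    echo "mpirun or vasp_std not found on PATH; nothing launched"
fi
```

This should only affect the openib messages. The create-qp-failed message comes from the "ud" oob component, and the segmentation fault inside vasp_std may have an unrelated cause, so both may still appear.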