Closed schrummy14 closed 11 months ago
@schrummy14 Thanks for moving the conversation. You are in the right hands 😄
First of all we need to understand Cannot allocate memory (12). Could you export FI_LOG_LEVEL=warn in the job and report the relevant logs?
Based on the ompi issue I imagine you have already included a new-enough libfabric. Could you post the result of git grep EFA_SHM_NAME_MAX?
A good-to-know thing would be the output of /opt/JARVICE/bin/fi_info -p efa, to make sure both EFA NICs are discoverable.
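For reference, one way to wire that up (a sketch; the mpirun arguments and paths are taken from the reporter's own command, so adjust to your job):

```shell
# Ask libfabric to print warnings, and forward the variable to every rank with -x.
export FI_LOG_LEVEL=warn
mpirun -np 192 -hostfile /etc/JARVICE/nodes -x FI_LOG_LEVEL -x FOAM_SETTINGS \
    --map-by node --mca pml cm --mca mtl ofi \
    /opt/OpenFOAM/OpenFOAM-11/bin/foamExec -prefix /opt/OpenFOAM foamRun -parallel

# Separately, confirm the provider sees the expected NICs.
/opt/JARVICE/bin/fi_info -p efa
```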
^^ It's not EFA_SHM_NAME_MAX; it should be shm_av_size, which you should see set to 256 by default.
(venv) (venv) [ec2-user@ip-172-31-18-57 libfabric]$ git grep shm_av_size
NEWS.md:- Increase default shm_av_size to 256
prov/efa/src/efa_av.c: if (av->shm_used >= rxr_env.shm_av_size) {
prov/efa/src/efa_av.c: rxr_env.shm_av_size);
prov/efa/src/efa_av.c: assert(peer->shm_fiaddr < rxr_env.shm_av_size);
prov/efa/src/efa_av.c: assert(peer->shm_fiaddr < rxr_env.shm_av_size);
prov/efa/src/efa_av.c: if (rxr_env.shm_av_size > EFA_SHM_MAX_AV_COUNT) {
prov/efa/src/efa_av.c: av_attr.count = rxr_env.shm_av_size;
prov/efa/src/rdm/rxr_env.c: .shm_av_size = 256,
prov/efa/src/rdm/rxr_env.c: fi_param_get_int(&efa_prov, "shm_av_size", &rxr_env.shm_av_size);
prov/efa/src/rdm/rxr_env.c: fi_param_define(&efa_prov, "shm_av_size", FI_PARAM_INT,
prov/efa/src/rdm/rxr_env.h: int shm_av_size;
The other points made by wenduwan@ totally make sense. Can you run with -x FI_LOG_LEVEL=warn to get the logs showing why fi_endpoint failed with cannot allocate memory?
Looks like there is a difference between the two commands
shm_av_size looks to be set to 128 in 1.18.2 (at least if you run git clone -b v1.18.2 https://github.com/ofiwg/libfabric.git).
Going to try again setting the branch to 1.18.x, which does give the correct 256 value.
git grep shm_av_size
prov/efa/src/efa_av.c: if (av->shm_used >= rxr_env.shm_av_size) {
prov/efa/src/efa_av.c: rxr_env.shm_av_size);
prov/efa/src/efa_av.c: assert(peer->shm_fiaddr < rxr_env.shm_av_size);
prov/efa/src/efa_av.c: assert(peer->shm_fiaddr < rxr_env.shm_av_size);
prov/efa/src/efa_av.c: if (rxr_env.shm_av_size > EFA_SHM_MAX_AV_COUNT) {
prov/efa/src/efa_av.c: av_attr.count = rxr_env.shm_av_size;
prov/efa/src/rdm/rxr_env.c: .shm_av_size = 128,
prov/efa/src/rdm/rxr_env.c: fi_param_get_int(&efa_prov, "shm_av_size", &rxr_env.shm_av_size);
prov/efa/src/rdm/rxr_env.c: fi_param_define(&efa_prov, "shm_av_size", FI_PARAM_INT,
prov/efa/src/rdm/rxr_env.h: int shm_av_size;
git grep EFA_SHM_NAME_MAX
prov/efa/src/efa_av.c: char smr_name[EFA_SHM_NAME_MAX];
prov/efa/src/efa_av.c: smr_name_len = EFA_SHM_NAME_MAX;
prov/efa/src/efa_shm.h:#define EFA_SHM_NAME_MAX (256)
prov/efa/src/rdm/rxr_ep.c: char shm_ep_name[EFA_SHM_NAME_MAX], ep_addr_str[OFI_ADDRSTRLEN];
prov/efa/src/rdm/rxr_ep.c: shm_ep_name_len = EFA_SHM_NAME_MAX;
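The comparison above can be reproduced roughly like this (a sketch: `v1.18.x` is my guess at the stable branch name, and the expected defaults are simply the values shown in the greps in this thread):

```shell
# v1.18.2 tag: default reported as 128 in this thread
git clone --depth 1 -b v1.18.2 https://github.com/ofiwg/libfabric.git libfabric-v1.18.2
git -C libfabric-v1.18.2 grep "shm_av_size = " -- prov/efa

# 1.18.x stable branch: default reported as 256
git clone --depth 1 -b v1.18.x https://github.com/ofiwg/libfabric.git libfabric-v1.18.x
git -C libfabric-v1.18.x grep "shm_av_size = " -- prov/efa
```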
Log output:
mpirun -np 192 -hostfile /etc/JARVICE/nodes -x FOAM_SETTINGS --map-by node --mca pml cm --mca mtl ofi /opt/OpenFOAM/OpenFOAM-11/bin/foamExec -prefix /opt/OpenFOAM foamRun -parallel
libfabric:1090:1695321637::efa:ep_ctrl:rxr_ep_ctrl():940<warn> libfabric 1.18.2 efa endpoint created! address: fi_addr_efa://[fe80::40d:68ff:fe4e:c605]:0:1610828443 (Repeated many times)
...
libfabric:2412:1695321654::efa:ep_ctrl:rxr_ep_ctrl():940<warn> libfabric 1.18.2 efa endpoint created! address: fi_addr_efa://[fe80::40d:68ff:fe4e:c605]:254:1163041373
libfabric:2586:1695321654::efa:ep_ctrl:efa_base_ep_create_qp():198<warn> ibv_create_qp failed. errno: 12
libfabric:2586:1695321654::efa:cq:rxr_endpoint():2582<warn> Unable to close shm cq: Device or resource busy
libfabric:3197:1695321654::efa:ep_ctrl:efa_base_ep_create_qp():198<warn> ibv_create_qp failed. errno: 12
libfabric:3197:1695321654::efa:cq:rxr_endpoint():2582<warn> Unable to close shm cq: Device or resource busy
libfabric:3003:1695321654::efa:ep_ctrl:rxr_ep_ctrl():940<warn> libfabric 1.18.2 efa endpoint created! address: fi_addr_efa://[fe80::40d:68ff:fe4e:c605]:255:802381583
libfabric:2586:1695321654::efa:fabric:efa_fabric_close():87<warn> Unable to close fabric: Device or resource busy
libfabric:3197:1695321654::efa:fabric:efa_fabric_close():87<warn> Unable to close fabric: Device or resource busy
libfabric:2342:1695321654::efa:ep_ctrl:efa_base_ep_create_qp():198<warn> ibv_create_qp failed. errno: 12
libfabric:2342:1695321654::efa:cq:rxr_endpoint():2582<warn> Unable to close shm cq: Device or resource busy
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: jarvice-job-83512-xvtjl
Location: mtl_ofi_component.c:509
Error: Cannot allocate memory (12)
--------------------------------------------------------------------------
libfabric:2623:1695321654::efa:ep_ctrl:efa_base_ep_create_qp():198<warn> ibv_create_qp failed. errno: 12
libfabric:2623:1695321654::efa:cq:rxr_endpoint():2582<warn> Unable to close shm cq: Device or resource busy
libfabric:3347:1695321654::efa:ep_ctrl:efa_base_ep_create_qp():198<warn> ibv_create_qp failed. errno: 12
libfabric:3347:1695321654::efa:cq:rxr_endpoint():2582<warn> Unable to close shm cq: Device or resource busy
libfabric:2623:1695321654::efa:fabric:efa_fabric_close():87<warn> Unable to close fabric: Device or resource busy
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.
This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.
Host: jarvice-job-83512-xvtjl
Framework: pml
--------------------------------------------------------------------------
libfabric:2342:1695321654::efa:fabric:efa_fabric_close():87<warn> Unable to close fabric: Device or resource busy
libfabric:3347:1695321654::efa:fabric:efa_fabric_close():87<warn> Unable to close fabric: Device or resource busy
libfabric:2093:1695321654::efa:ep_ctrl:efa_base_ep_create_qp():198<warn> ibv_create_qp failed. errno: 12
libfabric:2093:1695321654::efa:cq:rxr_endpoint():2582<warn> Unable to close shm cq: Device or resource busy
[jarvice-job-83512-xvtjl:02342] PML cm cannot be selected
libfabric:2093:1695321654::efa:fabric:efa_fabric_close():87<warn> Unable to close fabric: Device or resource busy
[jarvice-job-83512-xvtjl:03347] PML cm cannot be selected
[jarvice-job-83512-xvtjl:01081] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198 (repeated many times)
...
[jarvice-job-83512-xvtjl:01081] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
libfabric:1919:1695321654::efa:ep_ctrl:efa_base_ep_create_qp():198<warn> ibv_create_qp failed. errno: 12
libfabric:1919:1695321654::efa:cq:rxr_endpoint():2582<warn> Unable to close shm cq: Device or resource busy
libfabric:1919:1695321654::efa:fabric:efa_fabric_close():87<warn> Unable to close fabric: Device or resource busy
libfabric:1919:1695321654::efa:ep_ctrl:efa_base_ep_create_qp():198<warn> ibv_create_qp failed. errno: 12
libfabric:1919:1695321654::efa:cq:rxr_endpoint():2582<warn> Unable to close shm cq: Device or resource busy (above 5 lines repeated many times)
libfabric:3407:1695321654::efa:ep_ctrl:efa_base_ep_create_qp():198<warn> ibv_create_qp failed. errno: 12
libfabric:3407:1695321654::efa:cq:rxr_endpoint():2582<warn> Unable to close shm cq: Device or resource busy
libfabric:3407:1695321654::efa:fabric:efa_fabric_close():87<warn> Unable to close fabric: Device or resource busy
libfabric:3407:1695321654::efa:ep_ctrl:efa_base_ep_create_qp():198<warn> ibv_create_qp failed. errno: 12
libfabric:3407:1695321654::efa:cq:rxr_endpoint():2582<warn> Unable to close shm cq: Device or resource busy
[jarvice-job-83512-xvtjl:01081] 66 more processes have sent help message help-mtl-ofi.txt / OFI call fail
[jarvice-job-83512-xvtjl:01081] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[jarvice-job-83512-xvtjl:01081] 1 more process has sent help message help-mca-base.txt / find-available:none found
fi_info -p efa
provider: efa
fabric: efa
domain: rdmap36s0-rdm
version: 118.20
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: efa
domain: rdmap36s0-dgrm
version: 118.20
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
That means you are hitting the ibv queue pair (QP) limit of 256 per EFA NIC. Open MPI opens 2 EFA endpoints (eps) per rank, and each ep costs 1 QP. So if you have more than 128 ranks per node sharing 1 EFA NIC (which is your case right now), you will hit this limit.
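The arithmetic behind that limit can be sketched as follows (the 256-QP-per-NIC and 2-endpoints-per-rank figures are the ones stated above, not something I have measured independently):

```python
# Back-of-the-envelope QP budget for EFA, using the numbers from this thread.
QP_LIMIT_PER_NIC = 256   # ibv QP limit per EFA NIC
QPS_PER_RANK = 2         # Open MPI opens 2 EFA endpoints per rank, 1 QP each

def max_ranks(nics: int) -> int:
    """Max ranks per node before ibv_create_qp fails with ENOMEM (errno 12)."""
    return nics * QP_LIMIT_PER_NIC // QPS_PER_RANK

print(max_ranks(1))  # 128: so 192 ranks on a node with one visible NIC fails
print(max_ranks(2))  # 256: attaching the second hpc7a NIC covers 192 ranks
```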
Did you launch hpc7a with all EFA network interfaces attached? Each hpc7a instance can attach 2 EFA NICs. If you do that, I believe the issue will be resolved.
It's not a bug but a known limitation of EFA. I think this issue can be closed if the workaround ^^ works for you.
Hello, thanks for the suggestions. We are only seeing one EFA device, and we are looking into why we are not able to attach the second one.
Are you attaching both network interfaces when launching the instance? Please see the example command here for the p5 instance, which has 32 interfaces. For hpc7a, you would do the same but with 2 interfaces.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/p5-efa.html
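Following the pattern in that p5 example, a launch command for hpc7a might look like this (a sketch: the AMI, key, subnet, security-group, and placement-group names are placeholders, and the flag layout mirrors the linked doc rather than a command I have run):

```shell
aws ec2 run-instances \
    --instance-type hpc7a.96xlarge \
    --count 1 \
    --image-id ami-PLACEHOLDER \
    --key-name my-key \
    --placement "GroupName=my-cluster-pg" \
    --network-interfaces \
      "NetworkCardIndex=0,DeviceIndex=0,Groups=sg-PLACEHOLDER,SubnetId=subnet-PLACEHOLDER,InterfaceType=efa" \
      "NetworkCardIndex=1,DeviceIndex=1,Groups=sg-PLACEHOLDER,SubnetId=subnet-PLACEHOLDER,InterfaceType=efa"
```

With both interfaces attached, fi_info -p efa should then list domains for both NICs.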
Hello,
Sorry for the late reply. We have a slightly different setup, but long story short, the second NIC was not being added. Once we got the second NIC enabled, everything worked fine.
Thank you for all of your help.
Describe the bug
Unable to scale past 128 cores on a single die.
To Reproduce
I built Open MPI from source with the following package versions.
The following command was used for building Open MPI: ./autogen.pl && ./configure --with-cma --disable-dlopen --prefix=/opt/JARVICE/openmpi --with-libfabric=/opt/JARVICE --with-hwloc=/opt/JARVICE --with-verbs=/opt/JARVICE --with-verbs-libdir=/opt/JARVICE/lib && make install
I am using openfoam-11 for testing: LD_LIBRARY_PATH=/opt/JARVICE/openmpi/lib:$LD_LIBRARY_PATH /opt/JARVICE/openmpi/bin/mpirun -np 192 -hostfile /etc/JARVICE/nodes -x FOAM_SETTINGS --map-by node --mca pml cm --mca mtl ofi /opt/OpenFOAM/OpenFOAM-11/bin/foamExec -prefix /opt/OpenFOAM foamRun -parallel
Expected behavior
To be able to run on more than 128 cores.
Output
See the libfabric debug logs posted in this thread.
Environment
OS: Ubuntu 20.04; provider: Amazon EFA; hardware: AMD EPYC 9R14.
Additional context
I also made an issue report on Open MPI: https://github.com/open-mpi/ompi/issues/11924
Please let me know if there is any other information that you would like me to provide. Thank you