ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

Unable to allocate memory when using CPU with more than 128 cores #9356

Closed schrummy14 closed 11 months ago

schrummy14 commented 12 months ago

Describe the bug: Unable to scale past 128 cores on a single die.

To Reproduce: I built Open MPI from source with the following package versions:

LIBNL3_VER      libnl3_7_0
RDMACORE_VER    v46.1
LIBFABRIC_VER   v1.18.2
HWLOC_VER       hwloc-2.8.0
OPENMPI_VER     v4.1.5

The following command was used to build Open MPI: ./autogen.pl && ./configure --with-cma --disable-dlopen --prefix=/opt/JARVICE/openmpi --with-libfabric=/opt/JARVICE --with-hwloc=/opt/JARVICE --with-verbs=/opt/JARVICE --with-verbs-libdir=/opt/JARVICE/lib && make install

I am using OpenFOAM-11 for testing: LD_LIBRARY_PATH=/opt/JARVICE/openmpi/lib:$LD_LIBRARY_PATH /opt/JARVICE/openmpi/bin/mpirun -np 192 -hostfile /etc/JARVICE/nodes -x FOAM_SETTINGS --map-by node --mca pml cm --mca mtl ofi /opt/OpenFOAM/OpenFOAM-11/bin/foamExec -prefix /opt/OpenFOAM foamRun -parallel

Expected behavior: Be able to run on more than 128 cores.

Output:

Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: jarvice-job-81594-7q92s
  Location: mtl_ofi_component.c:509
  Error: Cannot allocate memory (12)

Environment: Ubuntu 20.04, Amazon EFA provider, AMD EPYC 9R14

Additional context: I also filed an issue report on Open MPI: https://github.com/open-mpi/ompi/issues/11924

Please let me know if there is any other information you would like me to provide. Thank you.

wenduwan commented 12 months ago

@schrummy14 Thanks for moving the conversation. You are in the right hands 😄

First of all, we need to understand the Cannot allocate memory (12) error.

Could you export FI_LOG_LEVEL=warn in the job and report the relevant logs?

Based on the ompi issue I imagine you have already included a new-enough libfabric. Could you post the result of git grep EFA_SHM_NAME_MAX?

A good-to-know thing would be the output of /opt/JARVICE/bin/fi_info -p efa, to make sure both EFA NICs are discoverable.
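
Something along these lines should cover all three checks (a sketch only; the libfabric source path below is a placeholder for wherever you built it from):

# forward the log level to every rank for the failing run
mpirun -x FI_LOG_LEVEL=warn -np 192 -hostfile /etc/JARVICE/nodes ... foamRun -parallel

# in the libfabric source tree used for the build (placeholder path)
cd /path/to/libfabric && git grep EFA_SHM_NAME_MAX

# check which EFA NICs libfabric can discover
/opt/JARVICE/bin/fi_info -p efa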

shijin-aws commented 12 months ago

^^ It's not EFA_SHM_NAME_MAX; it should be shm_av_size. You should see it set to 256 by default:

(venv) (venv) [ec2-user@ip-172-31-18-57 libfabric]$ git grep shm_av_size
NEWS.md:- Increase default shm_av_size to 256
prov/efa/src/efa_av.c:          if (av->shm_used >= rxr_env.shm_av_size) {
prov/efa/src/efa_av.c:                           rxr_env.shm_av_size);
prov/efa/src/efa_av.c:          assert(peer->shm_fiaddr < rxr_env.shm_av_size);
prov/efa/src/efa_av.c:                  assert(peer->shm_fiaddr < rxr_env.shm_av_size);
prov/efa/src/efa_av.c:                  if (rxr_env.shm_av_size > EFA_SHM_MAX_AV_COUNT) {
prov/efa/src/efa_av.c:                  av_attr.count = rxr_env.shm_av_size;
prov/efa/src/rdm/rxr_env.c:     .shm_av_size = 256,
prov/efa/src/rdm/rxr_env.c:     fi_param_get_int(&efa_prov, "shm_av_size", &rxr_env.shm_av_size);
prov/efa/src/rdm/rxr_env.c:     fi_param_define(&efa_prov, "shm_av_size", FI_PARAM_INT,
prov/efa/src/rdm/rxr_env.h:     int shm_av_size;

The other points made by @wenduwan totally make sense. Can you run with -x FI_LOG_LEVEL=warn to get the log showing why fi_endpoint failed with Cannot allocate memory?
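
If you want to double-check the value your build actually picked up at runtime, the parameter is exposed as an environment variable (FI_EFA_SHM_AV_SIZE, following the usual FI_<provider>_<name> naming); a quick sketch:

# list libfabric runtime variables and filter for the EFA shm AV size
/opt/JARVICE/bin/fi_info -e | grep -i shm_av_size

# or set it explicitly for a run (256 is just the current upstream default)
mpirun -x FI_EFA_SHM_AV_SIZE=256 ...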

schrummy14 commented 12 months ago

Looks like there is a difference between the two commands: shm_av_size looks to be set to 128 in 1.18.2 (at least if you run git clone -b v1.18.2 https://github.com/ofiwg/libfabric.git). Going to try again setting the branch to 1.18.x, which does give the correct 256 value.

git grep shm_av_size
prov/efa/src/efa_av.c:          if (av->shm_used >= rxr_env.shm_av_size) {
prov/efa/src/efa_av.c:                           rxr_env.shm_av_size);
prov/efa/src/efa_av.c:          assert(peer->shm_fiaddr < rxr_env.shm_av_size);
prov/efa/src/efa_av.c:                  assert(peer->shm_fiaddr < rxr_env.shm_av_size);
prov/efa/src/efa_av.c:                  if (rxr_env.shm_av_size > EFA_SHM_MAX_AV_COUNT) {
prov/efa/src/efa_av.c:                  av_attr.count = rxr_env.shm_av_size;
prov/efa/src/rdm/rxr_env.c:     .shm_av_size = 128,
prov/efa/src/rdm/rxr_env.c:     fi_param_get_int(&efa_prov, "shm_av_size", &rxr_env.shm_av_size);
prov/efa/src/rdm/rxr_env.c:     fi_param_define(&efa_prov, "shm_av_size", FI_PARAM_INT,
prov/efa/src/rdm/rxr_env.h:     int shm_av_size;
git grep EFA_SHM_NAME_MAX
prov/efa/src/efa_av.c:  char smr_name[EFA_SHM_NAME_MAX];
prov/efa/src/efa_av.c:          smr_name_len = EFA_SHM_NAME_MAX;
prov/efa/src/efa_shm.h:#define EFA_SHM_NAME_MAX    (256)
prov/efa/src/rdm/rxr_ep.c:      char shm_ep_name[EFA_SHM_NAME_MAX], ep_addr_str[OFI_ADDRSTRLEN];
prov/efa/src/rdm/rxr_ep.c:                      shm_ep_name_len = EFA_SHM_NAME_MAX;
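
(The grep output above is from the v1.18.2 tag; for comparison, the same check against the 1.18.x branch is roughly the following, assuming the release branch is named v1.18.x.)

git clone -b v1.18.x https://github.com/ofiwg/libfabric.git libfabric-1.18.x
cd libfabric-1.18.x && git grep shm_av_size prov/efa/src/rdm/rxr_env.c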

log info

mpirun -np 192 -hostfile /etc/JARVICE/nodes -x FOAM_SETTINGS --map-by node --mca pml cm --mca mtl ofi /opt/OpenFOAM/OpenFOAM-11/bin/foamExec -prefix /opt/OpenFOAM foamRun -parallel
libfabric:1090:1695321637::efa:ep_ctrl:rxr_ep_ctrl():940<warn> libfabric 1.18.2 efa endpoint created! address: fi_addr_efa://[fe80::40d:68ff:fe4e:c605]:0:1610828443 (Repeated many times)
...
libfabric:2412:1695321654::efa:ep_ctrl:rxr_ep_ctrl():940<warn> libfabric 1.18.2 efa endpoint created! address: fi_addr_efa://[fe80::40d:68ff:fe4e:c605]:254:1163041373
libfabric:2586:1695321654::efa:ep_ctrl:efa_base_ep_create_qp():198<warn> ibv_create_qp failed. errno: 12
libfabric:2586:1695321654::efa:cq:rxr_endpoint():2582<warn> Unable to close shm cq: Device or resource busy
libfabric:3197:1695321654::efa:ep_ctrl:efa_base_ep_create_qp():198<warn> ibv_create_qp failed. errno: 12
libfabric:3197:1695321654::efa:cq:rxr_endpoint():2582<warn> Unable to close shm cq: Device or resource busy
libfabric:3003:1695321654::efa:ep_ctrl:rxr_ep_ctrl():940<warn> libfabric 1.18.2 efa endpoint created! address: fi_addr_efa://[fe80::40d:68ff:fe4e:c605]:255:802381583
libfabric:2586:1695321654::efa:fabric:efa_fabric_close():87<warn> Unable to close fabric: Device or resource busy
libfabric:3197:1695321654::efa:fabric:efa_fabric_close():87<warn> Unable to close fabric: Device or resource busy
libfabric:2342:1695321654::efa:ep_ctrl:efa_base_ep_create_qp():198<warn> ibv_create_qp failed. errno: 12
libfabric:2342:1695321654::efa:cq:rxr_endpoint():2582<warn> Unable to close shm cq: Device or resource busy
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: jarvice-job-83512-xvtjl
  Location: mtl_ofi_component.c:509
  Error: Cannot allocate memory (12)
--------------------------------------------------------------------------
libfabric:2623:1695321654::efa:ep_ctrl:efa_base_ep_create_qp():198<warn> ibv_create_qp failed. errno: 12
libfabric:2623:1695321654::efa:cq:rxr_endpoint():2582<warn> Unable to close shm cq: Device or resource busy
libfabric:3347:1695321654::efa:ep_ctrl:efa_base_ep_create_qp():198<warn> ibv_create_qp failed. errno: 12
libfabric:3347:1695321654::efa:cq:rxr_endpoint():2582<warn> Unable to close shm cq: Device or resource busy
libfabric:2623:1695321654::efa:fabric:efa_fabric_close():87<warn> Unable to close fabric: Device or resource busy
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      jarvice-job-83512-xvtjl
  Framework: pml
--------------------------------------------------------------------------
libfabric:2342:1695321654::efa:fabric:efa_fabric_close():87<warn> Unable to close fabric: Device or resource busy
libfabric:3347:1695321654::efa:fabric:efa_fabric_close():87<warn> Unable to close fabric: Device or resource busy
libfabric:2093:1695321654::efa:ep_ctrl:efa_base_ep_create_qp():198<warn> ibv_create_qp failed. errno: 12
libfabric:2093:1695321654::efa:cq:rxr_endpoint():2582<warn> Unable to close shm cq: Device or resource busy
[jarvice-job-83512-xvtjl:02342] PML cm cannot be selected
libfabric:2093:1695321654::efa:fabric:efa_fabric_close():87<warn> Unable to close fabric: Device or resource busy
[jarvice-job-83512-xvtjl:03347] PML cm cannot be selected
[jarvice-job-83512-xvtjl:01081] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198 (repeated many times)
...
[jarvice-job-83512-xvtjl:01081] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
libfabric:1919:1695321654::efa:ep_ctrl:efa_base_ep_create_qp():198<warn> ibv_create_qp failed. errno: 12
libfabric:1919:1695321654::efa:cq:rxr_endpoint():2582<warn> Unable to close shm cq: Device or resource busy
libfabric:1919:1695321654::efa:fabric:efa_fabric_close():87<warn> Unable to close fabric: Device or resource busy
libfabric:1919:1695321654::efa:ep_ctrl:efa_base_ep_create_qp():198<warn> ibv_create_qp failed. errno: 12
libfabric:1919:1695321654::efa:cq:rxr_endpoint():2582<warn> Unable to close shm cq: Device or resource busy (above 5 lines repeated many times)
libfabric:3407:1695321654::efa:ep_ctrl:efa_base_ep_create_qp():198<warn> ibv_create_qp failed. errno: 12
libfabric:3407:1695321654::efa:cq:rxr_endpoint():2582<warn> Unable to close shm cq: Device or resource busy
libfabric:3407:1695321654::efa:fabric:efa_fabric_close():87<warn> Unable to close fabric: Device or resource busy
libfabric:3407:1695321654::efa:ep_ctrl:efa_base_ep_create_qp():198<warn> ibv_create_qp failed. errno: 12
libfabric:3407:1695321654::efa:cq:rxr_endpoint():2582<warn> Unable to close shm cq: Device or resource busy
[jarvice-job-83512-xvtjl:01081] 66 more processes have sent help message help-mtl-ofi.txt / OFI call fail
[jarvice-job-83512-xvtjl:01081] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[jarvice-job-83512-xvtjl:01081] 1 more process has sent help message help-mca-base.txt / find-available:none found
fi_info -p efa
provider: efa
    fabric: efa
    domain: rdmap36s0-rdm
    version: 118.20
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: efa
    domain: rdmap36s0-dgrm
    version: 118.20
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
shijin-aws commented 12 months ago

That means you are reaching the ibv queue pair (QP) limit (256) per EFA NIC. Open MPI will open 2 EFA endpoints (ep) per rank, and each ep costs 1 QP. So if you have > 128 ranks per node consuming 1 EFA NIC (which is your case right now), you will hit this limit.

Did you launch hpc7a with all EFA network interfaces attached? Each hpc7a instance can attach 2 EFA NICs. If you do that, I believe the issue will be resolved.
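
A rough back-of-the-envelope, assuming ranks end up balanced across whatever NICs are attached:

1 EFA NIC:  129+ ranks per node x 2 QPs per rank = 258+ QPs, over the 256 per-NIC limit -> ibv_create_qp fails with errno 12
2 EFA NICs: 192 ranks per node split ~96 per NIC x 2 QPs per rank = ~192 QPs per NIC, under the limit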

shijin-aws commented 11 months ago

It's not a bug but a known limitation of EFA. I think this issue can be closed if the workaround ^^ works for you.

schrummy14 commented 11 months ago

Hello, thanks for the suggestions. We are only seeing one EFA device and are looking into why we are not able to attach the second one.
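
For reference, the checks we are using to count EFA devices on the instance (a rough sketch; any of these should show two devices once the second interface is attached):

# PCI view of the attached EFA adapters
lspci | grep -i "Elastic Fabric"

# rdma-core view of the EFA devices
ibv_devinfo -l

# libfabric view (one rdm/dgrm domain pair per NIC)
/opt/JARVICE/bin/fi_info -p efa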

sunkuamzn commented 11 months ago

Are you attaching both network interfaces when launching the instance? Please see the example command here for the p5 instance type, which has 32 interfaces. For hpc7a, you would need to do the same but with 2 interfaces.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/p5-efa.html
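
For hpc7a that would look roughly like the sketch below (adapted from the p5 example on that page; the instance size, AMI, key, subnet, and security group values are placeholders to replace with your own):

aws ec2 run-instances \
    --image-id ami-xxxxxxxxxxxxxxxxx \
    --instance-type hpc7a.96xlarge \
    --count 1 \
    --key-name my-key \
    --network-interfaces \
        "NetworkCardIndex=0,DeviceIndex=0,Groups=sg-xxxxxxxx,SubnetId=subnet-xxxxxxxx,InterfaceType=efa" \
        "NetworkCardIndex=1,DeviceIndex=1,Groups=sg-xxxxxxxx,SubnetId=subnet-xxxxxxxx,InterfaceType=efa"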

schrummy14 commented 11 months ago

Hello,

Sorry for the late reply. We have a slightly different setup, but long story short, the second NIC was not being added. Once we were able to get the second NIC enabled, everything worked fine.

Thank you for all of your help.