open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Unable to find PML CM when trying to start more than 128 processes on a single CPU #11924

Closed: schrummy14 closed this issue 1 year ago

schrummy14 commented 1 year ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

4.1.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from source on Ubuntu 20.04

Details of the problem


When trying to start a program using more than 128 cores (-np 192), Open MPI fails with the error:

--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      jarvice-job-81316-ntqlx
  Framework: pml
--------------------------------------------------------------------------
[jarvice-job-81316-ntqlx:157396] PML cm cannot be selected
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      jarvice-job-81316-ntqlx
  Framework: pml
--------------------------------------------------------------------------
[warn] Epoll MOD(1) on fd 213 failed.  Old events were 6; read change was 0 (none); write change was 2 (del): Bad file descriptor
[warn] Epoll MOD(4) on fd 213 failed.  Old events were 6; read change was 2 (del); write change was 0 (none): Bad file descriptor

However, everything runs fine when using 128 cores or fewer.

By chance, is there an extra flag that needs to be passed when compiling Open MPI to allow more than 128 cores on a single CPU?

Please let me know if there is any additional information that you would like me to provide.

jsquyres commented 1 year ago

Open MPI v4.x defaults to building a bunch of its functionality as dynamic shared objects (DSOs) -- i.e., plugins. With so many processes hammering your filesystem, individual MPI processes may be timing out or otherwise failing if the filesystem can't handle the load.

In such cases, you might want to build Open MPI without DSO functionality. E.g.:

./configure --disable-dlopen ...

This will build all of Open MPI's functionality into a small number of regular shared libraries, which should significantly reduce the load on your filesystem.
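
For reference, a minimal rebuild along those lines might look like the sketch below; the tarball version and install prefix are only placeholders, not values taken from this thread:

```shell
# Sketch of a from-source build with --disable-dlopen, so components are
# linked into the main libraries instead of being loaded as plugins at runtime.
# The version and prefix below are placeholders.
tar xf openmpi-4.1.5.tar.bz2
cd openmpi-4.1.5
./configure --prefix=/opt/openmpi-4.1.5 --disable-dlopen
make -j"$(nproc)" all
sudo make install
```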

Also, you might want to update to the latest Open MPI v4.1.x. As of this writing, it's v4.1.5.

schrummy14 commented 1 year ago

Hello, I have tried rebuilding Open MPI with the --disable-dlopen flag, but I am still experiencing this issue. It did get past the "PML cm cannot be selected" error and is now giving the error:

Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: jarvice-job-81594-7q92s
  Location: mtl_ofi_component.c:509
  Error: Cannot allocate memory (12)

Again, if I use 128 cores or fewer, everything works, but as soon as I go over 128 cores, it fails.

Below are the versions of the other dependencies that I am using.

LIBNL3_VER      libnl3_7_0
RDMACORE_VER    v46.1
LIBFABRIC_VER   v1.18.0
HWLOC_VER       hwloc-2.8.0
OPENMPI_VER     v4.1.5

Thank you

jsquyres commented 1 year ago

@open-mpi/efa Can you guys chime in here?

wenduwan commented 1 year ago

@schrummy14 Thanks for reporting the issue! For completeness, could you provide more information on how you installed libfabric? If it was built from source, could you check the commit ID?

Also, could you confirm the instance type, i.e., is this an AMD EPYC 9R14 on hpc7a.Nxlarge?

We strongly recommend that users install the software stack by following this tutorial: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html

schrummy14 commented 1 year ago

Hello. Yes, libfabric was built from source (commit 8f3a881e08f56ee685d416436fc87fb6b00af332), cloned with git clone -b v1.18.0 https://github.com/ofiwg/libfabric.git

I did forget about ptrace. I will try setting that option and let you know the results.

The AMD machine is an hpc7a.96xlarge.

Thank you.

shijin-aws commented 1 year ago

Libfabric 1.18.0 doesn't have a patch that is required to allow MPI jobs to run over EFA with more than 128 ranks per node. Please try Libfabric 1.18.2 or a newer version.
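
A sketch of rebuilding against the newer release, assuming a from-source install like the one described earlier in this thread (the install prefix and the Open MPI configure line are only placeholders), could look like:

```shell
# Rebuild libfabric at the v1.18.2 tag; the install prefix is a placeholder.
git clone -b v1.18.2 https://github.com/ofiwg/libfabric.git
cd libfabric
./autogen.sh
./configure --prefix=/opt/libfabric-1.18.2
make -j"$(nproc)"
sudo make install

# Then reconfigure Open MPI against it, e.g.:
#   ./configure --with-ofi=/opt/libfabric-1.18.2 --disable-dlopen ...
```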

wenduwan commented 1 year ago

@schrummy14 Thank you for the information.

Do you have a specific reason to use that libfabric version? As @shijin-aws mentioned, the default limit was 128 in libfabric 1.18 and older. We recommend newer versions if possible. If you want to stay on 1.18 or older, please also set the environment variable FI_EFA_SHM_AV_SIZE=N, where N is the desired number of ranks per node.

Reference: https://github.com/ofiwg/libfabric/blob/main/man/fi_efa.7.md
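
For example, with an older libfabric one might set the variable and forward it to the ranks through mpirun; the rank count and application name below are just illustrative:

```shell
# Workaround for libfabric 1.18.x and older: raise the EFA SHM address-vector
# size to at least the number of ranks per node (192 here is illustrative).
export FI_EFA_SHM_AV_SIZE=192
mpirun -np 192 -x FI_EFA_SHM_AV_SIZE ./my_app   # ./my_app is a placeholder
```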

schrummy14 commented 1 year ago

Hello,

No specific reason for 1.18.0. I believe I tried 1.19.0 and it didn't work with something else, so I just dropped back a minor version. I'll get things recompiled and try 1.18.2.

Thank you very much for the help.

wenduwan commented 1 year ago

@schrummy14 Cool. In that case I suggest moving the conversation to the libfabric community (including the issues with 1.19). So far this does not appear to be Open MPI related.

https://github.com/ofiwg/libfabric/issues

schrummy14 commented 1 year ago

Sounds good. Thanks again.