Open MPI v4.x defaults to building a bunch of its functionality as dynamic shared objects (DSOs) -- i.e., plugins. With so many processes hammering your filesystem, individual MPI processes may be timing out or otherwise failing if the filesystem can't handle the load.
In such cases, you might want to build Open MPI without DSO functionality. E.g.:
./configure --disable-dlopen ...
This will build all of Open MPI's functionality into a small number of regular shared libraries, which should significantly reduce the load on your filesystem.
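For example, a typical build-and-install sequence with this option looks roughly like the following (the /opt/openmpi prefix is just a placeholder; adjust for your environment):
./configure --disable-dlopen --prefix=/opt/openmpi
make -j 8 all
make install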
Also, you might want to update to the latest Open MPI v4.1.x. As of this writing, it's v4.1.5.
Hello, I have tried re-building Open MPI with the --disable-dlopen flag, but I am still experiencing this issue. It got past the "PML cm cannot be selected" error and is now giving this error:
Open MPI failed an OFI Libfabric library call (fi_endpoint). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: jarvice-job-81594-7q92s
Location: mtl_ofi_component.c:509
Error: Cannot allocate memory (12)
Again, if I use 128 cores or fewer, everything works, but as soon as I go over 128 cores, it fails.
Below are the versions of the other dependencies that I am using.
LIBNL3_VER libnl3_7_0
RDMACORE_VER v46.1
LIBFABRIC_VER v1.18.0
HWLOC_VER hwloc-2.8.0
OPENMPI_VER v4.1.5
Thank you
@open-mpi/efa Can you guys chime in here?
@schrummy14 Thanks for reporting the issue! For completeness, could you provide more information on how you installed libfabric? If it was built from source, could you check the commit id?
Also, could you confirm the instance type, i.e., is it AMD EPYC 9R14 on an hpc7a.Nxlarge?
We strongly recommend that users install the software stack by following this tutorial: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html
Hello,
Yes, libfabric was built from source (8f3a881e08f56ee685d416436fc87fb6b00af332).
git clone -b v1.18.0 https://github.com/ofiwg/libfabric.git
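The rest of the build followed the usual libfabric from-source steps, roughly like this (the prefix and the EFA flag here are from memory, not the exact commands):
cd libfabric
./autogen.sh
./configure --prefix=/opt/libfabric --enable-efa
make -j && make install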
I did forget about ptrace. I will try setting that option and let you know the results.
The AMD machine is hpc7a.96xlarge
Thank you.
1.18.0 doesn't have a patch that is required to allow MPI jobs to run on EFA with more than 128 ranks per node. Please try Libfabric 1.18.2 or a newer version.
@schrummy14 Thank you for the information.
Do you have a specific reason to use that libfabric version? As @shijin-aws mentioned, the default limit was 128 in libfabric 1.18 and older. We recommend newer versions if possible. If you want to stay on 1.18 or older, please also set the environment variable FI_EFA_SHM_AV_SIZE=N, where N is the desired number of ranks per node.
Reference: https://github.com/ofiwg/libfabric/blob/main/man/fi_efa.7.md
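For example, with Open MPI you can export the variable to every rank via mpirun's -x option (the rank count and ./a.out below are placeholders for your job):
export FI_EFA_SHM_AV_SIZE=192
mpirun -np 192 -x FI_EFA_SHM_AV_SIZE ./a.out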
Hello,
No specific reason for 1.18.0. I believe I tried 1.19.0 and it failed for a different reason, so I just stepped down a minor version. I'll get things recompiled and try 1.18.2.
Thank you very much for the help.
@schrummy14 Cool. In that case I suggest moving the conversation to the libfabric community (including the issues on 1.19). So far this does not appear to be Open MPI related.
Sounds good. Thanks again.
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
4.1.2
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Built from source on Ubuntu 20.04
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
Please describe the system on which you are running
Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block.
When trying to start a program using more than 128 cores (-np 192), Open MPI fails with a "PML cm cannot be selected" error.
However, everything runs fine when using 128 cores or fewer.
By chance, is there an extra flag that needs to be passed when compiling Open MPI to allow for more than 128 cores on a single CPU?
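For reference, a minimal reproduction looks like this (./a.out stands in for any MPI program, e.g., a hello-world):
mpirun -np 128 ./a.out   # runs fine
mpirun -np 192 ./a.out   # fails with the error above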
Please let me know if there is any additional information that you would like me to provide.