amirsojoodi opened this issue 1 year ago
Also, the issue persists even after applying the solution discussed here, i.e. adding `LIBS="-lucm -lucs"` to the ompi configure command.
Interestingly, setting `LDFLAGS` before running configure resolved the problem.
Shouldn't it automatically look in this directory for libs? 🤔
```shell
export LDFLAGS="-L$BUILD_DIR/lib"
./configure --prefix=$BUILD_DIR \
  --disable-io-romio \
  --disable-io-ompio \
  --disable-mpi-fortran \
  --disable-oshmem \
  --enable-mca-no-build=btl-portals4,coll-hcoll \
  --with-cuda=$CUDA_HOME \
  --with-devel-headers \
  --with-hwloc=internal \
  --with-libevent=internal \
  --with-pmix=internal \
  --with-prrte=internal \
  --enable-mca-dso=coll-cuda \
  --enable-mca-static=coll-cuda \
  --with-ucx=$BUILD_DIR 2>&1 | tee config-release.out
```
> Shouldn't it automatically look in this directory for libs? 🤔

Yes.
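As a quick sanity check, one can also verify by hand that the prefix passed to `--with-ucx` actually contains the UCX libraries the configure test links against. The helper below is hypothetical (not part of OMPI or UCX), just a sketch of that check:

```shell
# Hypothetical helper: check that a UCX install prefix contains the
# libraries OMPI's configure test links against (-lucp -luct -lucs -lucm),
# i.e. what --with-ucx=$BUILD_DIR is expected to find.
check_ucx_libs() {
  dir="$1"
  missing=0
  for lib in libucp libuct libucs libucm; do
    # Match .so, .a, .dylib, etc.
    ls "$dir"/lib/"$lib".* >/dev/null 2>&1 || { echo "missing $lib in $dir/lib"; missing=1; }
  done
  [ "$missing" -eq 0 ] && echo "all UCX libs present in $dir/lib"
}

# Usage: check_ucx_libs "$BUILD_DIR"
```

If any library is reported missing, configure cannot succeed no matter what `LDFLAGS` is set to.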
@open-mpi/ucx please have a look.
@amirsojoodi I've tried the above commands and it worked OK for me (on CentOS 7.9). Can you please post the output of

```shell
cd ompi
grep pml_ucx config.status
```
@yosefe: Thanks for the follow up.
```
$ grep pml_ucx config.status
S["MCA_oshmem_spml_STATIC_LTLIBS"]="mca/spml/ucx/libmca_spml_ucx.la "
S["MCA_BUILD_oshmem_spml_ucx_DSO_FALSE"]=""
S["MCA_BUILD_oshmem_spml_ucx_DSO_TRUE"]="#"
S["spml_ucx_LIBS"]="-lucp -luct -lucs -lucm "
S["spml_ucx_LDFLAGS"]=""
S["spml_ucx_CPPFLAGS"]=""
S["MCA_ompi_pml_STATIC_LTLIBS"]="mca/pml/v/libmca_pml_v.la mca/pml/ucx/libmca_pml_ucx.la mca/pml/ob1/libmca_pml_ob1.la mca/pml/cm/libmca_pml_cm.la "
S["MCA_BUILD_ompi_pml_ucx_DSO_FALSE"]=""
S["MCA_BUILD_ompi_pml_ucx_DSO_TRUE"]="#"
S["pml_ucx_LIBS"]="-lucp -luct -lucs -lucm "
S["pml_ucx_LDFLAGS"]=""
S["pml_ucx_CPPFLAGS"]=""
```
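The `pml_ucx_LIBS` entry above shows configure decided to link `-lucp -luct -lucs -lucm`, while `pml_ucx_LDFLAGS` is empty, which is consistent with the linker not being told where those libraries live. After a successful install, one way to confirm the component resolves the intended UCX build is `ldd`; the component path below is an assumption based on the default dynamic (DSO) layout:

```shell
# Check which UCX libraries the UCX PML component actually resolves to.
# Path assumes the default DSO install under $BUILD_DIR/lib/openmpi;
# adjust if MCA components were built static.
ldd "$BUILD_DIR"/lib/openmpi/mca_pml_ucx.so | grep -E 'libuc[mpst]'
```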
I am on a PowerPC machine with Red Hat 8.4.
BTW, even now that I can build ompi, I get weird segfaults. I'll post them in the next comment.
```
$ mpirun --mca pml ucx --mca btl ^smcuda,vader,openib,uct \
    -x UCX_TLS=rc,sm,cuda_copy,gdr_copy,cuda_ipc \
    -np 2 $BUILD_DIR/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw \
    --window-size 1 -m 2097152:67108864 H H
[mist-login01:4176932:0:4176932] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x11)
==== backtrace (tid:4176932) ====
=================================
--------------------------------------------------------------------------
prterun noticed that process rank 1 with PID 0 on node mist-login01 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
```
The error doesn't change with CUDA benchmarks. It seems there is a problem in MPI_Init. I tried `MPI_THREAD_SINGLE`, too, but no luck. Also, I rebuilt UCX without `--enable-mt`; no luck again.
However, changing pml from `ucx` to `ob1` somehow works:
```shell
mpirun --mca pml ob1 --mca btl '^vader,tcp,openib,uct' -np 2 \
  /project/q/queenspp/sojoodi/OpenMPI-Release/build/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw \
  --window-size 1 -m 2097152:67108864 H H
```
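For the original `pml ucx` crash, the empty backtrace block can sometimes be filled in by raising UCX's log level and letting its error handler print a backtrace or freeze the process for a debugger. `UCX_LOG_LEVEL` and `UCX_HANDLE_ERRORS` are standard UCX runtime parameters; the benchmark invocation below is the one from above:

```shell
# Re-run with verbose UCX logging; UCX_HANDLE_ERRORS=bt,freeze asks UCX
# to print a backtrace on the fatal signal and then freeze the process
# so it can be attached to with gdb.
mpirun --mca pml ucx -np 2 \
  -x UCX_LOG_LEVEL=debug \
  -x UCX_HANDLE_ERRORS=bt,freeze \
  $BUILD_DIR/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw \
  --window-size 1 -m 2097152:67108864 H H
```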
@amirsojoodi I've tried on CentOS 8.4 and it works fine for me. Maybe there is an older version of UCX installed on your system? Can you please upload (or email me) the config.log and config-release.out files from OMPI?
@yosefe Sorry for the late reply, Yossi.
I finally got it to work with UCX. I had to explicitly disable a bunch of modules. I don't know exactly which one fixed the issue, but I'll provide the commands here in case they help:
```shell
git clone --recursive https://github.com/openucx/ucx.git
cd ucx
git checkout v1.13.0
git submodule update --init --recursive
./autogen.sh 2>&1 | tee autogen.out
./configure --prefix=$BUILD_DIR \
  --with-cuda=$CUDA_HOME \
  --disable-assertions \
  --disable-debug \
  --disable-params-check \
  --without-knem \
  --without-xpmem \
  --without-ofi \
  --with-mlx5-dv \
  --enable-logging \
  --enable-compiler-opt=3 2>&1 | tee config-release.out
make -j32 all 2>&1 | tee make-release.out
make -j32 install 2>&1 | tee install-release.out
```
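Before building OMPI on top of this install, it may be worth confirming what actually got built; `ucx_info` ships with UCX and reports the version, configure flags, and available transports:

```shell
# Report the UCX version and the flags it was configured with.
$BUILD_DIR/bin/ucx_info -v
# List available transports/devices; check that the CUDA ones are present.
$BUILD_DIR/bin/ucx_info -d | grep -i cuda
```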
```shell
git clone --recursive https://github.com/open-mpi/ompi.git
cd ompi
git checkout v5.0.0rc9
git submodule update --init --recursive
perl autogen.pl -j 32 2>&1 | tee autogen.out
./configure --prefix=$BUILD_DIR \
  --disable-io-romio \
  --disable-io-ompio \
  --disable-mpi-fortran \
  --disable-oshmem \
  --enable-mca-no-build=btl-uct,btl-portals4,btl-ofi \
  --without-ofi \
  --without-portals4 \
  --without-ugni \
  --without-knem \
  --with-cuda=$CUDA_HOME \
  --with-cuda-libdir=$CUDA_COMPAT_PATH \
  --with-devel-headers \
  --with-hwloc=internal \
  --with-libevent=internal \
  --with-pmix=internal \
  --with-prrte=internal \
  --with-ucx=$BUILD_DIR \
  --with-ucx-libdir=$BUILD_DIR/lib 2>&1 | tee config-release.out
make -j32 all 2>&1 | tee make-release.out
make -j32 install 2>&1 | tee install-release.out
```
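Once installed, `ompi_info` can confirm that this OMPI build actually picked up the UCX PML (both commands are standard `ompi_info` usage):

```shell
# Quick check that the ucx components are present in this install.
$BUILD_DIR/bin/ompi_info | grep -i ucx
# Show the UCX PML's MCA parameters at full verbosity.
$BUILD_DIR/bin/ompi_info --param pml ucx --level 9
```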
I am really tired of this right now; if I get a chance to figure out which change exactly fixed the issue, I'll post a follow-up comment/issue. Thanks for the help @yosefe and @jsquyres!
@yosefe: As an update, I used UCX v1.12.1 and the previous configs just worked fine. I had to disable hcoll at runtime, but other than that everything was fine, CUDA/host pt2pt/coll.
Updating UCX from 1.12.1 to 1.13.1 or newer caused this weird error (similar to the previous one):
```
[mist-login01:3653042:0:3653042] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x11)
==== backtrace (tid:3653042) ====
=================================
[mist-login01:3653043:0:3653043] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x11)
==== backtrace (tid:3653043) ====
=================================
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 0 on node mist-login01 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
```
Maybe a GCC (10.3.0) or CUDA (11.2.2) version mismatch... no idea. Anyway, I don't know whether I should close this issue or not, so I'll leave it open.
@amirsojoodi when you updated UCX did you also rebuild OpenMPI on top of it?
Yes I did.
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
UCX is built successfully with:
Ompi:
If you are building/installing from a git clone, please copy-n-paste the output from `git submodule status`.
Please describe the system on which you are running
Details of the problem
Ompi build fails at `make` with this error message, complaining about unresolved dependencies: