open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Open MPI make fails with UCX, undefined reference to `ucp_tag_recv_nbx' #11366

amirsojoodi opened this issue 1 year ago

amirsojoodi commented 1 year ago

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

UCX is built successfully with:

git clone --recursive https://github.com/openucx/ucx.git
cd ucx
git checkout v1.13.0
git submodule update

./autogen.sh 2>&1 | tee autogen.out

./configure --prefix=$BUILD_DIR \
  --with-cuda=$CUDA_HOME \
  --disable-assertions \
  --disable-debug \
  --disable-logging \
  --disable-params-check \
  --enable-compiler-opt=3 \
  --enable-devel-headers \
  --enable-mt \
  --enable-optimizations 2>&1 | tee config-release.out

make -j32 all 2>&1 | tee make-release.out
make -j32 install 2>&1 | tee install-release.out

Open MPI is built with:

git clone --recursive https://github.com/open-mpi/ompi.git
cd ompi
git checkout v5.0.0rc9
git submodule update

perl autogen.pl --no-oshmem 2>&1 | tee autogen.out

./configure --prefix=$BUILD_DIR \
  --disable-io-romio \
  --disable-io-ompio \
  --disable-mpi-fortran \
  --disable-oshmem \
  --enable-mca-no-build=btl-portals4,coll-hcoll \
  --with-cuda=$CUDA_HOME \
  --with-devel-headers \
  --with-hwloc=internal \
  --with-libevent=internal \
  --with-pmix=internal \
  --with-prrte=internal \
  --enable-mca-dso=coll-cuda \
  --enable-mca-static=coll-cuda \
  --with-ucx=$BUILD_DIR 2>&1 | tee config-release.out

make -j32 all 2>&1 | tee make-release.out
make -j32 install 2>&1 | tee install-release.out

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

$ git submodule status
 b22eca2c462b61533572634a0abbf212f283578d 3rd-party/openpmix (v4.2.2rc2-1-gb22eca2c)
 ab03675e5a9014418418555ceb188d2573713870 3rd-party/prrte (v3.0.0rc3-1-gab03675e5a)

Please describe the system on which you are running


Details of the problem

The Open MPI build fails at make with the following error message, complaining about unresolved references:

Making all in tools/ompi_info
make[2]: Entering directory '/gpfs/fs0/project/q/queenspp/sojoodi/OpenMPI-Release/ompi/ompi/tools/ompi_info'
  CC       ompi_info.o
  CC       param.o
  CCLD     ompi_info
/project/q/queenspp/sojoodi/OpenMPI-Release/ompi/opal/.libs/libopen-pal.so: undefined reference to `ucm_test_external_events'
../../../ompi/.libs/libmpi.so: undefined reference to `ucp_tag_recv_nbx'
../../../ompi/.libs/libmpi.so: undefined reference to `ucp_tag_send_nbx'
collect2: error: ld returned 1 exit status
make[2]: *** [Makefile:1356: ompi_info] Error 1
make[2]: Leaving directory '/gpfs/fs0/project/q/queenspp/sojoodi/OpenMPI-Release/ompi/ompi/tools/ompi_info'
make[1]: *** [Makefile:2682: all-recursive] Error 1
make[1]: Leaving directory '/gpfs/fs0/project/q/queenspp/sojoodi/OpenMPI-Release/ompi/ompi'
make: *** [Makefile:1409: all-recursive] Error 1
amirsojoodi commented 1 year ago

Also, the issue persists after applying the solution discussed here, i.e. adding LIBS="-lucm -lucs" to the Open MPI configure command.

amirsojoodi commented 1 year ago

Interestingly, setting LDFLAGS before running configure resolved the problem.

Shouldn't it automatically look in this directory for libs? 🤔

export LDFLAGS="-L$BUILD_DIR/lib"

./configure --prefix=$BUILD_DIR \
  --disable-io-romio \
  --disable-io-ompio \
  --disable-mpi-fortran \
  --disable-oshmem \
  --enable-mca-no-build=btl-portals4,coll-hcoll \
  --with-cuda=$CUDA_HOME \
  --with-devel-headers \
  --with-hwloc=internal \
  --with-libevent=internal \
  --with-pmix=internal \
  --with-prrte=internal \
  --enable-mca-dso=coll-cuda \
  --enable-mca-static=coll-cuda \
  --with-ucx=$BUILD_DIR 2>&1 | tee config-release.out
jsquyres commented 1 year ago

Shouldn't it automatically look in this directory for libs? 🤔

Yes.

@open-mpi/ucx please have a look.

yosefe commented 1 year ago

@amirsojoodi I've tried the above commands and they worked OK for me (on CentOS 7.9). Can you please post the output of:

cd ompi
grep pml_ucx config.status
amirsojoodi commented 1 year ago

@yosefe: Thanks for the follow-up.

$ grep pml_ucx config.status
S["MCA_oshmem_spml_STATIC_LTLIBS"]="mca/spml/ucx/libmca_spml_ucx.la "
S["MCA_BUILD_oshmem_spml_ucx_DSO_FALSE"]=""
S["MCA_BUILD_oshmem_spml_ucx_DSO_TRUE"]="#"
S["spml_ucx_LIBS"]="-lucp -luct -lucs -lucm "
S["spml_ucx_LDFLAGS"]=""
S["spml_ucx_CPPFLAGS"]=""
S["MCA_ompi_pml_STATIC_LTLIBS"]="mca/pml/v/libmca_pml_v.la mca/pml/ucx/libmca_pml_ucx.la mca/pml/ob1/libmca_pml_ob1.la mca/pml/cm/libmca_pml_cm.la "
S["MCA_BUILD_ompi_pml_ucx_DSO_FALSE"]=""
S["MCA_BUILD_ompi_pml_ucx_DSO_TRUE"]="#"
S["pml_ucx_LIBS"]="-lucp -luct -lucs -lucm "
S["pml_ucx_LDFLAGS"]=""
S["pml_ucx_CPPFLAGS"]=""

I am on a PowerPC machine running Red Hat 8.4.

BTW, even now that I can build Open MPI, I hit weird segfaults. I'll post them in the next comment.

amirsojoodi commented 1 year ago
$ mpirun --mca pml ucx --mca btl ^smcuda,vader,openib,uct \
    -x UCX_TLS=rc,sm,cuda_copy,gdr_copy,cuda_ipc \
    -np 2 $BUILD_DIR/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw \
    --window-size 1 -m 2097152:67108864 H H

[mist-login01:4176932:0:4176932] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x11)
==== backtrace (tid:4176932) ====
=================================
--------------------------------------------------------------------------
prterun noticed that process rank 1 with PID 0 on node mist-login01 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

The error doesn't change with the CUDA benchmarks. It seems the problem is in MPI_Init. I tried MPI_THREAD_SINGLE, too, but no luck. I also rebuilt UCX without --enable-mt; no luck again.

However, changing the pml from ucx to ob1 somehow works:

mpirun --mca pml ob1 --mca btl '^vader,tcp,openib,uct' -np 2 \
  /project/q/queenspp/sojoodi/OpenMPI-Release/build/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw \
  --window-size 1 -m 2097152:67108864 H H
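Not something tried in the thread, but to get more out of that empty backtrace, UCX has documented environment variables that make its crash handling more verbose (the debug logging one only takes effect if UCX was configured with --enable-logging):

```shell
# Ask UCX's signal handler to print a full backtrace on fatal signals.
export UCX_HANDLE_ERRORS=bt

# Raise UCX log verbosity; effective only in an --enable-logging build.
export UCX_LOG_LEVEL=debug
```

With these set, re-running the same mpirun command should show where inside UCX the segfault originates.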
yosefe commented 1 year ago

@amirsojoodi I've tried on CentOS 8.4 and it works fine for me. Maybe there is an older version of UCX installed on your system? Can you please upload (or email me) the config.log and config-release.out files from the Open MPI build?

amirsojoodi commented 1 year ago

@yosefe Sorry for the late reply Yossi.

I finally got it to work with UCX. I had to specifically disable a bunch of modules. I don't know exactly which change fixed the issue, but I'll provide the commands here, just in case:

git clone --recursive https://github.com/openucx/ucx.git
cd ucx
git checkout v1.13.0
git submodule update --init --recursive

./autogen.sh 2>&1 | tee autogen.out

./configure --prefix=$BUILD_DIR \
  --with-cuda=$CUDA_HOME \
  --disable-assertions \
  --disable-debug \
  --disable-params-check \
  --without-knem \
  --without-xpmem \
  --without-ofi \
  --with-mlx5-dv \
  --enable-logging \
  --enable-compiler-opt=3 2>&1 | tee config-release.out

make -j32 all 2>&1 | tee make-release.out
make -j32 install 2>&1 | tee install-release.out

git clone --recursive https://github.com/open-mpi/ompi.git
cd ompi
git checkout v5.0.0rc9
git submodule update --init --recursive

perl autogen.pl -j 32 2>&1 | tee autogen.out

./configure --prefix=$BUILD_DIR \
  --disable-io-romio \
  --disable-io-ompio \
  --disable-mpi-fortran \
  --disable-oshmem \
  --enable-mca-no-build=btl-uct,btl-portals4,btl-ofi \
  --without-ofi \
  --without-portals4 \
  --without-ugni \
  --without-knem \
  --with-cuda=$CUDA_HOME \
  --with-cuda-libdir=$CUDA_COMPAT_PATH \
  --with-devel-headers \
  --with-hwloc=internal \
  --with-libevent=internal \
  --with-pmix=internal \
  --with-prrte=internal \
  --with-ucx=$BUILD_DIR \
  --with-ucx-libdir=$BUILD_DIR/lib 2>&1 | tee config-release.out

make -j32 all 2>&1 | tee make-release.out
make -j32 install 2>&1 | tee install-release.out

I am really tired of this right now. If I get a chance to figure out which change exactly fixed the issue, I'll post a follow-up comment/issue. Thanks for the help, @yosefe and @jsquyres!

amirsojoodi commented 1 year ago

@yosefe: As an update, I switched to UCX v1.12.1, and the previous configs worked just fine. I had to disable hcoll at runtime, but other than that everything was fine, for both CUDA and host pt2pt/collectives.

Updating UCX from 1.12.1 to 1.13.1 or newer brought back this weird error (similar to the previous one):

[mist-login01:3653042:0:3653042] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x11)
==== backtrace (tid:3653042) ====
=================================
[mist-login01:3653043:0:3653043] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x11)
==== backtrace (tid:3653043) ====
=================================
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 0 on node mist-login01 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Maybe it's a GCC (10.3.0) or CUDA (11.2.2) version mismatch... no idea. Anyway, I don't know whether I should close this issue or not, so I'll leave it open.

yosefe commented 1 year ago

@amirsojoodi when you updated UCX did you also rebuild OpenMPI on top of it?

amirsojoodi commented 1 year ago

@amirsojoodi when you updated UCX did you also rebuild OpenMPI on top of it?

Yes, I did.