open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

OMPI master not happy with --with-cuda and --enable-mca-dso #9762

Closed hppritcha closed 1 year ago

hppritcha commented 2 years ago

If one configures master with both the --with-cuda and --enable-mca-dso configure options, one gets unresolved symbol errors when opal_wrapper is being linked:

  CCLD     opal_wrapper
/usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_cuda_memcpy'
/usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_cuda_memcpy_sync'
/usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_cuda_memmove'
/usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_cuda_check_bufs'
/usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `mca_cuda_convertor_init'

hppritcha commented 2 years ago

The problem isn't present in v4.1.x or older, since the code of interest was located elsewhere in those branches.

hppritcha commented 2 years ago

looks like commit deb37ac03fff566b0ad235734f63213ec0775c72 introduced this regression

wzamazon commented 2 years ago

@wckzhang would you please take a look?

wckzhang commented 2 years ago

I'll try to take a look at it later this week

hppritcha commented 2 years ago

thanks @wckzhang . not super urgent, just wanted to document.

gpaulsen commented 2 years ago

@wckzhang Were you able to take a look?

wckzhang commented 2 years ago

I'll take a look while I address https://github.com/open-mpi/ompi/issues/9933

akesandgren commented 2 years ago

It also fails to build with just --enable-mca-dso, even without --with-cuda, this time when linking ompi_info:

/bin/bash ../../../libtool  --tag=CC   --mode=link /hpc2n/eb/software/GCCcore/11.2.0/bin/gcc -DOPAL_CONFIGURE_USER="\"ake\"" -DOPAL_CONFIGURE_HOST="\"b-an02.hpc2n.umu.se\"" -DOPAL_CONFIGURE_DATE="\"Tue Feb  1 15:54:12 UTC 2022\"" -DOMPI_BUILD_USER="\"$USER\"" -DOMPI_BUILD_HOST="\"${HOSTNAME:-`(hostname || uname -n) | sed 1q`}\"" -DOMPI_BUILD_DATE="\"`../../../../ompi-upstream/config/getdate.sh`\"" -DOMPI_BUILD_CFLAGS="\"-O3 -DNDEBUG -finline-functions -mcx16\"" -DOMPI_BUILD_CPPFLAGS="\"-iquote../../../../ompi-upstream -iquote../../.. -iquote../../../../ompi-upstream/opal/include -iquote../../../../ompi-upstream/ompi/include -iquote../../../../ompi-upstream/oshmem/include   -I/hpc2n/eb/software/libevent/2.1.12-GCCcore-11.2.0/include  -I/hpc2n/eb/software/hwloc/2.5.0-GCCcore-11.2.0/include  -I/hpc2n/eb/software/PMIx/4.1.0-GCCcore-11.2.0/include\"" -DOMPI_BUILD_CXXFLAGS="\"-DNDEBUG \"" -DOMPI_BUILD_CXXCPPFLAGS="\"@CXXCPPFLAGS@\"" -DOMPI_BUILD_FFLAGS="\"\"" -DOMPI_BUILD_FCFLAGS="\"\"" -DOMPI_BUILD_LDFLAGS="\"\"" -DOMPI_BUILD_LIBS="\"-lpthread -lrt -lm -lutil \"" -DOPAL_CC_ABSOLUTE="\"/hpc2n/eb/software/GCCcore/11.2.0/bin/gcc\"" -DOMPI_CXX_ABSOLUTE="\"/hpc2n/eb/software/GCCcore/11.2.0/bin/g++\"" -O3 -DNDEBUG -finline-functions -mcx16   -o ompi_info ompi_info.o param.o ../../../ompi/libmpi.la ../../../opal/libopen-pal.la -lpthread -lrt -lm -lutil 
libtool: link: /hpc2n/eb/software/GCCcore/11.2.0/bin/gcc -DOPAL_CONFIGURE_USER=\"ake\" -DOPAL_CONFIGURE_HOST=\"b-an02.hpc2n.umu.se\" "-DOPAL_CONFIGURE_DATE=\"Tue Feb  1 15:54:12 UTC 2022\"" -DOMPI_BUILD_USER=\"ake\" -DOMPI_BUILD_HOST=\"b-an02.hpc2n.umu.se\" "-DOMPI_BUILD_DATE=\"Tue 01 Feb 2022 04:04:35 PM UTC\"" "-DOMPI_BUILD_CFLAGS=\"-O3 -DNDEBUG -finline-functions -mcx16\"" "-DOMPI_BUILD_CPPFLAGS=\"-iquote../../../../ompi-upstream -iquote../../.. -iquote../../../../ompi-upstream/opal/include -iquote../../../../ompi-upstream/ompi/include -iquote../../../../ompi-upstream/oshmem/include   -I/hpc2n/eb/software/libevent/2.1.12-GCCcore-11.2.0/include  -I/hpc2n/eb/software/hwloc/2.5.0-GCCcore-11.2.0/include  -I/hpc2n/eb/software/PMIx/4.1.0-GCCcore-11.2.0/include\"" "-DOMPI_BUILD_CXXFLAGS=\"-DNDEBUG \"" -DOMPI_BUILD_CXXCPPFLAGS=\"@CXXCPPFLAGS@\" -DOMPI_BUILD_FFLAGS=\"\" -DOMPI_BUILD_FCFLAGS=\"\" -DOMPI_BUILD_LDFLAGS=\"\" "-DOMPI_BUILD_LIBS=\"-lpthread -lrt -lm -lutil \"" -DOPAL_CC_ABSOLUTE=\"/hpc2n/eb/software/GCCcore/11.2.0/bin/gcc\" -DOMPI_CXX_ABSOLUTE=\"/hpc2n/eb/software/GCCcore/11.2.0/bin/g++\" -O3 -DNDEBUG -finline-functions -mcx16 -o .libs/ompi_info ompi_info.o param.o  ../../../ompi/.libs/libmpi.so -L/hpc2n/eb/software/PMIx/4.1.0-GCCcore-11.2.0/lib -L/hpc2n/eb/software/OpenSSL/1.1/lib64 -L/hpc2n/eb/software/OpenSSL/1.1/lib -L/hpc2n/eb/software/zlib/1.2.11-GCCcore-11.2.0/lib64 -L/hpc2n/eb/software/zlib/1.2.11-GCCcore-11.2.0/lib -L/hpc2n/eb/software/binutils/2.37-GCCcore-11.2.0/lib64 -L/hpc2n/eb/software/binutils/2.37-GCCcore-11.2.0/lib -L/hpc2n/eb/software/GCCcore/11.2.0/lib64 -L/hpc2n/eb/software/GCCcore/11.2.0/lib -L/hpc2n/eb/software/libpciaccess/0.16-GCCcore-11.2.0/lib -L/hpc2n/eb/software/libxml2/2.9.10-GCCcore-11.2.0/lib -L/hpc2n/eb/software/XZ/5.2.5-GCCcore-11.2.0/lib64 -L/hpc2n/eb/software/XZ/5.2.5-GCCcore-11.2.0/lib -L/hpc2n/eb/software/gettext/0.21/lib64 -L/hpc2n/eb/software/gettext/0.21/lib -L/hpc2n/eb/software/libpciaccess/0.16-GCCcore-11.2.0/lib64 
-L/hpc2n/eb/software/libxml2/2.9.10-GCCcore-11.2.0/lib64 -L/hpc2n/eb/software/numactl/2.0.14-GCCcore-11.2.0/lib64 -L/hpc2n/eb/software/numactl/2.0.14-GCCcore-11.2.0/lib -L/hpc2n/eb/software/libfabric/1.13.2-GCCcore-11.2.0/lib64 -L/hpc2n/eb/software/libfabric/1.13.2-GCCcore-11.2.0/lib -L/hpc2n/eb/software/hwloc/2.5.0-GCCcore-11.2.0/lib64 -L/hpc2n/eb/software/hwloc/2.5.0-GCCcore-11.2.0/lib -L/hpc2n/eb/software/libevent/2.1.12-GCCcore-11.2.0/lib64 -L/hpc2n/eb/software/libevent/2.1.12-GCCcore-11.2.0/lib /home/a/ake/support-hpc2n/ake/openmpi-testing/510-test/build-cuda-master-dso/opal/.libs/libopen-pal.so ../../../opal/.libs/libopen-pal.so /hpc2n/eb/software/PMIx/4.1.0-GCCcore-11.2.0/lib/libpmix.so /usr/lib/x86_64-linux-gnu/libmunge.so -llustreapi /hpc2n/eb/software/libevent/2.1.12-GCCcore-11.2.0/lib/libevent_core.so /hpc2n/eb/software/libevent/2.1.12-GCCcore-11.2.0/lib/libevent_pthreads.so /hpc2n/eb/software/hwloc/2.5.0-GCCcore-11.2.0/lib/libhwloc.so /hpc2n/eb/software/libpciaccess/0.16-GCCcore-11.2.0/lib/libpciaccess.so /hpc2n/eb/software/libxml2/2.9.10-GCCcore-11.2.0/lib/libxml2.so -ldl -lz /hpc2n/eb/software/XZ/5.2.5-GCCcore-11.2.0/lib/liblzma.so -lpthread -lrt -lm -lutil -pthread -Wl,-rpath -Wl,/proj/nobackup/support-hpc2n/ake/openmpi-testing/510-test/inst-cuda-master-dso/lib -Wl,-rpath -Wl,/hpc2n/eb/software/libevent/2.1.12-GCCcore-11.2.0/lib -Wl,-rpath -Wl,/hpc2n/eb/software/hwloc/2.5.0-GCCcore-11.2.0/lib -Wl,-rpath -Wl,/hpc2n/eb/software/PMIx/4.1.0-GCCcore-11.2.0/lib -Wl,-rpath -Wl,/hpc2n/eb/software/libpciaccess/0.16-GCCcore-11.2.0/lib -Wl,-rpath -Wl,/hpc2n/eb/software/libxml2/2.9.10-GCCcore-11.2.0/lib -Wl,-rpath -Wl,/hpc2n/eb/software/XZ/5.2.5-GCCcore-11.2.0/lib
../../../ompi/.libs/libmpi.so: error: undefined reference to 'mca_common_ompio_decode_datatype'
../../../ompi/.libs/libmpi.so: error: undefined reference to 'mca_common_ompio_set_aggregator_props'

edgargabriel commented 2 years ago

PR #9944 has been filed to fix this issue.

wckzhang commented 2 years ago

Was able to reproduce, trying to wrap my head around the mca dso stuff now.

akesandgren commented 2 years ago

For other reasons, I'd strongly recommend reverting https://github.com/open-mpi/ompi/commit/deb37ac03fff566b0ad235734f63213ec0775c72. From my perspective that code doesn't belong in mca_common_cuda, since it doesn't directly depend on cuda.h. Reverting would also fix this problem for --enable-mca-dso=all together with --with-cuda.

wckzhang commented 2 years ago

I think the issue originated with this patch: https://github.com/open-mpi/ompi/pull/8788/commits/c81cdd76897499fb42099ef784fb2dfd86cc9f06

What happened I think was:

  1. Made it work with --enable-mca-dso (when that was the default)
  2. We changed the default build style to static (the code didn't work with static at the time, so it broke)
  3. Made it work with --enable-mca-static (the new default), and in doing so broke it with --enable-mca-dso

So I think the fix would be to re-add the code that was removed in the commit I mentioned and surround it with an MCA DSO build check. I really just moved the build breakage from static to dynamic and got confused by the default build switching.

wckzhang commented 2 years ago

To summarize, this bug was introduced when we moved the cuda copy/malloc/etc. code from datatype to common. With that change, when building components as DSOs, libopen-pal needs the common cuda component as a dependency. Simply adding that dependency would break the design, since the core libraries were never intended to depend on common libraries. Instead of reverting the code move, the requested change is to add an accelerator framework to host the cuda memory management. This will be needed in master and 5.0.x.
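
As a rough illustration of why a framework avoids the link-time dependency, here is a minimal C sketch. It assumes nothing about the real Open MPI accelerator API: every name in it (accel_module_t, accel_select, core_copy, the fake GPU module) is hypothetical. The idea is only that the core library calls through a table of function pointers with a host fallback, so it never references a component's symbols directly.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical module interface: the only thing the core library knows
 * about. Not Open MPI's actual accelerator framework API. */
typedef struct accel_module {
    const char *name;
    int (*mem_copy)(void *dst, const void *src, size_t len);
} accel_module_t;

/* Default module built into the core library: plain host memcpy. */
static int host_mem_copy(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    return 0;
}

static accel_module_t host_module = { "host", host_mem_copy };

/* Currently selected module; starts as the host fallback. */
static accel_module_t *selected_module = &host_module;

/* A component loaded at run time would hand its module to the core here.
 * Passing NULL restores the host fallback. */
void accel_select(accel_module_t *module)
{
    selected_module = module ? module : &host_module;
}

/* Core code path: no direct reference to any component symbol, so the
 * core library links cleanly whether or not a component is present. */
int core_copy(void *dst, const void *src, size_t len)
{
    return selected_module->mem_copy(dst, src, len);
}

/* Stand-in for a component that would normally live in its own DSO.
 * It only counts calls and copies on the host; a real component would
 * call into the device runtime instead. */
static int fake_gpu_calls = 0;

static int fake_gpu_mem_copy(void *dst, const void *src, size_t len)
{
    fake_gpu_calls++;
    memcpy(dst, src, len);
    return 0;
}

static accel_module_t fake_gpu_module = { "fake_gpu", fake_gpu_mem_copy };
```

In this sketch, calling accel_select(&fake_gpu_module) routes later core_copy calls through the component, while a build with no component simply keeps the host path. The undefined references in the original report are what happens when the core calls such functions by name instead.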

wckzhang commented 1 year ago

This shouldn't be a problem anymore, closing.