Closed. hppritcha closed this issue 1 year ago.
The problem isn't present in v4.1.x or older, since the code of interest was located elsewhere in those branches.
It looks like commit deb37ac03fff566b0ad235734f63213ec0775c72 introduced this regression.
@wckzhang would you please take a look?
I'll try to take a look at it later this week
Thanks @wckzhang. Not super urgent, just wanted to document it.
@wckzhang Were you able to take a look?
I'll take a look while I address https://github.com/open-mpi/ompi/issues/9933
It also fails to build with just --enable-mca-dso, even without --with-cuda, this time when linking ompi_info:
/bin/bash ../../../libtool --tag=CC --mode=link /hpc2n/eb/software/GCCcore/11.2.0/bin/gcc -DOPAL_CONFIGURE_USER="\"ake\"" -DOPAL_CONFIGURE_HOST="\"b-an02.hpc2n.umu.se\"" -DOPAL_CONFIGURE_DATE="\"Tue Feb 1 15:54:12 UTC 2022\"" -DOMPI_BUILD_USER="\"$USER\"" -DOMPI_BUILD_HOST="\"${HOSTNAME:-`(hostname || uname -n) | sed 1q`}\"" -DOMPI_BUILD_DATE="\"`../../../../ompi-upstream/config/getdate.sh`\"" -DOMPI_BUILD_CFLAGS="\"-O3 -DNDEBUG -finline-functions -mcx16\"" -DOMPI_BUILD_CPPFLAGS="\"-iquote../../../../ompi-upstream -iquote../../.. -iquote../../../../ompi-upstream/opal/include -iquote../../../../ompi-upstream/ompi/include -iquote../../../../ompi-upstream/oshmem/include -I/hpc2n/eb/software/libevent/2.1.12-GCCcore-11.2.0/include -I/hpc2n/eb/software/hwloc/2.5.0-GCCcore-11.2.0/include -I/hpc2n/eb/software/PMIx/4.1.0-GCCcore-11.2.0/include\"" -DOMPI_BUILD_CXXFLAGS="\"-DNDEBUG \"" -DOMPI_BUILD_CXXCPPFLAGS="\"@CXXCPPFLAGS@\"" -DOMPI_BUILD_FFLAGS="\"\"" -DOMPI_BUILD_FCFLAGS="\"\"" -DOMPI_BUILD_LDFLAGS="\"\"" -DOMPI_BUILD_LIBS="\"-lpthread -lrt -lm -lutil \"" -DOPAL_CC_ABSOLUTE="\"/hpc2n/eb/software/GCCcore/11.2.0/bin/gcc\"" -DOMPI_CXX_ABSOLUTE="\"/hpc2n/eb/software/GCCcore/11.2.0/bin/g++\"" -O3 -DNDEBUG -finline-functions -mcx16 -o ompi_info ompi_info.o param.o ../../../ompi/libmpi.la ../../../opal/libopen-pal.la -lpthread -lrt -lm -lutil
libtool: link: /hpc2n/eb/software/GCCcore/11.2.0/bin/gcc -DOPAL_CONFIGURE_USER=\"ake\" -DOPAL_CONFIGURE_HOST=\"b-an02.hpc2n.umu.se\" "-DOPAL_CONFIGURE_DATE=\"Tue Feb 1 15:54:12 UTC 2022\"" -DOMPI_BUILD_USER=\"ake\" -DOMPI_BUILD_HOST=\"b-an02.hpc2n.umu.se\" "-DOMPI_BUILD_DATE=\"Tue 01 Feb 2022 04:04:35 PM UTC\"" "-DOMPI_BUILD_CFLAGS=\"-O3 -DNDEBUG -finline-functions -mcx16\"" "-DOMPI_BUILD_CPPFLAGS=\"-iquote../../../../ompi-upstream -iquote../../.. -iquote../../../../ompi-upstream/opal/include -iquote../../../../ompi-upstream/ompi/include -iquote../../../../ompi-upstream/oshmem/include -I/hpc2n/eb/software/libevent/2.1.12-GCCcore-11.2.0/include -I/hpc2n/eb/software/hwloc/2.5.0-GCCcore-11.2.0/include -I/hpc2n/eb/software/PMIx/4.1.0-GCCcore-11.2.0/include\"" "-DOMPI_BUILD_CXXFLAGS=\"-DNDEBUG \"" -DOMPI_BUILD_CXXCPPFLAGS=\"@CXXCPPFLAGS@\" -DOMPI_BUILD_FFLAGS=\"\" -DOMPI_BUILD_FCFLAGS=\"\" -DOMPI_BUILD_LDFLAGS=\"\" "-DOMPI_BUILD_LIBS=\"-lpthread -lrt -lm -lutil \"" -DOPAL_CC_ABSOLUTE=\"/hpc2n/eb/software/GCCcore/11.2.0/bin/gcc\" -DOMPI_CXX_ABSOLUTE=\"/hpc2n/eb/software/GCCcore/11.2.0/bin/g++\" -O3 -DNDEBUG -finline-functions -mcx16 -o .libs/ompi_info ompi_info.o param.o ../../../ompi/.libs/libmpi.so -L/hpc2n/eb/software/PMIx/4.1.0-GCCcore-11.2.0/lib -L/hpc2n/eb/software/OpenSSL/1.1/lib64 -L/hpc2n/eb/software/OpenSSL/1.1/lib -L/hpc2n/eb/software/zlib/1.2.11-GCCcore-11.2.0/lib64 -L/hpc2n/eb/software/zlib/1.2.11-GCCcore-11.2.0/lib -L/hpc2n/eb/software/binutils/2.37-GCCcore-11.2.0/lib64 -L/hpc2n/eb/software/binutils/2.37-GCCcore-11.2.0/lib -L/hpc2n/eb/software/GCCcore/11.2.0/lib64 -L/hpc2n/eb/software/GCCcore/11.2.0/lib -L/hpc2n/eb/software/libpciaccess/0.16-GCCcore-11.2.0/lib -L/hpc2n/eb/software/libxml2/2.9.10-GCCcore-11.2.0/lib -L/hpc2n/eb/software/XZ/5.2.5-GCCcore-11.2.0/lib64 -L/hpc2n/eb/software/XZ/5.2.5-GCCcore-11.2.0/lib -L/hpc2n/eb/software/gettext/0.21/lib64 -L/hpc2n/eb/software/gettext/0.21/lib -L/hpc2n/eb/software/libpciaccess/0.16-GCCcore-11.2.0/lib64 
-L/hpc2n/eb/software/libxml2/2.9.10-GCCcore-11.2.0/lib64 -L/hpc2n/eb/software/numactl/2.0.14-GCCcore-11.2.0/lib64 -L/hpc2n/eb/software/numactl/2.0.14-GCCcore-11.2.0/lib -L/hpc2n/eb/software/libfabric/1.13.2-GCCcore-11.2.0/lib64 -L/hpc2n/eb/software/libfabric/1.13.2-GCCcore-11.2.0/lib -L/hpc2n/eb/software/hwloc/2.5.0-GCCcore-11.2.0/lib64 -L/hpc2n/eb/software/hwloc/2.5.0-GCCcore-11.2.0/lib -L/hpc2n/eb/software/libevent/2.1.12-GCCcore-11.2.0/lib64 -L/hpc2n/eb/software/libevent/2.1.12-GCCcore-11.2.0/lib /home/a/ake/support-hpc2n/ake/openmpi-testing/510-test/build-cuda-master-dso/opal/.libs/libopen-pal.so ../../../opal/.libs/libopen-pal.so /hpc2n/eb/software/PMIx/4.1.0-GCCcore-11.2.0/lib/libpmix.so /usr/lib/x86_64-linux-gnu/libmunge.so -llustreapi /hpc2n/eb/software/libevent/2.1.12-GCCcore-11.2.0/lib/libevent_core.so /hpc2n/eb/software/libevent/2.1.12-GCCcore-11.2.0/lib/libevent_pthreads.so /hpc2n/eb/software/hwloc/2.5.0-GCCcore-11.2.0/lib/libhwloc.so /hpc2n/eb/software/libpciaccess/0.16-GCCcore-11.2.0/lib/libpciaccess.so /hpc2n/eb/software/libxml2/2.9.10-GCCcore-11.2.0/lib/libxml2.so -ldl -lz /hpc2n/eb/software/XZ/5.2.5-GCCcore-11.2.0/lib/liblzma.so -lpthread -lrt -lm -lutil -pthread -Wl,-rpath -Wl,/proj/nobackup/support-hpc2n/ake/openmpi-testing/510-test/inst-cuda-master-dso/lib -Wl,-rpath -Wl,/hpc2n/eb/software/libevent/2.1.12-GCCcore-11.2.0/lib -Wl,-rpath -Wl,/hpc2n/eb/software/hwloc/2.5.0-GCCcore-11.2.0/lib -Wl,-rpath -Wl,/hpc2n/eb/software/PMIx/4.1.0-GCCcore-11.2.0/lib -Wl,-rpath -Wl,/hpc2n/eb/software/libpciaccess/0.16-GCCcore-11.2.0/lib -Wl,-rpath -Wl,/hpc2n/eb/software/libxml2/2.9.10-GCCcore-11.2.0/lib -Wl,-rpath -Wl,/hpc2n/eb/software/XZ/5.2.5-GCCcore-11.2.0/lib
../../../ompi/.libs/libmpi.so: error: undefined reference to 'mca_common_ompio_decode_datatype'
../../../ompi/.libs/libmpi.so: error: undefined reference to 'mca_common_ompio_set_aggregator_props'
PR #9944 has been filed to fix this issue.
Was able to reproduce, trying to wrap my head around the mca dso stuff now.
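For reference, the failing configuration can be sketched as follows (the CUDA install path and make parallelism are illustrative, not taken from the reports above):

```shell
git clone https://github.com/open-mpi/ompi.git
cd ompi
./autogen.pl
./configure --with-cuda=/usr/local/cuda --enable-mca-dso
make -j8   # fails at link time with undefined mca_common_ompio_* symbols
```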
Due to other reasons I'd strongly recommend reverting https://github.com/open-mpi/ompi/commit/deb37ac03fff566b0ad235734f63213ec0775c72. From my perspective that code doesn't belong in mca_common_cuda, since it doesn't directly depend on cuda.h. Reverting will also fix this problem for --enable-mca-dso=all together with --with-cuda.
I think the issue was introduced by this patch: https://github.com/open-mpi/ompi/pull/8788/commits/c81cdd76897499fb42099ef784fb2dfd86cc9f06
What I think happened was:
So I think the fix would be to re-add the code that was removed in the commit I mentioned and surround it with an MCA DSO build check. I really just moved the build breakage from static to dynamic builds and got confused by the default build switching.
To summarize, this bug was caused when we moved the CUDA copy/malloc/etc. code from datatype to common. When building with dynamic DSOs, that change means libopen-pal needs the common cuda component as a dependency. Simply adding that dependency, however, breaks the design, as the core libraries were never intended to depend on common libraries. Instead of reverting the code move, the requested change is to add an accelerator framework to host the CUDA memory management. We will need this in master and 5.0.x.
This shouldn't be a problem anymore, closing.
If one configures master with both the --with-cuda and --enable-mca-dso configure options, one gets unresolved symbol errors when opal_wrapper is being linked: