open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

problem linking mpi memory routines from C++ #12247

Closed floquet-cxx closed 9 months ago

floquet-cxx commented 9 months ago

Background information

I am compiling an application (my own) which has C++ as the top level but also uses some C and F77 libraries. The build system is cmake on OSX via macports. The application is about 2 decades old now, and has successfully used various versions of MPI-1 for most of that time. Recently I added some new MPI memory and IO routines to do with Cartesian communicators and parallel IO, and this has created a small set of linkage problems when using openmpi (the code links successfully if I use mpich). The problem has persisted across two OS X versions, so I suspect it lies in the openmpi code base. It has the flavour of a misplaced "extern" declaration in a header file.

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

4.1.6

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed via macports. (Of course this leaves open the possibility that the problem lies there...)

sudo port -N install  openmpi +gfortran
sudo port select --set mpi openmpi-mp-fortran

Details of the problem

The build fails at the linkage stage, unable to find 4 low-level (not API-level) MPI routines called from various object files. Below is a clip of the relevant messages.

[ 53%] Linking CXX executable elliptic_mp
Undefined symbols for architecture arm64:
  "__ZN3MPI3Win4FreeEv", referenced from:
      __ZTVN3MPI3WinE in auxfield.cpp.o
      __ZTVN3MPI3WinE in data2df.cpp.o
      __ZTVN3MPI3WinE in domain.cpp.o
      __ZTVN3MPI3WinE in geometry.cpp.o
      __ZTVN3MPI3WinE in mesh.cpp.o
      __ZTVN3MPI3WinE in message.cpp.o
      __ZTVN3MPI3WinE in helmholtz.cpp.o
      ...
  "__ZN3MPI4CommC2Ev", referenced from:
      __ZNK3MPI9Intracomm5CloneEv in auxfield.cpp.o
      __ZNK3MPI9Graphcomm5CloneEv in auxfield.cpp.o
      __ZNK3MPI8Cartcomm3SubEPKb in auxfield.cpp.o
      __ZNK3MPI9Intracomm12Create_graphEiPKiS2_b in auxfield.cpp.o
      __ZNK3MPI8Cartcomm5CloneEv in auxfield.cpp.o
      __ZNK3MPI9Intracomm11Create_cartEiPKiPKbb in auxfield.cpp.o
      __ZNK3MPI9Intercomm5MergeEb in auxfield.cpp.o
      ...
  "__ZN3MPI8Datatype4FreeEv", referenced from:
      __ZTVN3MPI8DatatypeE in auxfield.cpp.o
      __ZTVN3MPI8DatatypeE in data2df.cpp.o
      __ZTVN3MPI8DatatypeE in domain.cpp.o
      __ZTVN3MPI8DatatypeE in geometry.cpp.o
      __ZTVN3MPI8DatatypeE in mesh.cpp.o
      __ZTVN3MPI8DatatypeE in message.cpp.o
      __ZTVN3MPI8DatatypeE in helmholtz.cpp.o
      ...
  "_ompi_mpi_cxx_op_intercept", referenced from:
      __ZN3MPI2Op4InitEPFvPKvPviRKNS_8DatatypeEEb in auxfield.cpp.o
      __ZN3MPI2Op4InitEPFvPKvPviRKNS_8DatatypeEEb in data2df.cpp.o
      __ZN3MPI2Op4InitEPFvPKvPviRKNS_8DatatypeEEb in domain.cpp.o
      __ZN3MPI2Op4InitEPFvPKvPviRKNS_8DatatypeEEb in geometry.cpp.o
      __ZN3MPI2Op4InitEPFvPKvPviRKNS_8DatatypeEEb in mesh.cpp.o
      __ZN3MPI2Op4InitEPFvPKvPviRKNS_8DatatypeEEb in message.cpp.o
      __ZN3MPI2Op4InitEPFvPKvPviRKNS_8DatatypeEEb in helmholtz.cpp.o
      ...
ld: symbol(s) not found for architecture arm64
collect2: error: ld returned 1 exit status
make[2]: *** [elliptic_mp] Error 1
make[1]: *** [CMakeFiles/elliptic_mp.dir/all] Error 2
make: *** [all] Error 2

These linkage errors do not occur if I use mpich instead of openmpi (but right now I have issues with mpich not working on Sonoma). As I said, this looks like it could be an issue with C++ mangling routine names as a result of a misplaced "extern" declaration in a header.

An older code version which used a smaller subset of MPI continues to compile and run fine.

ggouaillardet commented 9 months ago

Your application is using the MPI C++ bindings, which were removed from the MPI standard more than a decade ago.

The right fix is to modernize your code and stop using them.

Meanwhile, you can rebuild Open MPI by passing --enable-mpi-cxx to the configure command line (note that this will no longer be possible with Open MPI 5).

floquet-cxx commented 9 months ago

Thanks - now could you please suggest a simple way to revert to using C bindings? It seems I can't just put extern "C" { #include <mpi.h> }. Perhaps I can put extern "C" {} around each and every instance of MPI use, but I don't think so.

An alternative may be to pull all the routines I've used out of the C++ code and put them into a C source file; effectively, to re-wrap them - that used to be my approach when I was only calling a small set of MPI routines. (Forgive me but that somehow doesn't seem like "modernization"!)

ggouaillardet commented 9 months ago

This is not straightforward, but it is not rocket science either.

For example, you can compare examples/ring_c.c and examples/ring_cxx.cc to get an idea of what has to be changed.

Stopping the use of an API that was removed from the MPI standard more than a decade ago is indeed modernization. But if you believe/expect something more C++ish is the way to modernize your code, then feel free to do your own research, starting for example with Boost.MPI or Elemental (do not ask me to help you though).

rhc54 commented 9 months ago

Another, perhaps simpler, option might be to just use OMPI v3.x - IIRC, the C++ bindings were still supported at that time. Unless you really need something in the 4.x series, you can probably find an older version that works for you.

floquet-cxx commented 9 months ago

Another question occurs to me though: why don't I get the same issues when I call a more restricted set of MPI routines directly from C++? That approach seems to work fine with openmpi. I would have thought that I should get many more instances of linkage failure if the problem is related to removal of C++ bindings.

ggouaillardet commented 9 months ago

You are probably using the C bindings.

floquet-cxx commented 9 months ago

But a top-level #include <mpi.h> is used in every C++ file I have. So how could I be using the C bindings? Maybe because testing for CXX is not uniform/consistent across openmpi include files? That seems a bit unlikely.

However, I'm happy to believe you that I should be using C bindings now, and make the adjustments if I have to.

ggouaillardet commented 9 months ago

There is only one include file, and it is mpi.h.

For example, you use the C bindings if you do something like MPI_Send(..., MPI_COMM_WORLD), and you use the C++ bindings if you call MPI::COMM_WORLD.Send(...).

That being said, maybe the C++ bindings have been built and your link command is missing -lmpi_cxx. That can happen if you link manually or if you link with mpicc instead of mpicxx.

floquet-cxx commented 9 months ago

Oh for sure, I am just using the C bindings within my C++ code. For example

MPI_File_set_view
  (fh, skip, Geometry::contig(), Geometry::fileview(),
   "native", MPI_INFO_NULL);

Thank you, changing the link command is a good idea. I don't have such easy direct control over link flags because I'm using macports and cmake. I will have a go though, and also feed that back to the macports maintainers.

Also, I can't find ring_cxx.cc in the examples...

floquet-cxx commented 9 months ago

Hmmmm. I should just be able to use the C bindings directly as I am doing though, right? Again, I am wondering if there is a missing extern "C" in some header file. I see a few remarks about this when I look through the headers.
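For illustration, here is a minimal sketch of the usual idiom that lets a C header be included from C++ without name mangling. All names (demo_type_free, release) are hypothetical stand-ins rather than anything from mpi.h, and the stub definition exists only so the sketch links on its own. If a header were missing this guard, the linker would complain about a mangled form of a C routine name, whereas the symbols in the log above (such as __ZN3MPI3Win4FreeEv) are mangled names of C++ class members.

```cpp
// Hypothetical C header contents, written so that both C and C++ can use it.
#ifdef __cplusplus
extern "C" {
#endif

int demo_type_free(int *handle);   /* C linkage: the symbol is not mangled */

#ifdef __cplusplus
}
#endif

// Stub definition so the sketch links; a real library would provide this.
extern "C" int demo_type_free(int *handle) {
    *handle = -1;                  // mark the handle as released
    return 0;                      // 0 means success in this sketch
}

// C++ caller: uses the C function directly, no extra extern "C" needed here.
bool release(int &handle) {
    return demo_type_free(&handle) == 0;
}
```

Calling release(h) on an int h sets h to -1 and returns true, because the C function is found under its plain, unmangled name.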

ggouaillardet commented 9 months ago

That's a C binding, but not the one the error message is about.

Try to grep for Clone and see what you get.

The example can be seen at https://github.com/open-mpi/ompi/blob/v4.1.x/examples/ring_cxx.cc

floquet-cxx commented 9 months ago

Hmmm. No Clone, but MPI_Datatype does certainly get used. And MPI_Type_free(). All in one simple base file. Both these things come up on the errors. Perhaps I'm mis-using something, let me review that file. (Why that would have worked with mpich, I don't know...). Thanks again for your suggestions.

ggouaillardet commented 9 months ago

Since you are not using the MPI C++ bindings, you should be able to compile with -DOMPI_SKIP_MPICXX. By doing so, no MPI C++ headers will be pulled in. But if the compilation fails, that would suggest your code is trying to use them.
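As a rough picture of how such a skip macro works, the sketch below guards a C++ layer behind a preprocessor test. It is not the actual content of mpi.h: DEMO_SKIP_CXX, demo_rank, demo::Comm and cxx_bindings_visible are hypothetical names, and the exact guard Open MPI uses may differ. The macro is defined in the file only to keep the sketch self-contained; in practice it would come from the compiler command line.

```cpp
// Hypothetical stand-in for -DOMPI_SKIP_MPICXX on the command line.
#define DEMO_SKIP_CXX 1

// C API: always visible, with C linkage. Stub body so the sketch links.
extern "C" int demo_rank() { return 0; }

// C++ layer: only compiled when the skip macro is absent.
#if defined(__cplusplus) && !defined(DEMO_SKIP_CXX)
namespace demo {
class Comm {
public:
    int Get_rank() const { return demo_rank(); }
};
}  // namespace demo
#define DEMO_HAVE_CXX_BINDINGS 1
#else
#define DEMO_HAVE_CXX_BINDINGS 0
#endif

// Report which API surface this translation unit can see.
bool cxx_bindings_visible() { return DEMO_HAVE_CXX_BINDINGS != 0; }
```

With the macro defined, any code that mentions demo::Comm fails to compile, which mirrors the diagnostic use suggested above: a clean build means only the C API is in use.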

jsquyres commented 9 months ago

I see there's been a bunch of back-n-forth here -- let me throw in a few things to check:

  1. It would be good to see exactly what underlying command is being run at [ 53%] Linking CXX executable elliptic_mp.
  2. It would probably be good to make a small example and see if you can narrow down the problem from there. E.g., seeing a linker complain about not finding __ZN3MPI3Win4FreeEv -- I can see it's clearly looking for some kind of variant of MPI_Win_free, but I don't know why the "Free" is capitalized in the missing symbol error message (it's lower case in the C bindings), and I don't know why the name would be munged (it's not munged in the C bindings). It's been a long, long time since I've worked with C++ so I don't remember these kinds of details, but is there a chance you're calling MPI_Win_Free() somewhere instead of MPI_Win_free()?

ggouaillardet commented 9 months ago

FWIW __ZN3MPI3Win4FreeEv is the mangled symbol for MPI::Win::Free()

That suggests the MPI C++ bindings are indeed available but libmpi_cxx.so is not used at link time.

I naively expect these undefined references would not be there if they were not used by the application, but you know, C++ does C++, so maybe I should drop my expectations.
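For anyone wanting to decode the other symbols in the log, here is a small sketch using abi::__cxa_demangle from <cxxabi.h>, which is available with GCC and Clang. The helper name demangle is my own. It assumes the macOS convention of one extra leading underscore on every symbol, which it strips before demangling.

```cpp
#include <cxxabi.h>   // abi::__cxa_demangle (GCC and Clang)
#include <cstdlib>    // std::free
#include <string>

// Turn a linker-reported symbol into a readable C++ name.
// Symbols that are not mangled C++ names are returned unchanged.
std::string demangle(const std::string &symbol) {
    std::string name = symbol;
    // macOS prefixes every symbol with '_', so "__ZN..." becomes "_ZN...".
    if (name.compare(0, 3, "__Z") == 0) {
        name.erase(0, 1);
    }
    // Itanium-ABI mangled names start with "_Z"; leave anything else alone.
    if (name.compare(0, 2, "_Z") != 0) {
        return symbol;
    }
    int status = 0;
    char *text = abi::__cxa_demangle(name.c_str(), nullptr, nullptr, &status);
    if (status != 0 || text == nullptr) {
        return symbol;
    }
    std::string result(text);
    std::free(text);  // the demangler allocates the buffer with malloc
    return result;
}
```

For example, demangle("__ZN3MPI3Win4FreeEv") gives "MPI::Win::Free()" and demangle("__ZTVN3MPI3WinE") gives "vtable for MPI::Win", while a plain C symbol such as _ompi_mpi_cxx_op_intercept comes back unchanged.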

jsquyres commented 9 months ago

FWIW __ZN3MPI3Win4FreeEv is the mangled symbol for MPI::Win::Free()

Ah, there it is. Ok.

Then I think it would be very good to see exactly what the underlying command is for [ 53%] Linking CXX executable elliptic_mp. It could be as simple as accidentally using mpicc instead of mpic++.

floquet-cxx commented 9 months ago

I have no idea why any MPI C++ bindings get invoked if I have only used the C bindings in the first place. And, why only 3 or 4 undefined symbols, when I have used quite a few MPI routines? (Pondering these issues again leads me to wonder about something buried in openmpi header files.)

Below is the link command for elliptic_mp. As I suspected, (Apple/Xcode) g++ is used as the linker, not mpicxx. For completeness, I have first listed the compile command for one of the object files (helmholtz.cpp.o) that complains about undefined symbols. The -DMPI_EX is a definition issued by me.

[ 54%] Building CXX object CMakeFiles/elliptic_mp.dir/elliptic/helmholtz.cpp.o
g++ -DMPI_EX -I/Users/hmb/develop-git/semtex-xxt/veclib -I/Users/hmb/develop-git/semtex-xxt/femlib -I/opt/local/include/openmpi-mp -I/Users/hmb/develop-git/semtex-xxt/src -I/Users/hmb/develop-git/semtex-xxt/elliptic -w -std=c++11 -O3 -DNDEBUG -arch arm64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk -MD -MT CMakeFiles/elliptic_mp.dir/elliptic/helmholtz.cpp.o -MF CMakeFiles/elliptic_mp.dir/elliptic/helmholtz.cpp.o.d -o CMakeFiles/elliptic_mp.dir/elliptic/helmholtz.cpp.o -c /Users/hmb/develop-git/semtex-xxt/elliptic/helmholtz.cpp

and

[ 55%] Linking CXX executable elliptic_mp
/opt/local/bin/cmake -E cmake_link_script CMakeFiles/elliptic_mp.dir/link.txt --verbose=1
g++   -w -std=c++11 -O3 -DNDEBUG -arch arm64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk -Wl,-search_paths_first -Wl,-headerpad_max_install_names CMakeFiles/elliptic_mp.dir/src/auxfield.cpp.o CMakeFiles/elliptic_mp.dir/src/data2df.cpp.o CMakeFiles/elliptic_mp.dir/src/domain.cpp.o CMakeFiles/elliptic_mp.dir/src/geometry.cpp.o CMakeFiles/elliptic_mp.dir/src/mesh.cpp.o CMakeFiles/elliptic_mp.dir/src/message.cpp.o CMakeFiles/elliptic_mp.dir/elliptic/helmholtz.cpp.o CMakeFiles/elliptic_mp.dir/elliptic/drive.cpp.o -o elliptic_mp   -L/opt/local/lib/gcc12/gcc/arm64-apple-darwin23/12.3.0  -L/opt/local/lib/gcc12  src/libsrc.a femlib/libfem.a veclib/libvec.a /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk/usr/lib/libblas.tbd /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk/usr/lib/liblapack.tbd /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk/usr/lib/libblas.tbd /opt/local/lib/openmpi-mp/libmpi.dylib /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk/usr/lib/liblapack.tbd /opt/local/lib/openmpi-mp/libmpi.dylib -lgfortran -lemutls_w -lgcc -lquadmath -lemutls_w -lgcc -lgcc 

The link step then prints the warning "-macosx_version_min has been renamed to -macos_version_min", followed by the same undefined-symbol messages shown in my original report above.

Thanks again for your thoughts.

ggouaillardet commented 9 months ago

Thanks for the feedback.

The easiest workaround is probably to cmake -DCMAKE_CXX_COMPILER=mpicxx

A slightly better one would be to understand why cmake did not pick up libmpi_cxx.so (it might be because the CMake files only request the C bindings for MPI instead of the C++ ones).

Or you can pass -DOMPI_SKIP_MPICXX to the C++ compiler.

FWIW

$ cat foo.cc
#include <mpi.h>
$  ~/local/ompi-v4.1.x/bin/mpicxx -c foo.cc
$ nm -u foo.o | grep MPI | grep -v _MPI_
__ZN3MPI3Win4FreeEv
__ZN3MPI4CommC2Ev
__ZN3MPI8Datatype4FreeEv

so yeah, some undefined C++ symbols are generated even if they are not used!

floquet-cxx commented 9 months ago

Pardon me, but I think these outcomes point to an error in the openmpi preprocessing system. Here is a slightly more extended example, albeit using g++ rather than mpicxx.

 semtex-xxt (xxt) >$ cat foo.cpp
#include <mpi.h>

int main() { return 0 ; }

Now, what happens with openmpi headers and g++:

semtex-xxt (xxt) >$ g++ -I /opt/local/include/openmpi-mp foo.cpp
-macosx_version_min has been renamed to -macos_version_min
Undefined symbols for architecture arm64:
  "_MPI_Abort", referenced from:
      __ZN3MPI4Comm5AbortEi in ccvzKD2X.o
  "_MPI_Accumulate", referenced from:
      __ZNK3MPI3Win10AccumulateEPKviRKNS_8DatatypeEiliS5_RKNS_2OpE in ccvzKD2X.o
  "_MPI_Allgather", referenced from:
      __ZNK3MPI4Comm9AllgatherEPKviRKNS_8DatatypeEPviS5_ in ccvzKD2X.o
  "_MPI_Allgatherv", referenced from:
      __ZNK3MPI4Comm10AllgathervEPKviRKNS_8DatatypeEPvPKiS8_S5_ in ccvzKD2X.o
  "_MPI_Allreduce", referenced from:
      __ZNK3MPI4Comm9AllreduceEPKvPviRKNS_8DatatypeERKNS_2OpE in ccvzKD2X.o
  "_MPI_Alltoall", referenced from:
      __ZNK3MPI4Comm8AlltoallEPKviRKNS_8DatatypeEPviS5_ in ccvzKD2X.o
  "_MPI_Alltoallv", referenced from:
      __ZNK3MPI4Comm9AlltoallvEPKvPKiS4_RKNS_8DatatypeEPvS4_S4_S7_ in ccvzKD2X.o
  "_MPI_Alltoallw", referenced from:
      __ZNK3MPI4Comm9AlltoallwEPKvPKiS4_PKNS_8DatatypeEPvS4_S4_S7_ in ccvzKD2X.o
  "_MPI_Barrier", referenced from:
      __ZNK3MPI4Comm7BarrierEv in ccvzKD2X.o
  "_MPI_Bcast", referenced from:
      __ZNK3MPI4Comm5BcastEPviRKNS_8DatatypeEi in ccvzKD2X.o
  "_MPI_Bsend", referenced from:
      __ZNK3MPI4Comm5BsendEPKviRKNS_8DatatypeEii in ccvzKD2X.o
  "_MPI_Bsend_init", referenced from:
      __ZNK3MPI4Comm10Bsend_initEPKviRKNS_8DatatypeEii in ccvzKD2X.o
  "_MPI_Cancel", referenced from:
      __ZNK3MPI7Request6CancelEv in ccvzKD2X.o
  "_MPI_Cart_coords", referenced from:
      __ZNK3MPI8Cartcomm10Get_coordsEiiPi in ccvzKD2X.o
  "_MPI_Cart_create", referenced from:
      __ZNK3MPI9Intracomm11Create_cartEiPKiPKbb in ccvzKD2X.o
...
... Around 100 lines of messages ...
...
  "__ZN3MPI3Win4FreeEv", referenced from:
      __ZTVN3MPI3WinE in ccvzKD2X.o
  "__ZN3MPI4CommC2Ev", referenced from:
      __ZN3MPI9IntracommC2Ev in ccvzKD2X.o
      __ZN3MPI9IntracommC1EP19ompi_communicator_t in ccvzKD2X.o
  "__ZN3MPI8Datatype4FreeEv", referenced from:
      __ZTVN3MPI8DatatypeE in ccvzKD2X.o
  "_ompi_mpi_comm_null", referenced from:
      __ZN3MPI9IntracommC1EP19ompi_communicator_t in ccvzKD2X.o
      __ZN3MPI8CartcommC1ERKP19ompi_communicator_t in ccvzKD2X.o
      __ZN3MPI9GraphcommC1ERKP19ompi_communicator_t in ccvzKD2X.o
  "_ompi_mpi_cxx_op_intercept", referenced from:
      __ZN3MPI2Op4InitEPFvPKvPviRKNS_8DatatypeEEb in ccvzKD2X.o
  "_ompi_op_set_cxx_callback", referenced from:
      __ZN3MPI2Op4InitEPFvPKvPviRKNS_8DatatypeEEb in ccvzKD2X.o
ld: symbol(s) not found for architecture arm64
collect2: error: ld returned 1 exit status

I suggest that is clearly a problem. Since I didn't ask for any MPI routines, nothing MPI-related should arise. Here is what occurs with mpich headers:

semtex-xxt (xxt) >$ g++ -I /opt/local/include/mpich-gcc12 foo.cpp
-macosx_version_min has been renamed to -macos_version_min
semtex-xxt (xxt) >$ nm a.out | grep MPI
semtex-xxt (xxt) >$ 

Exactly what I'd expect should happen. I'm unsure about the macosx warning but think it comes from Xcode/g++ and is unrelated.

ggouaillardet commented 9 months ago

I don't know...

Unlike MPICH, Open MPI has a lot of inlined C++ subroutines/constructors that invoke the C bindings. Even though I was not able to demonstrate this with a small reproducer, the fact is that a lot of undefined references get pulled in by the compiler. I would hope the compiler gets rid of the unused inline subroutines, but I am not sure this is a valid expectation, nor that Open MPI did something wrong.

Anyway, I strongly doubt this issue will be addressed, so I suggest you use one of the described workarounds. You can also upgrade to Open MPI 5 or rebuild Open MPI without --enable-mpi-cxx.
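To picture the kind of inlined wrapper meant here, the sketch below shows a virtual member function defined in a header whose body forwards to a C function. The names (demo_c_abort, demo::Comm, c_binding_calls) are hypothetical and the C function is a stub, so this is an illustration of the pattern, not Open MPI's code. It matches what the log shows: _MPI_Abort is reported as referenced from MPI::Comm::Abort(int), and MPI::Win::Free() from the vtable for MPI::Win. Because the function is virtual, its body is emitted together with the class vtable, so the reference to the C symbol can appear in an object file even when the application never calls the method.

```cpp
// Counts calls into the stand-in C binding, for demonstration only.
int c_binding_calls = 0;

// Hypothetical stand-in for a C binding such as MPI_Abort (stub body).
extern "C" int demo_c_abort(int comm_handle, int errorcode) {
    (void)comm_handle;        // unused in this stub
    ++c_binding_calls;
    return errorcode;
}

namespace demo {
class Comm {
public:
    explicit Comm(int handle) : handle_(handle) {}
    virtual ~Comm() = default;

    // Inline virtual wrapper: its body references the C symbol, and the
    // vtable references this function whether or not anyone calls it.
    virtual int Abort(int errorcode) const {
        return demo_c_abort(handle_, errorcode);
    }

private:
    int handle_;
};
}  // namespace demo
```

Constructing a demo::Comm and calling Abort(3) returns 3 and increments c_binding_calls once, showing the C++ method is only a thin layer over the C call.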

floquet-cxx commented 9 months ago

(Setting the C++ compiler to mpicxx did fix some issues, but I still couldn't get everything to work, since I use C and F77 too. I had trouble getting it all up and running.)

HOWEVER: I think I missed a step in my macports setup, which was to install port openmpi-default. That is meant to set all the compilers correctly/consistently. Using the equivalent cured my problems with mpich, so I can believe it will work with openmpi too. (The macports documentation could use some improvement! But, this is entirely understandable.) Thank you for your explanations and patience.