trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/
Other
1.19k stars 564 forks source link

Tpetra, Amesos2: missing symbols on Windows Build #13069

Open jrobcary opened 2 months ago

jrobcary commented 2 months ago

Bug Report

@csiefer2 Attempt to build trilinos with CUDA on Windows. This is a fork, cary@grating/.../trilinos-alturas$ git remote -v origin https://github.com/Alturas-Research/Trilinos.git (fetch) origin https://github.com/Alturas-Research/Trilinos.git, created off of 7c431d00fa6873e55b604b8c5fda358f3f050f21. A fork from that with a few changes for clang is cary@grating/.../trilinos-txc$ git remote -v origin https://github.com/Tech-XCorp/Trilinos.git (fetch) origin https://github.com/Tech-XCorp/Trilinos.git (push) When I try to link that to our code, I get the undefined symbols below. In this issue, I want to consider just the trilinosamd* errors. Is there a way to not have tpetra link to them?

Description

lld-link: error: undefined symbol: trilinos_amd_malloc
>>> referenced by amesos2.lib(Amesos2_KLU2.cpp.obj):(struct KLU2::klu_symbolic * __cdecl KLU2::order_and_analyze(int, int *const, int *const, struct KLU2::klu_common *))
>>> referenced by amesos2.lib(Amesos2_KLU2.cpp.obj):(struct KLU2::klu_symbolic * __cdecl KLU2::order_and_analyze(int, int *const, int *const, struct KLU2::klu_common *))
>>> referenced by trilinosss.lib(trilinos_amd_order.c.obj):($LN90)
>>> referenced 5 more times

All trilinosamd* missing symbols: lld-link: error: undefined symbol: trilinos_amd_calloc lld-link: error: undefined symbol: trilinos_amd_free lld-link: error: undefined symbol: trilinos_amd_malloc lld-link: error: undefined symbol: trilinos_amd_realloc

There is another link error not to be considered for now IGNORE: lld-link: error: undefined symbol: void __cdecl Tpetra::Import_Util::sortAndMergeCrsEntries<class Kokkos::View<unsigned int64 , struct Kokkos::Device<class Kokkos::Cuda, class Kokkos::CudaUVMSpace>>, class Kokkos::View<int , struct Kokkos::Device<class Kokkos::Cuda, class Kokkos::CudaUVMSpace>>, class Kokkos::View<int *, struct Kokkos::Device<class Kokkos::Cuda, class Kokkos::CudaUVMSpace>>>(class Kokkos::View<unsigned int64 , struct Kokkos::Device<class Kokkos::Cuda, class Kokkos::CudaUVMSpace>> const &, class Kokkos::View<int , struct Kokkos::Device<class Kokkos::Cuda, class Kokkos::CudaUVMSpace>> const &, class Kokkos::View<int *, struct Kokkos::Device<class Kokkos::Cuda, class Kokkos::CudaUVMSpace>> const &)^M

referenced by tpetra.lib(Tpetra_CrsMatrix_INT_INT_LONG_LONG_CUDA.cpp.obj):($LN717)^M

Steps to Reproduce

  1. SHA1: [insert here]
  2. Configure sc grating.txcorp.com-trilinos-sercuda-config.sh.txt ript: [attach here]
  3. Configure log: [attach h grating.txcorp.com-trilinos-sercuda-config.txt ere]
  4. Build log: [attach here] grating.txcorp.com-trilinos-sercuda-build.txt
  5. Link into another application.
csiefer2 commented 2 months ago

@ndellingwood for Amesos2

csiefer2 commented 2 months ago

@jrobcary Is this with Visual Studio, WSL or what?

jrobcary commented 2 months ago

Visual Studio 2022, CUDA-12.3. On cygwin but building pure Wndows app (cygwin launches cmake, but then it's pure Windows.)

ndellingwood commented 2 months ago

I've never built Trilinos on Windows or used VS and I'm not sure how to help, haven't seen this with builds on Linux. The trilinos_amd_* type missing symbols are Amesos2 specific, but I don't know of any testing of Trilinos on Windows (Cuda or otherwise) and I wouldn't be surprised of issues occurring with other dependencies of Amesos2. AFAIK standalone Kokkos (the repo, outside of Trilinos) has CI testing for MSVC Cuda builds which is the only coverage.

One suggestion I can offer is to test builds enabling one dependency at a time, before hitting Amesos2; this won't resolve the trilinos_amd_* missing symbol, but would status check up to the point of Amesos2. So, enable:

  1. Kokkos; if successful, then also enable:
  2. Kokkos Kernels; if successful, then also enable:
  3. Tpetra (and Epetra, since your configuration enables Trilinos_ENABLE_EpetraExt); if successful, then also enable:
  4. Amesos2 (will enable TrilinosSS, which is triggering the symbol error) - but if the above (and their dependencies) cannot be built, you won't be able to build Amesos2
jrobcary commented 2 months ago

Just to be clear, Trilinos builds just fine. It is at the point of using trilinos in another package that there is a failure. Ie, when linking into an executable, the missing symbol is exposed. However, just linking the library (which creates static libs) is not a problem

ndellingwood commented 2 months ago

@jrobcary are you also building the tests in Trilinos without any issue?

jrobcary commented 2 months ago

Good suggestion for at least one of the link errors. Building with tests shows

lld-link: error: undefined symbol: void __cdecl Tpetra::Import_Util::sortAndMergeCrsEntries<class Kokkos::View<unsigned int64 , struct Kokkos::Device<class Kokkos::Serial, class Kokkos::HostSpace>>, class Kokkos::View<int , struct Kokkos::Device<class Kokkos::Serial, class Kokkos::HostSpace>>, class Kokkos::View<double *, struct Kokkos::Device<class Kokkos::Serial, class Kokkos::HostSpace>>>(class Kokkos::View<unsigned int64 , struct Kokkos::Device<class Kokkos::Serial, class Kokkos::HostSpace>> const &, class Kokkos::View<int , struct Kokkos::Device<class Kokkos::Serial, class Kokkos::HostSpace>> const &, class Kokkos::View<double *, struct Kokkos::Device<class Kokkos::Serial, class Kokkos::HostSpace>> const &)

referenced by tpetra.lib(Tpetra_CrsMatrix_DOUBLE_INT_LONG_LONG_SERIAL.cpp.obj):($ehgcr_303_808)

So it looks like a template instantiation of sortAndMergeCrsEntries is missing. I see this method defined in tpetra/core/src/Tpetra_Import_Util2.hpp:sortAndMergeCrsEntries

It is also used in xpetra with a note, ./xpetra/src/Utils/Xpetra_CrsMatrixUtils.hpp: if(lib == Xpetra::UseEpetra) {

if defined(HAVE_XPETRA_EPETRA)

    throw(Xpetra::Exceptions::RuntimeError("Xpetra::CrsMatrixUtils::sortCrsEntries only available for GO=int or GO=long long with EpetraNode (Serial or OpenMP depending on configuration)"));

endif // HAVE_XPETRA_EPETRA

  } else if(lib == Xpetra::UseTpetra) {

ifdef HAVE_XPETRA_TPETRA

    Tpetra::Import_Util::sortCrsEntries(CRS_rowptr, CRS_colind, CRS_vals);

endif // HAVE_XPETRA_TPETRA

  }

There is no further explanation of why this is turned off, and in any case, it is done at runtime, not compile time.

I tried turning of Xpetra, but at configuration, this gave

*** ERROR: Setting Trilinos_ENABLE_MueLu=OFF which was 'ON' because MueLu has a required library dependence on disabled package Xpetra!

csiefer2 commented 2 months ago

Ugh. Looks like an ETI issue

jrobcary commented 2 months ago

Tough for me to figure out. I see methods defined in Tpetra_Import_Util2.hpp, eg, at line 656, 732, 742, but the latter if ifndef'd out for Windows at line 738. Why? I commented out the ifndef's, but the result remains the same.

The calling source file, Tpetra_CrsMatrix_LONG_LONG_INT_LONG_LONG_SERIAL.cpp, contains only

namespace Tpetra {

  TPETRA_ETI_MANGLING_TYPEDEFS()

  TPETRA_CRSMATRIX_INSTANT( longlong, int, longlong, Tpetra_KokkosCompat_KokkosSerialWrapperNode )

}

In Tpetra_CrsMatrix_def.hpp, there are calls to sortCrsEntries and sortAndMergeCrsEntries with the same args, but only the latter is a missing symbol. It's as if the first was propertly instantiated, but the second was not.

jrobcary commented 2 months ago

Could this be the result of SFNAE in the below? Looks like the Windows compiler does not like

no type named const_type' in 'Teuchos::ArrayView<unsigned long long>

In file included from D:\winsame\cary\xsd2\builds\trilinos-15.1.1\ser\packages\tpetra\core\src\Tpetra_CrsGraph_INT_LONG_LONG_SERIAL.cpp:67:
In file included from D:\winsame\cary\xsd2\builds\trilinos-15.1.1\packages\tpetra\core\src\Tpetra_CrsGraph_def.hpp:60:
D:\winsame\cary\xsd2\builds\trilinos-15.1.1\packages\tpetra\core\src/Tpetra_Import_Util2.hpp(750,3): error: no matching function for call to 'sort_and_merge_matrix'
  750 |   KokkosSparse::sort_and_merge_matrix<execution_space, rowptr_view_type,      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  751 |                                       colind_view_type, vals_view_type>(CRS_rowptr_in, CRS_colind_in, CRS_vals_in,
      |                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
D:\winsame\cary\xsd2\builds\trilinos-15.1.1\packages\kokkos-kernels\sparse\src\KokkosSparse_SortCrs.hpp(680,6): note: candidate template ignored: substitution failure [with exec_space = execution_space, rowmap_t = const Teuchos::ArrayView<unsigned long long>, entries_t = const Teuchos::ArrayView<int>, values_t = Teuchos::ArrayView<double>]: no type named 'const_type' in 'Teuchos::ArrayView<unsigned long long>'
  680 | void sort_and_merge_matrix(const typename rowmap_t::const_type& rowmap_in,
      |      ^  

and indeed there is no const_type in Teuchos_ArrayViewDecl.hpp. Should

const typename rowmap_t::const_type& rowmap_in

be

const typename rowmap_t::value_type& rowmap_in

(There is already a const modifier.)

benc303 commented 2 months ago

I fixed this in c43bd4c.

The link error with sortAndMergeCrsEntries was due to the fact that the one-template argument version here should be calling the two-template argument version, but MSVC (as the host compiler for nvcc) was instead deducing the three-template argument version. To avoid that, I had previously just ifdef'd out the definition of the three-template argument version on Windows, but that turned out to be needed by another function, causing a link error. So I changed the call in the one-template argument version to explicitly specify two template arguments.

The missing trilinos_amd_* missing symbols were due to some C source files being compiled as CUDA sources. I fixed that in the CMake code.

ndellingwood commented 2 months ago

@benc303 thanks for the helpful info, could you provide a PR with the changes in the sha you referenced (if a PR is not in progress already)? It would be helpful to get Ross's feedback on the TriBITS update.

jrobcary commented 2 months ago

I have looked through the current code, and there have been sufficient changes that this will not be easy. Specifically, before, TRIBITS_NOKOKKOS_PACKAGES was defined in cmake/tribits/core/package_arch/TribitsAddLibrary.cmake, and tribits_add_library() was called for all non-kokkos libs. Now TRIBITS_NOKOKKOS_PACKAGES is no longer even in the distribution, it seems. So what is the equivalent?

benc303 commented 2 months ago

@jrobcary If by "before" you mean the winCuda branch and by "now" you mean develop, this makes sense: I introduced TRIBITS_NOKOKKOS_PACKAGES in winCuda to control whether set_source_files_properties(... LANGUAGE ...) was called. So that variable should be kept if I rebase on develop. Whether it works with current TriBITS is another question though.

jrobcary commented 2 months ago

Got it. I was trying to port just a few or the changes into 15.1.1 to get a non-CUDA build on windows, but that may be more difficult than I envisioned.