Open jrobcary opened 5 months ago
@ndellingwood for Amesos2
@jrobcary Is this with Visual Studio, WSL or what?
Visual Studio 2022, CUDA-12.3. On cygwin but building pure Wndows app (cygwin launches cmake, but then it's pure Windows.)
I've never built Trilinos on Windows or used VS and I'm not sure how to help, haven't seen this with builds on Linux. The trilinos_amd_*
type missing symbols are Amesos2 specific, but I don't know of any testing of Trilinos on Windows (Cuda or otherwise) and I wouldn't be surprised of issues occurring with other dependencies of Amesos2. AFAIK standalone Kokkos (the repo, outside of Trilinos) has CI testing for MSVC Cuda builds which is the only coverage.
One suggestion I can offer is to test builds enabling one dependency at a time, before hitting Amesos2; this won't resolve the trilinos_amd_*
missing symbol, but would status check up to the point of Amesos2.
So, enable:
Just to be clear, Trilinos builds just fine. It is at the point of using trilinos in another package that there is a failure. Ie, when linking into an executable, the missing symbol is exposed. However, just linking the library (which creates static libs) is not a problem
@jrobcary are you also building the tests in Trilinos without any issue?
Good suggestion for at least one of the link errors. Building with tests shows
lld-link: error: undefined symbol: void __cdecl Tpetra::Import_Util::sortAndMergeCrsEntries<class Kokkos::View<unsigned int64 , struct Kokkos::Device<class Kokkos::Serial, class Kokkos::HostSpace>>, class Kokkos::View<int , struct Kokkos::Device<class Kokkos::Serial, class Kokkos::HostSpace>>, class Kokkos::View<double *, struct Kokkos::Device<class Kokkos::Serial, class Kokkos::HostSpace>>>(class Kokkos::View<unsigned int64 , struct Kokkos::Device<class Kokkos::Serial, class Kokkos::HostSpace>> const &, class Kokkos::View<int , struct Kokkos::Device<class Kokkos::Serial, class Kokkos::HostSpace>> const &, class Kokkos::View<double *, struct Kokkos::Device<class Kokkos::Serial, class Kokkos::HostSpace>> const &)
referenced by tpetra.lib(Tpetra_CrsMatrix_DOUBLE_INT_LONG_LONG_SERIAL.cpp.obj):($ehgcr_303_808)
So it looks like a template instantiation of sortAndMergeCrsEntries is missing. I see this method defined in tpetra/core/src/Tpetra_Import_Util2.hpp:sortAndMergeCrsEntries
It is also used in xpetra with a note, ./xpetra/src/Utils/Xpetra_CrsMatrixUtils.hpp: if(lib == Xpetra::UseEpetra) {
throw(Xpetra::Exceptions::RuntimeError("Xpetra::CrsMatrixUtils::sortCrsEntries only available for GO=int or GO=long long with EpetraNode (Serial or OpenMP depending on configuration)"));
} else if(lib == Xpetra::UseTpetra) {
Tpetra::Import_Util::sortCrsEntries(CRS_rowptr, CRS_colind, CRS_vals);
}
There is no further explanation of why this is turned off, and in any case, it is done at runtime, not compile time.
I tried turning of Xpetra, but at configuration, this gave
*** ERROR: Setting Trilinos_ENABLE_MueLu=OFF which was 'ON' because MueLu has a required library dependence on disabled package Xpetra!
Ugh. Looks like an ETI issue
Tough for me to figure out. I see methods defined in Tpetra_Import_Util2.hpp, eg, at line 656, 732, 742, but the latter if ifndef'd out for Windows at line 738. Why? I commented out the ifndef's, but the result remains the same.
The calling source file, Tpetra_CrsMatrix_LONG_LONG_INT_LONG_LONG_SERIAL.cpp, contains only
namespace Tpetra {
TPETRA_ETI_MANGLING_TYPEDEFS()
TPETRA_CRSMATRIX_INSTANT( longlong, int, longlong, Tpetra_KokkosCompat_KokkosSerialWrapperNode )
}
In Tpetra_CrsMatrix_def.hpp, there are calls to sortCrsEntries and sortAndMergeCrsEntries with the same args, but only the latter is a missing symbol. It's as if the first was propertly instantiated, but the second was not.
Could this be the result of SFNAE in the below? Looks like the Windows compiler does not like
no type named const_type' in 'Teuchos::ArrayView<unsigned long long>
In file included from D:\winsame\cary\xsd2\builds\trilinos-15.1.1\ser\packages\tpetra\core\src\Tpetra_CrsGraph_INT_LONG_LONG_SERIAL.cpp:67:
In file included from D:\winsame\cary\xsd2\builds\trilinos-15.1.1\packages\tpetra\core\src\Tpetra_CrsGraph_def.hpp:60:
D:\winsame\cary\xsd2\builds\trilinos-15.1.1\packages\tpetra\core\src/Tpetra_Import_Util2.hpp(750,3): error: no matching function for call to 'sort_and_merge_matrix'
750 | KokkosSparse::sort_and_merge_matrix<execution_space, rowptr_view_type, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 751 | colind_view_type, vals_view_type>(CRS_rowptr_in, CRS_colind_in, CRS_vals_in,
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
D:\winsame\cary\xsd2\builds\trilinos-15.1.1\packages\kokkos-kernels\sparse\src\KokkosSparse_SortCrs.hpp(680,6): note: candidate template ignored: substitution failure [with exec_space = execution_space, rowmap_t = const Teuchos::ArrayView<unsigned long long>, entries_t = const Teuchos::ArrayView<int>, values_t = Teuchos::ArrayView<double>]: no type named 'const_type' in 'Teuchos::ArrayView<unsigned long long>'
680 | void sort_and_merge_matrix(const typename rowmap_t::const_type& rowmap_in,
| ^
and indeed there is no const_type in Teuchos_ArrayViewDecl.hpp. Should
const typename rowmap_t::const_type& rowmap_in
be
const typename rowmap_t::value_type& rowmap_in
(There is already a const modifier.)
I fixed this in c43bd4c.
The link error with sortAndMergeCrsEntries
was due to the fact that the one-template argument version here should be calling the two-template argument version, but MSVC (as the host compiler for nvcc) was instead deducing the three-template argument version. To avoid that, I had previously just ifdef'd out the definition of the three-template argument version on Windows, but that turned out to be needed by another function, causing a link error. So I changed the call in the one-template argument version to explicitly specify two template arguments.
The missing trilinos_amd_*
missing symbols were due to some C source files being compiled as CUDA sources. I fixed that in the CMake code.
@benc303 thanks for the helpful info, could you provide a PR with the changes in the sha you referenced (if a PR is not in progress already)? It would be helpful to get Ross's feedback on the TriBITS update.
I have looked through the current code, and there have been sufficient changes that this will not be easy. Specifically, before, TRIBITS_NOKOKKOS_PACKAGES was defined in cmake/tribits/core/package_arch/TribitsAddLibrary.cmake, and tribits_add_library() was called for all non-kokkos libs. Now TRIBITS_NOKOKKOS_PACKAGES is no longer even in the distribution, it seems. So what is the equivalent?
@jrobcary If by "before" you mean the winCuda branch and by "now" you mean develop, this makes sense: I introduced TRIBITS_NOKOKKOS_PACKAGES
in winCuda to control whether set_source_files_properties(... LANGUAGE ...)
was called. So that variable should be kept if I rebase on develop. Whether it works with current TriBITS is another question though.
Got it. I was trying to port just a few or the changes into 15.1.1 to get a non-CUDA build on windows, but that may be more difficult than I envisioned.
Bug Report
@csiefer2 Attempt to build trilinos with CUDA on Windows. This is a fork, cary@grating/.../trilinos-alturas$ git remote -v origin https://github.com/Alturas-Research/Trilinos.git (fetch) origin https://github.com/Alturas-Research/Trilinos.git, created off of 7c431d00fa6873e55b604b8c5fda358f3f050f21. A fork from that with a few changes for clang is cary@grating/.../trilinos-txc$ git remote -v origin https://github.com/Tech-XCorp/Trilinos.git (fetch) origin https://github.com/Tech-XCorp/Trilinos.git (push) When I try to link that to our code, I get the undefined symbols below. In this issue, I want to consider just the trilinosamd* errors. Is there a way to not have tpetra link to them?
Description
All trilinosamd* missing symbols: lld-link: error: undefined symbol: trilinos_amd_calloc lld-link: error: undefined symbol: trilinos_amd_free lld-link: error: undefined symbol: trilinos_amd_malloc lld-link: error: undefined symbol: trilinos_amd_realloc
There is another link error not to be considered for now IGNORE: lld-link: error: undefined symbol: void __cdecl Tpetra::Import_Util::sortAndMergeCrsEntries<class Kokkos::View<unsigned int64 , struct Kokkos::Device<class Kokkos::Cuda, class Kokkos::CudaUVMSpace>>, class Kokkos::View<int , struct Kokkos::Device<class Kokkos::Cuda, class Kokkos::CudaUVMSpace>>, class Kokkos::View<int *, struct Kokkos::Device<class Kokkos::Cuda, class Kokkos::CudaUVMSpace>>>(class Kokkos::View<unsigned int64 , struct Kokkos::Device<class Kokkos::Cuda, class Kokkos::CudaUVMSpace>> const &, class Kokkos::View<int , struct Kokkos::Device<class Kokkos::Cuda, class Kokkos::CudaUVMSpace>> const &, class Kokkos::View<int *, struct Kokkos::Device<class Kokkos::Cuda, class Kokkos::CudaUVMSpace>> const &)^M
Steps to Reproduce