trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Tpetra: compilation error with Cuda 12.2 and GCC 12.3 #12237

Open maartenarnst opened 1 year ago

maartenarnst commented 1 year ago

@brian-kelley @csiefer2

We are compiling Trilinos with Cuda 12.2 and with GCC 12.3 as the host compiler.

We're seeing a compilation error for the file Tpetra_CrsMatrix_def.hpp:

INFO:root:#21 1477.5 ...build-with-GNU-Cuda-amd64/packages/tpetra/core/src/Tpetra_CrsMatrix_LONG_LONG_INT_LONG_LONG_SERIAL.cpp:74:16:   required from here
INFO:root:#21 1477.5 .../packages/tpetra/core/src/Tpetra_CrsMatrix_def.hpp:1689:139: error: no type named 'const_type' in 'cuda::std::__4::__is_primary_template<cuda::std::__4::iterator_traits<long long int, void> >'
INFO:root:#21 1477.5  1689 |         typename row_entries_type::const_type numRowEnt_h =
INFO:root:#21 1477.5       | 

It's quite mysterious because row_entries_type seems to be a Kokkos::View, so the line should be fine.

I think I tracked it down to this commit:

which ultimately results in an include of Kokkos_Sort.hpp in Tpetra_CrsMatrix_def.hpp. It seems to be the include of thrust/device_ptr.h and thrust/sort.h from Kokkos_Sort.hpp that ultimately causes the issue. I.e., if I compile using an older version of Trilinos and add those two thrust includes to Tpetra_CrsMatrix_def.hpp, I get the same error.

This is just a bug report. I have no explanation for this compilation error and no fix to propose.

jhux2 commented 1 year ago

@trilinos/tpetra

csiefer2 commented 1 year ago

Lovely. We currently do not test CUDA 12 or GCC 12. Any suggestion as to where we can find a machine to reproduce this?

brian-kelley commented 1 year ago

@csiefer2 Weaver has both of those

csiefer2 commented 11 months ago

@maartenarnst I finally got around to trying a build on weaver (IBM Power) with GCC 12.2, Cuda 12.0 and OpenMPI 4.1.4, and Tpetra compiles just fine. Can you post the configure you used here, so I can see whether it is some magic option thing or whether it really is some compiler issue in 12.3 that isn't in 12.2?

maartenarnst commented 11 months ago

Hi @csiefer2. Thanks for following up.

We're building and testing in a docker container based on the cuda:12.2.0-devel-ubuntu22.04 image. It's x86, GCC 12.3, Cuda 12.2 and openmpi 4.1.2.

I'll try to run the checks again tomorrow with our configuration, as well as with the GCC 12.2 and Cuda 12.0 that you are using. I'll keep you updated, and I'll also send more details.

Also tagging @romintomasetti.

csiefer2 commented 11 months ago

@maartenarnst

Configure I used:

cmake  \
-D CMAKE_CXX_COMPILER=`which mpicxx` \
-D CMAKE_C_COMPILER=`which mpicc` \
-D TPL_ENABLE_MPI=ON \
-D TPL_ENABLE_CUDA=ON \
   -D Kokkos_ARCH_VOLTA70=ON \
-D BUILD_SHARED_LIBS=ON \
-D Trilinos_ENABLE_Epetra=OFF \
-D Trilinos_ENABLE_Tpetra=ON \
  -D Tpetra_ENABLE_TESTS=ON \
  -D Tpetra_ENABLE_EXAMPLES=ON \
-D TPL_BLAS_LIBRARIES=$OPENBLAS_LIB/libopenblas.so \
-D TPL_LAPACK_LIBRARIES=$OPENBLAS_LIB/libopenblas.so \
../Trilinos

bathmatt commented 10 months ago

I know what is causing this and it is a bug in the compiler.

There is a bug filed with the nvcc team. It does not (or should not) happen with the nvc++ compiler.

I can provide a workaround. Replace

typedef decltype (myGraph_->k_numRowEntries_) row_entries_type;

with

typedef typename Kokkos::View<size_t*, Kokkos::LayoutLeft, device_type>::HostMirror row_entries_type;

and the error should go away. Like I said, the bug is in our compiler and it is triggered by decltype. (You need to update the using statement too.)
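The shape of this workaround can be sketched in plain C++, without Kokkos or CUDA. In the sketch below, `MockView`, `MockGraph`, `MockMatrix` and `typedefsAgree` are hypothetical stand-ins introduced only for illustration; they are not Tpetra or Kokkos names, and the snippet does not reproduce the nvcc bug itself. It only shows that naming the type explicitly yields the same type as deducing it with decltype, so later uses such as `typename row_entries_type::const_type` should be unaffected by the swap.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <type_traits>

// Hypothetical stand-in for Kokkos::View; only the nested const_type matters here.
template <class T>
struct MockView {
  using const_type = MockView<const T>;
};

// Hypothetical stand-in for the graph class that owns the row-entry counts.
struct MockGraph {
  MockView<std::size_t> k_numRowEntries_;
};

// Hypothetical stand-in for the matrix class holding a pointer to its graph.
struct MockMatrix {
  std::shared_ptr<MockGraph> myGraph_;

  bool typedefsAgree() const {
    // Original pattern: deduce the type from the member-access expression.
    typedef decltype(myGraph_->k_numRowEntries_) deduced_type;

    // Workaround pattern: spell the type out, so no decltype is involved.
    typedef MockView<std::size_t> explicit_type;

    // Dependent use of the typedef, analogous to row_entries_type::const_type.
    typedef typename explicit_type::const_type explicit_const_type;
    typedef typename deduced_type::const_type deduced_const_type;

    return std::is_same<deduced_type, explicit_type>::value &&
           std::is_same<deduced_const_type, explicit_const_type>::value;
  }
};
```

A quick check would be `MockMatrix m; assert(m.typedefsAgree());`. In the real code, the explicit type has to be kept in sync by hand with the declaration of `k_numRowEntries_`, which is the trade-off of avoiding decltype here.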

I will link the internal bug with this ticket.

with these changes... [74/74] Linking CXX executable packages/panzer/mini-em/example/BlockPrec/PanzerMiniEM_BlockPrec.exe

BTW, I can't provide a patch without a lot of approval, or I would.

bathmatt commented 10 months ago

Reproducer

template <class...> class c {
public:
  using ab = c;
};
class ac;
class ad;
typedef ac ai;
enum f { g };
template <class al> class h {
public:
  h(f = g);
  al *operator->();
};
template <class = ai> class j;
template <class, class, class> class k {
public:
  typedef c<> am;
  am l;
};
template <class, class an, class m, class ao> class n {
  using ap = j<>;
  using o = k<an, m, ao>;
  void aq(const h<const ap> &, const h<const ap> &, const h<ad> & = g);
  template <class al> h<n<al, an, m, ao>> ar() const;
  h<const ap> q() const;
  h<const ap> p() const;
  void at(const h<ad> &);
  h<o> t;
};
template <class b, b> struct av {};
typedef av<bool, false> e;
template <bool aw> using u = av<bool, aw>;
template <bool, class b = void> using r = b;
template <class, class> struct ae;
template <class b, class i> using s = u<ae<b, i>::a>;
template <template <class> class, class> e v;
template <template <class> class w, class... x> using ax = decltype(v<w, x...>);
template <class b> using ay = r<s<b, typename b::d>::a>;
template <class b> using __is_primary_template = ax<ay, b>;
template <class az, class an, class m, class ao>
void n<az, an, m, ao>::at(const h<ad> &) {
  typedef decltype(t->l) ba;
  typename ba::ab bb;
  bb;
}
template <class az, class an, class m, class ao>
void n<az, an, m, ao>::aq(const h<const ap> &, const h<const ap> &,
                          const h<ad> &bc) {
  at(bc);
}
template <class az, class an, class m, class ao>
template <class al>
h<n<al, an, m, ao>> n<az, an, m, ao>::ar() const {
  h<n> be;
  be->aq(q(), p());
}
template h<n<double, int, long, ai>> n<double, int, long, ai>::ar() const;

csiefer2 commented 10 months ago

@maartenarnst I have a PR up which blindly tries @bathmatt's fix. Can you check?

bathmatt commented 8 months ago

@csiefer2 et al, I just approved the fix in the compiler that should hit in 12.5. Sorry we couldn't fix it sooner.