ginkgo-project / ginkgo

Numerical linear algebra software package
https://ginkgo-project.github.io/
BSD 3-Clause "New" or "Revised" License

Can't build ginkgo 1.8.0 with HIP #1693

Closed: lahwaacz closed this 1 month ago

lahwaacz commented 1 month ago

Hi! I'm working on upgrading ginkgo-hpc for Arch Linux. So far I have the following build commands (the parts for the base and CUDA packages are omitted):

  local common_cmake_flags=(
    -S $_pkgname-$pkgver -G Ninja
    -DCMAKE_BUILD_TYPE=None
    -DCMAKE_INSTALL_PREFIX=/usr
    -DGINKGO_BUILD_REFERENCE=ON
    -DGINKGO_BUILD_OMP=ON
    -DGINKGO_BUILD_MPI=ON
    -DGINKGO_HAVE_GPU_AWARE_MPI=ON
    -DGINKGO_BUILD_BENCHMARKS=ON
    -DGINKGO_BUILD_EXAMPLES=ON
    -DGINKGO_BUILD_DOC=ON
    -DGINKGO_BUILD_TESTS=ON
  )
  local _amdgpu_archs="gfx906"

  # -hip package
  # ginkgo has insufficient auto-detection for HIP_PATH https://github.com/ginkgo-project/ginkgo/issues/1624
  export ROCM_PATH=/opt/rocm
  export HIP_PATH="$ROCM_PATH"
  # Compile source code for supported GPU archs in parallel.
  # Use the gcc 13 toolchain, as ROCm is not compatible with gcc 14.
  export HIPFLAGS="-parallel-jobs=$(nproc) --gcc-install-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/13.3.0/"
  cmake -B build-hip "${common_cmake_flags[@]}" \
    -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
    -DCMAKE_CXX_FLAGS="${CXXFLAGS} -fcf-protection=none" \
    -DCMAKE_HIP_ARCHITECTURES="$_amdgpu_archs" \
    -DGINKGO_BUILD_CUDA=OFF \
    -DGINKGO_BUILD_HIP=ON \
    -DGINKGO_BUILD_SYCL=OFF
  cmake --build build-hip --verbose

I've backported https://github.com/ginkgo-project/ginkgo/pull/1670/commits/eb97b4969c66ca0fa9e91c339c7dc409cb6a9143 but still get the following error, which does not seem to be fixed in math.hpp on develop:

/build/ginkgo-hpc/src/ginkgo-1.8.0/include/ginkgo/core/base/math.hpp:704:12: error: no matching function for call to 'zero'
  704 |     return zero<T>();
      |            ^~~~~~~
/build/ginkgo-hpc/src/ginkgo-1.8.0/common/unified/matrix/sellp_kernels.cpp:82:48: note: in instantiation of function template specialization 'gko::zero<std::complex<float>>' requested here
   82 |                     i < row_end ? in_vals[i] : zero(values[out_idx]);
      |                                                ^
/build/ginkgo-hpc/src/ginkgo-1.8.0/omp/base/kernel_launch.hpp:34:17: note: in instantiation of function template specialization 'gko::kernels::omp::sellp::fill_in_matrix_data(std::shared_ptr<const DefaultExecutor>, const device_matrix_data<complex<float>, int> &, const int64 *, matrix::Sellp<complex<float>, int> *)::(anonymous class)::operator()<long, const int *, const std::complex<float> *, const long *, unsigned long, const unsigned long *, int *, std::complex<float> *>' requested here
   34 |         [&]() { fn(i, args...); }();
      |                 ^
/build/ginkgo-hpc/src/ginkgo-1.8.0/omp/base/kernel_launch.hpp:34:15: note: while substituting into a lambda expression here
   34 |         [&]() { fn(i, args...); }();
      |               ^
/build/ginkgo-hpc/src/ginkgo-1.8.0/omp/base/kernel_launch.hpp:111:5: note: in instantiation of function template specialization 'gko::kernels::omp::(anonymous namespace)::run_kernel_impl<(lambda at /build/ginkgo-hpc/src/ginkgo-1.8.0/common/unified/matrix/sellp_kernels.cpp:67:9), const int *, const std::complex<float> *, const long *, unsigned long, const unsigned long *, int *, std::complex<float> *>' requested here
  111 |     run_kernel_impl(exec, fn, size, map_to_device(args)...);
      |     ^
/build/ginkgo-hpc/src/ginkgo-1.8.0/common/unified/matrix/sellp_kernels.cpp:65:5: note: in instantiation of function template specialization 'gko::kernels::omp::run_kernel<(lambda at /build/ginkgo-hpc/src/ginkgo-1.8.0/common/unified/matrix/sellp_kernels.cpp:67:9), const int *, const std::complex<float> *, const long *&, unsigned long, const unsigned long *, int *, std::complex<float> *>' requested here
   65 |     run_kernel(
      |     ^
/build/ginkgo-hpc/src/ginkgo-1.8.0/include/ginkgo/core/base/math.hpp:686:1: note: candidate template ignored: requirement '!std::is_same<std::complex<float>, std::complex<float>>::value' was not satisfied [with T = std::complex<float>]
  686 | zero()
      | ^
/build/ginkgo-hpc/src/ginkgo-1.8.0/include/ginkgo/core/base/math.hpp:628:33: note: candidate function not viable: call to __host__ function from __device__ function
  628 | GKO_INLINE __host__ constexpr T zero()
      |                                 ^
/build/ginkgo-hpc/src/ginkgo-1.8.0/include/ginkgo/core/base/math.hpp:702:35: note: candidate function template not viable: requires 1 argument, but 0 were provided
  702 | GKO_INLINE __device__ constexpr T zero(const T&)
      |                                   ^    ~~~~~~~~
/build/ginkgo-hpc/src/ginkgo-1.8.0/include/ginkgo/core/base/math.hpp:644:33: note: candidate function template not viable: requires 1 argument, but 0 were provided
  644 | GKO_INLINE __host__ constexpr T zero(const T&)
      |                                 ^    ~~~~~~~~

Note that this is part of a ROCm 6.2.2 rebuild; we were not able to build ginkgo 1.8.0 with ROCm 6.0. I'm also not sure about the -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc flag, which was not needed before, but without it I get errors like this (maybe an ABI issue when C++ code compiled by GCC is linked with HIP code?):

[1116/1485] : && /usr/bin/c++ -march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions         -Wp,-D_FORTIFY_SOURCE=3 -Wformat -Werror=format-security         -fstack-clash-protection -fcf-protection         -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -Wp,-D_GLIBCXX_ASSERTIONS -g -ffile-prefix-map=/build/ginkgo-hpc/src=/usr/src/debug/ginkgo-hpc -flto=auto -Wl,-O1 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now          -Wl,-z,pack-relative-relocs -flto=auto     -Wl,-rpath -Wl,/usr/lib -Wl,--enable-new-dtags examples/ir-ilu-preconditioned-solver/CMakeFiles/ir-ilu-preconditioned-solver.dir/ir-ilu-preconditioned-solver.cpp.o -o examples/ir-ilu-preconditioned-solver/ir-ilu-preconditioned-solver  -Wl,-rpath,/build/ginkgo-hpc/src/build-hip/lib:/opt/rocm/lib  lib/libginkgo.so.1.8.0  lib/libginkgo_omp.so.1.8.0  lib/libginkgo_cuda.so.1.8.0  lib/libginkgo_reference.so.1.8.0  lib/libginkgo_hip.so.1.8.0  lib/libginkgo_dpcpp.so.1.8.0  lib/libginkgo_device.so.1.8.0  /usr/lib/libmpi.so  -Wl,-rpath-link,/opt/rocm/lib && :
FAILED: examples/ir-ilu-preconditioned-solver/ir-ilu-preconditioned-solver
: && /usr/bin/c++ -march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions         -Wp,-D_FORTIFY_SOURCE=3 -Wformat -Werror=format-security         -fstack-clash-protection -fcf-protection         -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -Wp,-D_GLIBCXX_ASSERTIONS -g -ffile-prefix-map=/build/ginkgo-hpc/src=/usr/src/debug/ginkgo-hpc -flto=auto -Wl,-O1 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now          -Wl,-z,pack-relative-relocs -flto=auto     -Wl,-rpath -Wl,/usr/lib -Wl,--enable-new-dtags examples/ir-ilu-preconditioned-solver/CMakeFiles/ir-ilu-preconditioned-solver.dir/ir-ilu-preconditioned-solver.cpp.o -o examples/ir-ilu-preconditioned-solver/ir-ilu-preconditioned-solver  -Wl,-rpath,/build/ginkgo-hpc/src/build-hip/lib:/opt/rocm/lib  lib/libginkgo.so.1.8.0  lib/libginkgo_omp.so.1.8.0  lib/libginkgo_cuda.so.1.8.0  lib/libginkgo_reference.so.1.8.0  lib/libginkgo_hip.so.1.8.0  lib/libginkgo_dpcpp.so.1.8.0  lib/libginkgo_device.so.1.8.0  /usr/lib/libmpi.so  -Wl,-rpath-link,/opt/rocm/lib && :
/usr/bin/ld: /tmp/ccqgcF5r.ltrans2.ltrans.o: in function `gko::EnableDefaultFactory<gko::preconditioner::Jacobi<double, int>::Factory, gko::preconditioner::Jacobi<double, int>, gko::preconditioner::Jacobi<double, int>::parameters_type, gko::LinOpFactory>::generate_impl(std::shared_ptr<gko::LinOp const>) const':
/usr/include/c++/14.2.1/bits/shared_ptr.h:720:(.text+0x9452): undefined reference to `typeinfo for gko::HipExecutor'
/usr/bin/ld: /tmp/ccqgcF5r.ltrans2.ltrans.o:/usr/include/c++/14.2.1/bits/shared_ptr.h:720:(.text+0x99ee): undefined reference to `typeinfo for gko::HipExecutor'
/usr/bin/ld: lib/libginkgo_hip.so.1.8.0: undefined reference to `vtable for gko::HipExecutor'
collect2: error: ld returned 1 exit status

MarcelKoch commented 1 month ago

I could reproduce the build issue with math.hpp using your settings and the ROCm 6.2 image. I think I also have a fix, which I'm currently testing.

One comment on your CMake flags: setting -DGINKGO_HAVE_GPU_AWARE_MPI=ON makes Ginkgo assume that it is linked against an MPI library that supports device memory. If that is not the case, MPI applications will just crash without any indication as to why. So my suggestion would be to disable it and leave it to users to enable it explicitly, and only if they know that their MPI supports device memory.
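As a rough sketch of how one might check this for Open MPI specifically: the Open MPI 5.x documentation describes `ompi_info` listing `rocm` (and `cuda`) on its "MPI extensions" line when that support was built in. The helper name `mpi_lists_rocm_extension` is made up for illustration. The check only shows build-time support in Open MPI, not that device buffers work at runtime, and other MPI libraries need their own test.

```shell
# Hypothetical helper: succeeds if the "MPI extensions" line read from
# stdin (ompi_info output) lists the rocm extension.
mpi_lists_rocm_extension() {
  grep -E '^[[:space:]]*MPI extensions:' | grep -qw 'rocm'
}

if command -v ompi_info >/dev/null 2>&1; then
  if ompi_info | mpi_lists_rocm_extension; then
    echo "Open MPI reports the rocm extension"
  else
    echo "Open MPI does not report the rocm extension"
  fi
else
  echo "ompi_info not found; cannot check"
fi
```

If the extension is not reported, leaving -DGINKGO_HAVE_GPU_AWARE_MPI at OFF is the safer choice.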

lahwaacz commented 1 month ago

On second thought, -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc might not be the intended way to compile Ginkgo, since hipcc treats all source files as HIP language source files. When I used -DCMAKE_CXX_COMPILER=/opt/rocm/lib/llvm/bin/amdclang++ instead, the build actually passed.
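For reference, the HIP-specific part of the configure step with that compiler swapped in might look like the sketch below. It only prints the command, so it can be inspected on a machine without ROCm. The variable name `cxx` is illustrative, the remaining flags are copied from the original command, and the common flags are left out for brevity.

```shell
# Assumption: ROCm is installed under /opt/rocm, as in the build script above.
ROCM_PATH="${ROCM_PATH:-/opt/rocm}"
# amdclang++ is the compiler the build passed with; hipcc treats every
# source file as HIP.
cxx="$ROCM_PATH/lib/llvm/bin/amdclang++"

# Dry run: print the configure command instead of executing it.
echo cmake -B build-hip \
  -DCMAKE_CXX_COMPILER="$cxx" \
  -DCMAKE_HIP_ARCHITECTURES=gfx906 \
  -DGINKGO_BUILD_CUDA=OFF \
  -DGINKGO_BUILD_HIP=ON \
  -DGINKGO_BUILD_SYCL=OFF
```

Dropping the leading `echo` would run the actual configure step.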

I have no idea why using g++ as the CXX compiler does not work anymore, though :shrug:

As for MPI, in Arch Linux we specifically have a GPU-aware OpenMPI package and don't support switching to another MPI library.