kokkos / kokkos-kernels

Kokkos C++ Performance Portability Programming Ecosystem: Math Kernels - Provides BLAS, Sparse BLAS and Graph Kernels
Other
303 stars 96 forks source link

Hang in KokkosBatched::SerialSVD running in Cuda #2344

Open vbrunini opened 1 week ago

vbrunini commented 1 week ago

This example call to SerialSVD hangs when run on the default execution space for a Cuda build, but runs fine in the default host execution space as well as in CPU builds that I have tried:

template <typename ExecSpace>
void call_svd_in_parallel_for()
{
  Kokkos::TeamPolicy<ExecSpace> team_pol(1, Kokkos::AUTO);
  using ScratchMatrix = Kokkos::View<double[3][3], typename ExecSpace::scratch_memory_space>;
  using ScratchVector = Kokkos::View<double[3], typename ExecSpace::scratch_memory_space>;
  team_pol.set_scratch_size(1, Kokkos::PerThread(3*ScratchMatrix::shmem_size() + 3* ScratchVector::shmem_size()));
  Kokkos::parallel_for(team_pol, KOKKOS_LAMBDA(const typename Kokkos::TeamPolicy<ExecSpace>::member_type &team)
  {
    ScratchMatrix A(team.thread_scratch(1));
    ScratchMatrix U(team.thread_scratch(1));
    ScratchMatrix V(team.thread_scratch(1));
    ScratchVector S(team.thread_scratch(1));
    ScratchVector work(team.thread_scratch(1));
    Kokkos::single(Kokkos::PerTeam(team), [&]() {
      A(0, 0) = 0.000000;
      A(1, 0) = 3.58442287931538747e-02;
      A(2, 0) = 0.000000;
      A(0, 1) = 0.000000;
      A(1, 1) = 3.81743062695684907e-02;
      A(2, 1) = 0.000000;
      A(0, 2) = 0.000000;
      A(1, 2) = 0.000000;
      A(2, 2) = -5.55555555555555733e-02;

      KokkosBatched::SerialSVD::invoke(KokkosBatched::SVD_USV_Tag{}, A, U, S, V, work);

      printf("S = {%.16f %.16f %.16f}\n", S(0), S(1), S(2));
      printf("A(0) = {%.16f %.16f %.16f}\n", A(0, 0), A(0, 1), A(0, 2));
      printf("A(1) = {%.16f %.16f %.16f}\n", A(1, 0), A(1, 1), A(1, 2));
      printf("A(2) = {%.16f %.16f %.16f}\n", A(2, 0), A(2, 1), A(2, 2));
      printf("U(0) = {%.16f %.16f %.16f}\n", U(0, 0), U(0, 1), U(0, 2));
      printf("U(1) = {%.16f %.16f %.16f}\n", U(1, 0), U(1, 1), U(1, 2));
      printf("U(2) = {%.16f %.16f %.16f}\n", U(2, 0), U(2, 1), U(2, 2));
      printf("V(0) = {%.16f %.16f %.16f}\n", V(0, 0), V(0, 1), V(0, 2));
      printf("V(1) = {%.16f %.16f %.16f}\n", V(1, 0), V(1, 1), V(1, 2));
      printf("V(2) = {%.16f %.16f %.16f}\n", V(2, 0), V(2, 1), V(2, 2));
    });
  });
}

int main(int argc, char **argv)
{
  Kokkos::initialize(argc, argv);

  {
    printf("Running on host\n");
    call_svd_in_parallel_for<Kokkos::DefaultHostExecutionSpace>();
    Kokkos::fence();
    printf("Done\n");

    printf("Running on device\n");
    call_svd_in_parallel_for<Kokkos::DefaultExecutionSpace>();
    Kokkos::fence();
    printf("Done\n");
  }

  Kokkos::finalize();
  return 0;
}

Cuda testing was on V100 using nvcc-11.2.1 with gcc-8.3.0 host compiler. @brian-kelley

brian-kelley commented 1 week ago

@vbrunini I replicated this, but I'm not yet sure why Serial and Cuda are producing different results. It's not just an issue of numerical error or a tolerance being too small - I increased that and it still stagnates during the shifted QR.

brian-kelley commented 5 days ago

@vbrunini I just posted a fix in #2345. Thanks for the complete reproducer. I added this matrix as a new test case.

vbrunini commented 4 days ago

Thanks, can you cherry-pick the fix over to Trilinos as well once it's ready?

vbrunini commented 4 days ago

FYI, it looks like with the change I now have some other tests hanging that were passing before. Will work on finding the inputs to the SVD that are causing it now, but likely won't have them until next week.

vbrunini commented 21 hours ago

New hanging case:

template <typename ExecSpace>
void call_svd_in_parallel_for()
{
  Kokkos::TeamPolicy<ExecSpace> team_pol(1, Kokkos::AUTO);
  using ScratchMatrix = Kokkos::View<double[3][4], typename ExecSpace::scratch_memory_space>;
  using ScratchVector = Kokkos::View<double[3], typename ExecSpace::scratch_memory_space>;
  team_pol.set_scratch_size(1, Kokkos::PerThread(3*ScratchMatrix::shmem_size() + 3* ScratchVector::shmem_size()));
  Kokkos::parallel_for(team_pol, KOKKOS_LAMBDA(const typename Kokkos::TeamPolicy<ExecSpace>::member_type &team)
  {
    ScratchMatrix A(team.thread_scratch(1));
    ScratchMatrix U(team.thread_scratch(1));
    ScratchMatrix V(team.thread_scratch(1));
    ScratchVector S(team.thread_scratch(1));
    ScratchVector work(team.thread_scratch(1));
    Kokkos::single(Kokkos::PerTeam(team), [&]() {
A(0, 0) = -2.0305040121856084e-02;
A(1, 0) = 0.0000000000000000e+00;
A(2, 0) = 0.0000000000000000e+00;
A(0, 1) = -0.0000000000000000e+00;
A(1, 1) = -0.0000000000000000e+00;
A(2, 1) = 1.9506119814028472e-02;
A(0, 2) = -2.0305040121856091e-02;
A(1, 2) = 0.0000000000000000e+00;
A(2, 2) = 0.0000000000000000e+00;
A(0, 3) = -0.0000000000000000e+00;
A(1, 3) = -0.0000000000000000e+00;
A(2, 3) = 1.9506119814028472e-02;

      KokkosBatched::SerialSVD::invoke(KokkosBatched::SVD_USV_Tag{}, A, U, S, V, work);

      printf("S = {%.16f %.16f %.16f}\n", S(0), S(1), S(2));
      printf("A(0) = {%.16f %.16f %.16f %.16f}\n", A(0, 0), A(0, 1), A(0, 2), A(0, 3));
      printf("A(1) = {%.16f %.16f %.16f %.16f}\n", A(1, 0), A(1, 1), A(1, 2), A(1, 3));
      printf("A(2) = {%.16f %.16f %.16f %.16f}\n", A(2, 0), A(2, 1), A(2, 2), A(2, 3));
      printf("U(0) = {%.16f %.16f %.16f %.16f}\n", U(0, 0), U(0, 1), U(0, 2), U(0, 3));
      printf("U(1) = {%.16f %.16f %.16f %.16f}\n", U(1, 0), U(1, 1), U(1, 2), U(1, 3));
      printf("U(2) = {%.16f %.16f %.16f %.16f}\n", U(2, 0), U(2, 1), U(2, 2), U(2, 3));
      printf("V(0) = {%.16f %.16f %.16f %.16f}\n", V(0, 0), V(0, 1), V(0, 2), V(0, 3));
      printf("V(1) = {%.16f %.16f %.16f %.16f}\n", V(1, 0), V(1, 1), V(1, 2), V(1, 3));
      printf("V(2) = {%.16f %.16f %.16f %.16f}\n", V(2, 0), V(2, 1), V(2, 2), V(2, 3));
    });
  });
}

This one seems to hang both on host & device execution spaces in my cuda build, as well as a clang build with just Kokkos::Serial enabled.

brian-kelley commented 19 hours ago

@vbrunini I fixed the change that was causing this new case to hang. BTW, in this new case U should be 3x3, V should be 4x4 and work should be length 4 = max(m,n). But this wasn't the reason for it to hang, there was a real bug.

vbrunini commented 18 hours ago

@vbrunini I fixed the change that was causing this new case to hang. BTW, in this new case U should be 3x3, V should be 4x4 and work should be length 4 = max(m,n). But this wasn't the reason for it to hang, there was a real bug.

Thanks, looks like it's working with the updated PR changes. And yeah the actual code where we call the SVD uses the right sizes for U/V/work, I just missed updating those for the reproducer.