Open vbrunini opened 1 week ago
@vbrunini I replicated this, but I'm not yet sure why Serial and Cuda are producing different results. It's not just an issue of numerical error or a tolerance being too small - I increased that and it still stagnates during the shifted QR.
@vbrunini I just posted a fix in #2345. Thanks for the complete reproducer. I added this matrix as a new test case.
Thanks, can you cherry-pick the fix over to Trilinos as well once it's ready?
FYI, it looks like with the change I now have some other tests hanging that were passing before. Will work on finding the inputs to the SVD that are causing it now, but likely won't have them until next week.
New hanging case:
template <typename ExecSpace>
void call_svd_in_parallel_for()
{
Kokkos::TeamPolicy<ExecSpace> team_pol(1, Kokkos::AUTO);
using ScratchMatrix = Kokkos::View<double[3][4], typename ExecSpace::scratch_memory_space>;
using ScratchVector = Kokkos::View<double[3], typename ExecSpace::scratch_memory_space>;
team_pol.set_scratch_size(1, Kokkos::PerThread(3*ScratchMatrix::shmem_size() + 3* ScratchVector::shmem_size()));
Kokkos::parallel_for(team_pol, KOKKOS_LAMBDA(const typename Kokkos::TeamPolicy<ExecSpace>::member_type &team)
{
ScratchMatrix A(team.thread_scratch(1));
ScratchMatrix U(team.thread_scratch(1));
ScratchMatrix V(team.thread_scratch(1));
ScratchVector S(team.thread_scratch(1));
ScratchVector work(team.thread_scratch(1));
Kokkos::single(Kokkos::PerTeam(team), [&]() {
A(0, 0) = -2.0305040121856084e-02;
A(1, 0) = 0.0000000000000000e+00;
A(2, 0) = 0.0000000000000000e+00;
A(0, 1) = -0.0000000000000000e+00;
A(1, 1) = -0.0000000000000000e+00;
A(2, 1) = 1.9506119814028472e-02;
A(0, 2) = -2.0305040121856091e-02;
A(1, 2) = 0.0000000000000000e+00;
A(2, 2) = 0.0000000000000000e+00;
A(0, 3) = -0.0000000000000000e+00;
A(1, 3) = -0.0000000000000000e+00;
A(2, 3) = 1.9506119814028472e-02;
KokkosBatched::SerialSVD::invoke(KokkosBatched::SVD_USV_Tag{}, A, U, S, V, work);
printf("S = {%.16f %.16f %.16f}\n", S(0), S(1), S(2));
printf("A(0) = {%.16f %.16f %.16f %.16f}\n", A(0, 0), A(0, 1), A(0, 2), A(0, 3));
printf("A(1) = {%.16f %.16f %.16f %.16f}\n", A(1, 0), A(1, 1), A(1, 2), A(1, 3));
printf("A(2) = {%.16f %.16f %.16f %.16f}\n", A(2, 0), A(2, 1), A(2, 2), A(2, 3));
printf("U(0) = {%.16f %.16f %.16f %.16f}\n", U(0, 0), U(0, 1), U(0, 2), U(0, 3));
printf("U(1) = {%.16f %.16f %.16f %.16f}\n", U(1, 0), U(1, 1), U(1, 2), U(1, 3));
printf("U(2) = {%.16f %.16f %.16f %.16f}\n", U(2, 0), U(2, 1), U(2, 2), U(2, 3));
printf("V(0) = {%.16f %.16f %.16f %.16f}\n", V(0, 0), V(0, 1), V(0, 2), V(0, 3));
printf("V(1) = {%.16f %.16f %.16f %.16f}\n", V(1, 0), V(1, 1), V(1, 2), V(1, 3));
printf("V(2) = {%.16f %.16f %.16f %.16f}\n", V(2, 0), V(2, 1), V(2, 2), V(2, 3));
});
});
}
This one seems to hang both on host & device execution spaces in my cuda build, as well as a clang build with just Kokkos::Serial
enabled.
@vbrunini I fixed the change that was causing this new case to hang. BTW, in this new case U should be 3x3, V should be 4x4 and work should be length 4 = max(m,n). But this wasn't the reason for it to hang, there was a real bug.
@vbrunini I fixed the change that was causing this new case to hang. BTW, in this new case U should be 3x3, V should be 4x4 and work should be length 4 = max(m,n). But this wasn't the reason for it to hang, there was a real bug.
Thanks, looks like it's working with the updated PR changes. And yeah the actual code where we call the SVD uses the right sizes for U/V/work, I just missed updating those for the reproducer.
This example call to SerialSVD hangs when run on the default execution space for a Cuda build, but runs fine in the default host execution space as well as in CPU builds that I have tried:
Cuda testing was on V100 using nvcc-11.2.1 with gcc-8.3.0 host compiler. @brian-kelley