kokkos / kokkos-kernels

Kokkos C++ Performance Portability Programming Ecosystem: Math Kernels - Provides BLAS, Sparse BLAS and Graph Kernels

QR on a single matrix: valgrind reports invalid reads and writes #2328

Open cwsmith opened 2 months ago

cwsmith commented 2 months ago

Hello,

Calling SerialQR on a single matrix defined as Kokkos::View<double[16][10]> and running with the Kokkos Serial backend results in valgrind invalid read and write errors (pasted below). The reproducer is pasted below.

Interestingly, when using the CUDA backend in an expanded version of the reproducer (which includes a result comparison after applying the QR factorization via ApplyQ and Trsv) there are no obvious issues.

Note, I'm still figuring out how the QR interface works, hence the single matrix input to QR.

Am I doing anything obviously wrong here? Any help is appreciated.

reproducer

#include <KokkosBatched_QR_Decl.hpp>     //KokkosBatched::SerialQR
#include <KokkosBatched_Util.hpp>        //KokkosBlas::Algo
#include <Kokkos_Core.hpp>

void testQR() {
  typedef Kokkos::View<double[16][10]> MatrixViewType;
  typedef Kokkos::View<double[10]> ColVectorViewType;
  typedef Kokkos::View<double[10]> ColWorkViewType;

  MatrixViewType A("A");
  ColVectorViewType t("t");
  ColWorkViewType w("w");

  // roughly following
  // kokkos-kernels/batched/dense/unit_test/Test_Batched_TeamVectorQR.hpp
  typedef KokkosBlas::Algo::QR::Unblocked AlgoTagType;
  Kokkos::parallel_for("serialQR", 1, KOKKOS_LAMBDA(int) {
        // compute the QR factorization of A and store the results in A and t
        // (tau) - see the lapack dgeqp3(...) documentation:
        // www.netlib.org/lapack/explore-html-3.6.1/dd/d9a/group__double_g_ecomputational_ga1b0500f49e03d2771b797c6e88adabbb.html
        KokkosBatched::SerialQR<AlgoTagType>::invoke(A, t, w);
      });
}

int main(int argc, char **argv) {
  Kokkos::ScopeGuard scope_guard(argc, argv);
  testQR();
}

kokkos and kokkos-kernels build

I'm building kokkos (develop @ c2a342b26) and kokkos-kernels (develop @ f26fbca1b) with the following cmake commands using GCC 12.3.0 on a RHEL9 system.

bdir=buildKokkosSerial
cmake -S kokkos -B $bdir \
  -DBUILD_SHARED_LIBS=on \
  -DCMAKE_CXX_COMPILER=g++ \
  -DKokkos_ENABLE_SERIAL=ON \
  -DKokkos_ENABLE_OPENMP=off \
  -DKokkos_ENABLE_DEBUG=off \
  -DCMAKE_INSTALL_PREFIX=$PWD/$bdir/install
cmake --build $bdir -j 24 --target install

bdir=buildKokkosKernelsSerial
cmake -S kokkos-kernels -B $bdir \
  -DCMAKE_CXX_COMPILER=g++ \
  -DKokkos_ROOT=buildKokkosSerial/install \
  -DCMAKE_INSTALL_PREFIX=$bdir/install
cmake --build $bdir -j 24 --target install

valgrind errors

==3560139== Memcheck, a memory error detector
==3560139== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==3560139== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==3560139== Command: ./QRTests
==3560139==
==3560139== Invalid read of size 8
==3560139==    at 0x413BF5: int KokkosBatched::SerialLeftHouseholderInternal::invoke<double>(int, double*, double*, int, double*) (KokkosBatched_Householder_Serial_Internal.hpp:48)
==3560139==    by 0x4129B9: int KokkosBatched::SerialQR_Internal::invoke<double>(int, int, double*, int, int, double*, int, double*) (KokkosBatched_QR_Serial_Internal.hpp:68)
==3560139==    by 0x4118B0: int KokkosBatched::SerialQR<KokkosBlas::Algo::Level3::Unblocked>::invoke<Kokkos::View<double [16][10]>, Kokkos::View<double [10]>, Kokkos::View<double [10]> >(Kokkos::View<double [16][10]> const&, Kokkos::View<double [10]> const&, Kokkos::View<double [10]> const&) (KokkosBatched_QR_Serial_Impl.hpp:34)
==3560139==    by 0x40F5DF: testQR()::{lambda(int)#1}::operator()(int) const (testQR.cpp:21)
==3560139==    by 0x410015: std::enable_if<is_void_v<void>, void>::type Kokkos::Impl::ParallelFor<testQR()::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Serial>, Kokkos::Serial>::exec<void>() const (Kokkos_Serial_Parallel_Range.hpp:37)
==3560139==    by 0x40FEBF: Kokkos::Impl::ParallelFor<testQR()::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Serial>, Kokkos::Serial>::execute() const (Kokkos_Serial_Parallel_Range.hpp:56)
==3560139==    by 0x40FC85: void Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::Serial>, testQR()::{lambda(int)#1}, void>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Kokkos::RangePolicy<Kokkos::Serial> const&, testQR()::{lambda(int)#1} const&) (Kokkos_Parallel.hpp:146)
==3560139==    by 0x40FA98: void Kokkos::parallel_for<testQR()::{lambda(int)#1}>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, testQR()::{lambda(int)#1} const&) (Kokkos_Parallel.hpp:167)
==3560139==    by 0x40F786: testQR() (testQR.cpp:17)
==3560139==    by 0x40F8C2: main (testQR.cpp:27)
==3560139==  Address 0x8991400 is 0 bytes after a block of size 1,408 alloc'd
==3560139==    at 0x484615B: operator new(unsigned long, std::align_val_t, std::nothrow_t const&) (vg_replace_malloc.c:663)
==3560139==    by 0x5C8C0A9: Kokkos::HostSpace::impl_allocate(char const*, unsigned long, unsigned long, Kokkos_Profiling_SpaceHandle) const (Kokkos_HostSpace.cpp:79)
==3560139==    by 0x5C8C284: Kokkos::HostSpace::allocate(char const*, unsigned long, unsigned long) const (Kokkos_HostSpace.cpp:58)
==3560139==    by 0x5C8C767: checked_allocation_with_header<Kokkos::HostSpace> (Kokkos_SharedAlloc.hpp:203)
==3560139==    by 0x5C8C767: Kokkos::Impl::SharedAllocationRecordCommon<Kokkos::HostSpace>::SharedAllocationRecordCommon(Kokkos::HostSpace const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, void (*)(Kokkos::Impl::SharedAllocationRecord<void, void>*)) (Kokkos_SharedAlloc_timpl.hpp:62)
==3560139==    by 0x4150F8: Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, void>::SharedAllocationRecordCommon(Kokkos::HostSpace const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, void (*)(Kokkos::Impl::SharedAllocationRecord<void, void>*)) (Kokkos_HostSpace.hpp:178)
==3560139==    by 0x41513E: Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, double> >::SharedAllocationRecord(Kokkos::HostSpace const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long) (Kokkos_SharedAlloc.hpp:419)
==3560139==    by 0x4144C9: Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, double> >::allocate(Kokkos::HostSpace const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long) (Kokkos_SharedAlloc.hpp:434)
==3560139==    by 0x4132B8: Kokkos::Impl::SharedAllocationRecord<void, void>* Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<double [16][10]>, void>::allocate_shared<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::HostSpace, Kokkos::Serial>(Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<cha
==3560139==    by 0x4124F2: Kokkos::View<double [16][10]>::View<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&, std::enable_if<!Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char
==3560139==    by 0x4116EC: Kokkos::View<double [16][10]>::View<char [2]>(char const (&) [2], std::enable_if<Kokkos::Impl::is_view_label<char [2]>::value, unsigned long const>::type, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long) (Kokkos_ViewLegacy.hpp:1158)
==3560139==    by 0x40F666: testQR() (testQR.cpp:10)
==3560139==    by 0x40F8C2: main (testQR.cpp:27)
==3560139==
==3560139== Invalid write of size 8
==3560139==    at 0x413C62: int KokkosBatched::SerialLeftHouseholderInternal::invoke<double>(int, double*, double*, int, double*) (KokkosBatched_Householder_Serial_Internal.hpp:55)
==3560139==    by 0x4129B9: int KokkosBatched::SerialQR_Internal::invoke<double>(int, int, double*, int, int, double*, int, double*) (KokkosBatched_QR_Serial_Internal.hpp:68)
==3560139==    by 0x4118B0: int KokkosBatched::SerialQR<KokkosBlas::Algo::Level3::Unblocked>::invoke<Kokkos::View<double [16][10]>, Kokkos::View<double [10]>, Kokkos::View<double [10]> >(Kokkos::View<double [16][10]> const&, Kokkos::View<double [10]> const&, Kokkos::View<double [10]> const&) (KokkosBatched_QR_Serial_Impl.hpp:34)
==3560139==    by 0x40F5DF: testQR()::{lambda(int)#1}::operator()(int) const (testQR.cpp:21)
==3560139==    by 0x410015: std::enable_if<is_void_v<void>, void>::type Kokkos::Impl::ParallelFor<testQR()::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Serial>, Kokkos::Serial>::exec<void>() const (Kokkos_Serial_Parallel_Range.hpp:37)
==3560139==    by 0x40FEBF: Kokkos::Impl::ParallelFor<testQR()::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Serial>, Kokkos::Serial>::execute() const (Kokkos_Serial_Parallel_Range.hpp:56)
==3560139==    by 0x40FC85: void Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::Serial>, testQR()::{lambda(int)#1}, void>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Kokkos::RangePolicy<Kokkos::Serial> const&, testQR()::{lambda(int)#1} const&) (Kokkos_Parallel.hpp:146)
==3560139==    by 0x40FA98: void Kokkos::parallel_for<testQR()::{lambda(int)#1}>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, testQR()::{lambda(int)#1} const&) (Kokkos_Parallel.hpp:167)
==3560139==    by 0x40F786: testQR() (testQR.cpp:17)
==3560139==    by 0x40F8C2: main (testQR.cpp:27)
==3560139==  Address 0x8991890 is 0 bytes after a block of size 208 alloc'd
==3560139==    at 0x484615B: operator new(unsigned long, std::align_val_t, std::nothrow_t const&) (vg_replace_malloc.c:663)
==3560139==    by 0x5C8C0A9: Kokkos::HostSpace::impl_allocate(char const*, unsigned long, unsigned long, Kokkos_Profiling_SpaceHandle) const (Kokkos_HostSpace.cpp:79)
==3560139==    by 0x5C8C284: Kokkos::HostSpace::allocate(char const*, unsigned long, unsigned long) const (Kokkos_HostSpace.cpp:58)
==3560139==    by 0x5C8C767: checked_allocation_with_header<Kokkos::HostSpace> (Kokkos_SharedAlloc.hpp:203)
==3560139==    by 0x5C8C767: Kokkos::Impl::SharedAllocationRecordCommon<Kokkos::HostSpace>::SharedAllocationRecordCommon(Kokkos::HostSpace const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, void (*)(Kokkos::Impl::SharedAllocationRecord<void, void>*)) (Kokkos_SharedAlloc_timpl.hpp:62)
==3560139==    by 0x4150F8: Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, void>::SharedAllocationRecordCommon(Kokkos::HostSpace const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, void (*)(Kokkos::Impl::SharedAllocationRecord<void, void>*)) (Kokkos_HostSpace.hpp:178)
==3560139==    by 0x41513E: Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, double> >::SharedAllocationRecord(Kokkos::HostSpace const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long) (Kokkos_SharedAlloc.hpp:419)
==3560139==    by 0x4144C9: Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, double> >::allocate(Kokkos::HostSpace const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long) (Kokkos_SharedAlloc.hpp:434)
==3560139==    by 0x413584: Kokkos::Impl::SharedAllocationRecord<void, void>* Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<double [10]>, void>::allocate_shared<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::HostSpace, Kokkos::Serial>(Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >
==3560139==    by 0x4126C6: Kokkos::View<double [10]>::View<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&, std::enable_if<!Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >
==3560139==    by 0x4117B8: Kokkos::View<double [10]>::View<char [2]>(char const (&) [2], std::enable_if<Kokkos::Impl::is_view_label<char [2]>::value, unsigned long const>::type, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long) (Kokkos_ViewLegacy.hpp:1158)
==3560139==    by 0x40F6A2: testQR() (testQR.cpp:11)
==3560139==    by 0x40F8C2: main (testQR.cpp:27)

... snip

lucbv commented 2 months ago

Thanks for reporting, we will have a look at this

cwsmith commented 2 months ago

Hi @lucbv. Thanks for looking into this.

It seems like the problem is related to handling rectangular matrices. If the MatrixViewType is defined to be 10x10:

typedef Kokkos::View<double[10][10]> MatrixViewType;

there are no errors under valgrind.

Digging into the code a bit, but without a full understanding of it, I see this loop over matrix rows:

https://github.com/kokkos/kokkos-kernels/blob/2c4dd7e2bd3f19344ab6aa6e851d8adf1ffd3897/batched/dense/impl/KokkosBatched_QR_Serial_Internal.hpp#L56-L60

It successively removes one row and one column to form the 3x3 partitioned matrix A_part3x3 via the call A_part3x3.partWithABR(A_part2x2, 1, 1);.

In the original case of the 16x10 matrix, running valgrind with the gdbserver I see that the first invalid read occurs in SerialLeftHouseholderInternal::invoke(...) when m_atl=10 in SerialQR_Internal::invoke(...). This seems to make sense as all 10 columns have been removed.

The test case in the repo for QR appears to only run with square matrices:

test without column pivoting: https://github.com/kokkos/kokkos-kernels/blob/2c4dd7e2bd3f19344ab6aa6e851d8adf1ffd3897/batched/dense/unit_test/Test_Batched_TeamVectorQR.hpp#L111

test 'WithColumnPivoting': https://github.com/kokkos/kokkos-kernels/blob/2c4dd7e2bd3f19344ab6aa6e851d8adf1ffd3897/batched/dense/unit_test/Test_Batched_TeamVectorQR_WithColumnPivoting.hpp#L121
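To illustrate why the loop bound matters for a tall matrix, here is a minimal standalone sketch of an unblocked Householder QR in plain C++. It is not the kokkos-kernels implementation: the function name householder_qr, the column-major std::vector storage, and the LAPACK-style reflector convention (H = I - tau v v^T with an implicit leading 1 in v) are assumptions made for illustration. The point is that an m x n matrix with m > n only has min(m, n) columns to reduce, so a loop that keeps going past column n-1 would touch memory outside the allocation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper (not part of kokkos-kernels): unblocked Householder QR
// of an m x n matrix stored column-major with leading dimension m.
// On exit, R is in the upper triangle of A, the reflector vectors are stored
// below the diagonal, and tau holds one scaling factor per reflector.
void householder_qr(int m, int n, std::vector<double>& A,
                    std::vector<double>& tau) {
  assert(static_cast<int>(A.size()) == m * n);
  const int k = std::min(m, n);  // only min(m, n) columns can be reduced
  tau.assign(k, 0.0);
  for (int j = 0; j < k; ++j) {  // looping to m here would index column n
    double* col = &A[j + j * m]; // points at A(j, j)
    const int len = m - j;       // entries from the diagonal downwards
    double sigma = 0.0;          // squared norm of the part below the diagonal
    for (int i = 1; i < len; ++i) sigma += col[i] * col[i];
    if (sigma == 0.0) continue;  // nothing to eliminate, reflector is identity
    const double alpha = col[0];
    const double norm = std::sqrt(alpha * alpha + sigma);
    const double beta = (alpha >= 0.0) ? -norm : norm;
    tau[j] = (beta - alpha) / beta;
    const double scale = 1.0 / (alpha - beta);
    for (int i = 1; i < len; ++i) col[i] *= scale;  // store v below diagonal
    col[0] = beta;                                  // diagonal entry of R
    // Apply H = I - tau * v * v^T to the trailing columns.
    for (int c = j + 1; c < n; ++c) {
      double* a = &A[j + c * m];
      double w = a[0];
      for (int i = 1; i < len; ++i) w += col[i] * a[i];
      w *= tau[j];
      a[0] -= w;
      for (int i = 1; i < len; ++i) a[i] -= w * col[i];
    }
  }
}
```

For the 16x10 case from the reproducer, this sketch would produce 10 reflectors and never form a pointer to a column with index 10.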

lucbv commented 2 months ago

Okay, thanks for digging into this. I will run the code in valgrind / gdb as well, and hopefully I can reproduce the errors and report my observations. The algorithm does partition the matrix to perform some of its operations, but it should still work for rectangular matrices. Once I find something promising I will let you know about it : )

lucbv commented 1 month ago

The PR above, #2342, has a fix for the rectangular matrices and introduces more tests for the Serial QR feature. The tests are not fully implemented yet but the fix seems to be okay if you want to give it a try.

cwsmith commented 1 month ago

This is great. Thank you @lucbv.

Using the PR branch (https://github.com/kokkos/kokkos-kernels/pull/2342/commits/9121f0a5bc697cfa55a68998e037a895a460677c) I ran the reproducer under valgrind again and the SerialQR errors are gone.

Running the expanded version of the reproducer: https://github.com/SCOREC/meshFields/blob/20a68919b338003ff8792ce7d2cc6c5df3f13613/test/testQR.cpp under valgrind reports an invalid read in the call to SerialTrsv. Some additional details on the first invalid read and the valgrind log are below.

stack at first Trsv invalid read

Running under gdb reports the following values of variables at the point of the first reported invalid read.

Reading symbols from /opt/scorec/spack/rhel9/v0201_4/install/linux-rhel9-x86_64/gcc-12.3.0/libiconv-1.17-oylnknwv5m57zkfjde7op6ne3pqdkhxg/lib/libiconv.so.2...
0x0000000000427b8f in KokkosBatched::SerialTrsvInternalUpper<KokkosBlas::Algo::Level2::Unblocked>::invoke<double, double> (use_unit_diag=false, m=16, alpha=1, A=0x8991f00, as0=1, as1=16, b=0x8993940, bs0=1)
    at /space/cwsmith/meshFields/buildKokkosKernelsSerial/install/include/KokkosBatched_Trsv_Serial_Internal.hpp:161
161           if (!use_unit_diag) *beta1 = *beta1 / A[p * as0 + p * as1];
(ins)(gdb) where
#0  0x0000000000427b8f in KokkosBatched::SerialTrsvInternalUpper<KokkosBlas::Algo::Level2::Unblocked>::invoke<double, double> (use_unit_diag=false, m=16, alpha=1, A=0x8991f00, as0=1, as1=16, b=0x8993940, bs0=1)
    at /space/cwsmith/meshFields/buildKokkosKernelsSerial/install/include/KokkosBatched_Trsv_Serial_Internal.hpp:161
#1  0x0000000000423408 in KokkosBatched::SerialTrsv<KokkosBatched::Uplo::Upper, KokkosBlas::Trans::NoTranspose, KokkosBatched::Diag::NonUnit, KokkosBlas::Algo::Level2::Unblocked>::invoke<double, Kokkos::View<double [16][10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >, Kokkos::View<double [16], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> > > (alpha=1, A=..., b=...)
    at /space/cwsmith/meshFields/buildKokkosKernelsSerial/install/include/KokkosBatched_Trsv_Serial_Impl.hpp:191
#2  0x000000000041a7c8 in operator() (__closure=0x1ffefef0d0)
    at /space/cwsmith/meshFields/meshFields/test/testQR.cpp:127
#3  0x000000000041bf36 in Kokkos::Impl::ParallelFor<testSolveQR()::<lambda(int)>, Kokkos::RangePolicy<Kokkos::Serial>, Kokkos::Serial>::exec<void>(void) const (this=0x1ffefef0d0)
    at /space/cwsmith/meshFields/buildKokkosSerial/install/include/Serial/Kokkos_Serial_Parallel_Range.hpp:37
#4  0x000000000041bb86 in Kokkos::Impl::ParallelFor<testSolveQR()::<lambda(int)>, Kokkos::RangePolicy<Kokkos::Serial>, Kokkos::Serial>::execute(void) const (this=0x1ffefef0d0)
    at /space/cwsmith/meshFields/buildKokkosSerial/install/include/Serial/Kokkos_Serial_Parallel_Range.hpp:56
#5  0x000000000041b71e in Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::Serial>, testSolveQR()::<lambda(int)> >(const std::string &, const Kokkos::RangePolicy<Kokkos::Serial> &, const struct {...} &) (str=..., policy=..., functor=...)
    at /space/cwsmith/meshFields/buildKokkosSerial/install/include/Kokkos_Parallel.hpp:146
#6  0x000000000041b4c3 in Kokkos::parallel_for<testSolveQR()::<lambda(int)> >(const std::string &, size_t, const struct {...} &) (str=..., work_count=1, functor=...)
    at /space/cwsmith/meshFields/buildKokkosSerial/install/include/Kokkos_Parallel.hpp:167
#7  0x000000000041afbc in testSolveQR () at /space/cwsmith/meshFields/meshFields/test/testQR.cpp:113
#8  0x000000000041b2ed in main (argc=1, argv=0x1ffefef608) at /space/cwsmith/meshFields/meshFields/test/testQR.cpp:140
(ins)(gdb) p p
$1 = 15
(ins)(gdb) p m
$2 = 16

Given this loop from p=m-1:0 https://github.com/kokkos/kokkos-kernels/blob/b2210058826672c8de838541a36f7b946ecbb79a/batched/dense/impl/KokkosBatched_Trsv_Serial_Internal.hpp#L152 and the use of p twice to compute the index into A here (where the invalid read occurs): https://github.com/kokkos/kokkos-kernels/blob/b2210058826672c8de838541a36f7b946ecbb79a/batched/dense/impl/KokkosBatched_Trsv_Serial_Internal.hpp#L161 , it looks like the assumption that A is square is made here as well.
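The index arithmetic can be checked by hand using the values gdb printed above (p=15, as0=1, as1=16) and the 16x10 shape of A. The helper names diag_offset and diag_exists below are hypothetical and only spell out that arithmetic; they are not kokkos-kernels functions.

```cpp
// Linear offset of the diagonal entry A(p, p) for strides (as0, as1),
// matching the expression A[p * as0 + p * as1] at the faulting line.
constexpr long diag_offset(int p, int as0, int as1) {
  return static_cast<long>(p) * as0 + static_cast<long>(p) * as1;
}

// A(p, p) only exists if p is a valid row index and a valid column index.
constexpr bool diag_exists(int p, int rows, int cols) {
  return p >= 0 && p < rows && p < cols;
}

// A 16x10 View<double> holds 16 * 10 = 160 elements (offsets 0..159).
static_assert(diag_offset(15, 1, 16) == 255, "first access at p = m - 1");
static_assert(diag_offset(15, 1, 16) >= 16 * 10, "255 is past the 160 elements");
static_assert(!diag_exists(15, 16, 10), "column 15 does not exist");
static_assert(diag_exists(9, 16, 10), "last valid diagonal entry is A(9, 9)");
static_assert(diag_offset(9, 1, 16) == 153, "and it lies inside the allocation");
```

So starting the loop at p = m-1 = 15 reads 95 elements past the end of A, while starting at p = 9 would stay within the matrix.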

valgrind log

==265289== Memcheck, a memory error detector
==265289== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al. 
==265289== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==265289== Command: ./QRTests
==265289== 
==265289== Invalid read of size 8
==265289==    at 0x427B8F: int KokkosBatched::SerialTrsvInternalUpper<KokkosBlas::Algo::Level2::Unblocked>::invoke<double, double>(bool, int, double, double const*, int, int, double*, int) (KokkosBatched_Trsv_Serial_Internal.hpp:161)
==265289==    by 0x423407: int KokkosBatched::SerialTrsv<KokkosBatched::Uplo::Upper, KokkosBlas::Trans::NoTranspose, KokkosBatched::Diag::NonUnit, KokkosBlas::Algo::Level2::Unblocked>::invoke<double, Kokkos::View<double [16][10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >, Kokkos::View<double [16], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> > >(double, Kokkos::View<double [16][10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> > const&, Kokkos::View<double [16], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> > const&) (KokkosBatched_Trsv_Serial_Impl.hpp:191)
==265289==    by 0x41A7C7: testSolveQR()::{lambda(int)#1}::operator()(int) const (testQR.cpp:127)
==265289==    by 0x41BF35: std::enable_if<is_void_v<void>, void>::type Kokkos::Impl::ParallelFor<testSolveQR()::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Serial>, Kokkos::Serial>::exec<void>() const (Kokkos_Serial_Parallel_Range.hpp:37)
==265289==    by 0x41BB85: Kokkos::Impl::ParallelFor<testSolveQR()::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Serial>, Kokkos::Serial>::execute() const (Kokkos_Serial_Parallel_Range.hpp:56)
==265289==    by 0x41B71D: void Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::Serial>, testSolveQR()::{lambda(int)#1}, void>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Kokkos::RangePolicy<Kokkos::Serial> const&, testSolveQR()::{lambda(int)#1} const&) (Kokkos_Parallel.hpp:146)
==265289==    by 0x41B4C2: void Kokkos::parallel_for<testSolveQR()::{lambda(int)#1}>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, testSolveQR()::{lambda(int)#1} const&) (Kokkos_Parallel.hpp:167)
==265289==    by 0x41AFBB: testSolveQR() (testQR.cpp:113)
==265289==    by 0x41B2EC: main (testQR.cpp:140)
==265289==  Address 0x89926f8 is 16 bytes after a block of size 40 free'd
==265289==    at 0x484893D: operator delete(void*, unsigned long) (vg_replace_malloc.c:1101)
==265289==    by 0x423D0C: Kokkos::Impl::HostSharedPtr<Kokkos::Impl::SerialInternal>::cleanup() (Kokkos_HostSharedPtr.hpp:120)
==265289==    by 0x41DDA1: Kokkos::Impl::HostSharedPtr<Kokkos::Impl::SerialInternal>::~HostSharedPtr() (Kokkos_HostSharedPtr.hpp:92)
==265289==    by 0x41CC4F: Kokkos::Serial::~Serial() (Kokkos_Serial.hpp:95)
==265289==    by 0x423E87: Kokkos::Impl::ViewCtorProp<void, Kokkos::Serial>::~ViewCtorProp() (Kokkos_ViewCtor.hpp:122)
==265289==    by 0x423EA7: Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Kokkos::HostSpace, Kokkos::Serial>::~ViewCtorProp() (Kokkos_ViewCtor.hpp:182)
==265289==    by 0x425342: Kokkos::View<double [10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >::View<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&, std::enable_if<!Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >::has_pointer, Kokkos::LayoutLeft>::type const&) (Kokkos_ViewLegacy.hpp:1048)
==265289==    by 0x41F9BC: Kokkos::View<double [10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >::View<char [3]>(char const (&) [3], std::enable_if<Kokkos::Impl::is_view_label<char [3]>::value, unsigned long const>::type, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long) (Kokkos_ViewLegacy.hpp:1158)
==265289==    by 0x41AC8F: testSolveQR() (testQR.cpp:89)
==265289==    by 0x41B2EC: main (testQR.cpp:140)
==265289==  Block was alloc'd at
==265289==    at 0x4844F95: operator new(unsigned long) (vg_replace_malloc.c:483)
==265289==    by 0x5C94902: HostSharedPtr<Kokkos::Serial::Serial()::<lambda(Kokkos::Impl::SerialInternal*)> > (Kokkos_HostSharedPtr.hpp:47)
==265289==    by 0x5C94902: Kokkos::Serial::Serial() (Kokkos_Serial.cpp:160)
==265289==    by 0x4251A9: Kokkos::View<double [10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >::View<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&, std::enable_if<!Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >::has_pointer, Kokkos::LayoutLeft>::type const&) (Kokkos_ViewLegacy.hpp:1002)
==265289==    by 0x41F9BC: Kokkos::View<double [10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >::View<char [3]>(char const (&) [3], std::enable_if<Kokkos::Impl::is_view_label<char [3]>::value, unsigned long const>::type, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long) (Kokkos_ViewLegacy.hpp:1158)
==265289==    by 0x41AC8F: testSolveQR() (testQR.cpp:89)
==265289==    by 0x41B2EC: main (testQR.cpp:140)
==265289== 
==265289== Invalid read of size 8
==265289==    at 0x427BDC: int KokkosBatched::SerialTrsvInternalUpper<KokkosBlas::Algo::Level2::Unblocked>::invoke<double, double>(bool, int, double, double const*, int, int, double*, int) (KokkosBatched_Trsv_Serial_Internal.hpp:163)
==265289==    by 0x423407: int KokkosBatched::SerialTrsv<KokkosBatched::Uplo::Upper, KokkosBlas::Trans::NoTranspose, KokkosBatched::Diag::NonUnit, KokkosBlas::Algo::Level2::Unblocked>::invoke<double, Kokkos::View<double [16][10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >, Kokkos::View<double [16], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> > >(double, Kokkos::View<double [16][10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> > const&, Kokkos::View<double [16], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> > const&) (KokkosBatched_Trsv_Serial_Impl.hpp:191)
==265289==    by 0x41A7C7: testSolveQR()::{lambda(int)#1}::operator()(int) const (testQR.cpp:127)
==265289==    by 0x41BF35: std::enable_if<is_void_v<void>, void>::type Kokkos::Impl::ParallelFor<testSolveQR()::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Serial>, Kokkos::Serial>::exec<void>() const (Kokkos_Serial_Parallel_Range.hpp:37)
==265289==    by 0x41BB85: Kokkos::Impl::ParallelFor<testSolveQR()::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Serial>, Kokkos::Serial>::execute() const (Kokkos_Serial_Parallel_Range.hpp:56)
==265289==    by 0x41B71D: void Kokkos::parallel_for<Kokkos::RangePolicy<Kokkos::Serial>, testSolveQR()::{lambda(int)#1}, void>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Kokkos::RangePolicy<Kokkos::Serial> const&, testSolveQR()::{lambda(int)#1} const&) (Kokkos_Parallel.hpp:146)
==265289==    by 0x41B4C2: void Kokkos::parallel_for<testSolveQR()::{lambda(int)#1}>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, testSolveQR()::{lambda(int)#1} const&) (Kokkos_Parallel.hpp:167)
==265289==    by 0x41AFBB: testSolveQR() (testQR.cpp:113)
==265289==    by 0x41B2EC: main (testQR.cpp:140)
==265289==  Address 0x8992680 is 6 bytes after a block of size 74 free'd
==265289==    at 0x484893D: operator delete(void*, unsigned long) (vg_replace_malloc.c:1101)
==265289==    by 0x5C94C5D: deallocate (new_allocator.h:158)
==265289==    by 0x5C94C5D: deallocate (alloc_traits.h:496)
==265289==    by 0x5C94C5D: _M_destroy (basic_string.h:300)
==265289==    by 0x5C94C5D: _M_dispose (basic_string.h:294)
==265289==    by 0x5C94C5D: ~basic_string (basic_string.h:803)
==265289==    by 0x5C94C5D: profile_fence_event<Kokkos::Serial, Kokkos::Serial::impl_static_fence(const std::string&)::<lambda()> > (Kokkos_Profiling.hpp:219)
==265289==    by 0x5C94C5D: impl_static_fence (Kokkos_Serial.hpp:147)
==265289==    by 0x5C94C5D: Kokkos::Impl::ExecSpaceDerived<Kokkos::Serial>::static_fence(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (Kokkos_ExecSpaceManager.hpp:131)
==265289==    by 0x5C85684: Kokkos::Impl::ExecSpaceManager::static_fence(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (Kokkos_Core.cpp:243)
==265289==    by 0x41E8BB: void Kokkos::deep_copy<double [16][10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, double [16][10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks>(Kokkos::View<double [16][10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> > const&, Kokkos::View<double [16][10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks> const&, std::enable_if<((is_void_v<Kokkos::ViewTraits<double [16][10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >::specialize>)&&(is_void_v<Kokkos::ViewTraits<double [16][10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks>::specialize>))&&((((unsigned int)Kokkos::ViewTraits<double [16][10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >::rank)!=(0))||(((unsigned int)Kokkos::ViewTraits<double [16][10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks>::rank)!=(0))), void>::type*) (Kokkos_CopyViews.hpp:1709)
==265289==    by 0x41AC57: testSolveQR() (testQR.cpp:87)
==265289==    by 0x41B2EC: main (testQR.cpp:140)
==265289==  Block was alloc'd at
==265289==    at 0x4844F95: operator new(unsigned long) (vg_replace_malloc.c:483)
==265289==    by 0x5C943FB: void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>(char*, char*, std::forward_iterator_tag) [clone .isra.0] (basic_string.tcc:225)
==265289==    by 0x5C94C36: basic_string (basic_string.h:552)
==265289==    by 0x5C94C36: profile_fence_event<Kokkos::Serial, Kokkos::Serial::impl_static_fence(const std::string&)::<lambda()> > (Kokkos_Profiling.hpp:219)
==265289==    by 0x5C94C36: impl_static_fence (Kokkos_Serial.hpp:147)
==265289==    by 0x5C94C36: Kokkos::Impl::ExecSpaceDerived<Kokkos::Serial>::static_fence(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (Kokkos_ExecSpaceManager.hpp:131)
==265289==    by 0x5C85684: Kokkos::Impl::ExecSpaceManager::static_fence(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (Kokkos_Core.cpp:243)
==265289==    by 0x41E8BB: void Kokkos::deep_copy<double [16][10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, double [16][10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks>(Kokkos::View<double [16][10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> > const&, Kokkos::View<double [16][10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks> const&, std::enable_if<((is_void_v<Kokkos::ViewTraits<double [16][10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >::specialize>)&&(is_void_v<Kokkos::ViewTraits<double [16][10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks>::specialize>))&&((((unsigned int)Kokkos::ViewTraits<double [16][10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >::rank)!=(0))||(((unsigned int)Kokkos::ViewTraits<double [16][10], Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::Experimental::EmptyViewHooks>::rank)!=(0))), void>::type*) (Kokkos_CopyViews.hpp:1709)
==265289==    by 0x41AC57: testSolveQR() (testQR.cpp:87)
==265289==    by 0x41B2EC: main (testQR.cpp:140)
==265289== 
==265289== 
==265289== HEAP SUMMARY:
==265289==     in use at exit: 176 bytes in 1 blocks
==265289==   total heap usage: 230 allocs, 229 frees, 115,919 bytes allocated
==265289== 
==265289== LEAK SUMMARY:
==265289==    definitely lost: 0 bytes in 0 blocks
==265289==    indirectly lost: 0 bytes in 0 blocks
==265289==      possibly lost: 0 bytes in 0 blocks
==265289==    still reachable: 176 bytes in 1 blocks
==265289==         suppressed: 0 bytes in 0 blocks
==265289== Rerun with --leak-check=full to see details of leaked memory
==265289== 
==265289== For lists of detected and suppressed errors, rerun with: -s
==265289== ERROR SUMMARY: 80 errors from 2 contexts (suppressed: 0 from 0)

lucbv commented 1 month ago

Okay, I will try to wrap up the PR and get that tested and merged, then I can move on to trsv. Hopefully it's not more complicated than the QR fix, but writing proper tests is what takes time!

lucbv commented 1 month ago

So I have not looked at it in detail, but my guess is that we are assuming the triangular matrix to be stored in a square matrix of size m x m. Since yours is coming from the QR factorization of a rectangular matrix, we need to fix the code so that it works for an m x n input matrix; basically we will ignore the non-square part of the input. Should not be too bad hopefully. I will create a PR once I have confirmed that this is the issue and we have a fix for you...
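As a rough illustration of "ignore the non-square part of the input", the sketch below does a back substitution that only touches the leading k x k upper triangle of a rows x cols matrix, with k = min(rows, cols). This is plain C++ written for this thread, not the eventual kokkos-kernels fix; the function name upper_trsv_leading_block and its argument order are assumptions, and the strides follow the as0/as1/bs0 naming seen in the gdb output.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not part of kokkos-kernels): solve R x = b in place,
// where R is the leading k x k upper triangle of a rows x cols matrix A and
// k = min(rows, cols). Entries of A and b outside that block are not touched.
void upper_trsv_leading_block(int rows, int cols, const double* A, int as0,
                              int as1, double* b, int bs0) {
  const int k = std::min(rows, cols);
  for (int p = k - 1; p >= 0; --p) {  // start at k - 1, not rows - 1
    double sum = b[p * bs0];
    for (int j = p + 1; j < k; ++j) sum -= A[p * as0 + j * as1] * b[j * bs0];
    const double diag = A[p * as0 + p * as1];
    assert(diag != 0.0);  // non-unit diagonal must be invertible
    b[p * bs0] = sum / diag;
  }
}
```

With a 16x10 LayoutLeft matrix (as0=1, as1=16) this would run p from 9 down to 0 and leave entries 10 through 15 of b unchanged.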