ralna / spral

Sparse Parallel Robust Algorithms Library
https://ralna.github.io/spral/
Other
104 stars 27 forks source link

Error while running tests (ldlt_app.cpp) #140

Closed bharswami closed 11 months ago

bharswami commented 11 months ago

I am getting the following index error Unhandled exception at 0x00007FFBB78B829C (ucrtbased.dll) in spraltestfull.exe: An invalid parameter was passed to a function that considers invalid parameters fatal.

if _CONTAINER_DEBUG_LEVEL > 0

    _STL_VERIFY(
        _Pos < static_cast<size_type>(_My_data._Mylast - _My_data._Myfirst), "vector subscript out of range");

endif // _CONTAINER_DEBUG_LEVEL > 0

The higher level functions are (in order, last call first)

  1. copy_failed_rect( get_nrow(nblk - 1, m, block_size), get_ncol(jblk, n, block_size), get_ncol(nblk - 1, n, block_size), cdata[jblk], &failed_rect[jfail (m - n) + (nblk - 1) block_size - n], m - n, &a[jblk block_size lda + (nblk - 1) * block_size], lda );
  2. int q1 = LDLT <T, INNER_BLOCK_SIZE, CopyBackup, use_tasks, debug> ::factor( m, n, perm, l, lda, d, backup, options, options.pivot_method, outer_block_size, 0.0, nullptr, 0, work );
  3. TEST(( ldlt_test<double, 2, false, false>(0.01, 1e-20, true, false, false, 4, 2) ));
  4. int temp; temp = run_ldlt_app_tests(); Please clarify.
mjacobse commented 11 months ago

I can confirm this. The problem is that operator[] of the std::vector failed_rect is called out of bounds. However, in this case, nothing seems to be done with the value other than taking the address. That's why this does not really cause issues in practice and why this is not found by address sanitizer or valgrind. But still, calling operator[] out of bounds is undefined behaviour, so the Microsoft STL is correctly detecting this as an error. Compiling with -D_GLIBCXX_DEBUG makes gcc's stdlibc++ reveal the error too. In fact there are a handful of these in a few places in the code, as this pattern of taking the address of operator[]'s return value is used quite frequently.

A simple fix would be to use the data() member function instead. That allows doing the pointer arithmetic without indexing into the vector with operator[]. So &failed_rect[jfail*(m-n)+(nblk-1)*block_size-n] would become failed_rect.data() + jfail*(m-n)+(nblk-1)*block_size-n.

Should we do that for the few cases flagged by -D_GLIBCXX_DEBUG? Or should we replace every case of the &something[] pattern? The ones that do not cause an error appear to only be used in-bounds, making a change unnecessary. But it's also possible that the tests just do not run into error cases which could potentially still be triggered by different inputs to the library.

jfowkes commented 11 months ago

Thank you @bharswami for reporting and @mjacobse for confirming.

I'd be tempted for now to just fix the cases flagged up by -D_GLIBCXX_DEBUG and keep the other (presumably in-bounds) uses of the &something[] pattern as is since these make the pointer arithmetic involved clearer (at least to me) and avoid a potentially very large change to the codebase that could inadvertently introduce other bugs.

@mjacobse would you be happy to make a PR that fixes the -D_GLIBCXX_DEBUG cases?

jfowkes commented 11 months ago

Fixed in #145