ralna / spral

Sparse Parallel Robust Algorithms Library
https://ralna.github.io/spral/
Other
104 stars 27 forks source link

Possible bug in the code #144

Closed bharswami closed 11 months ago

bharswami commented 11 months ago

Were all the tests provided run successfully? I am facing some access violation errors in running some of the tests. Also some torture tests didn't go through. Please clarify.

jfowkes commented 11 months ago

Yes all SPRAL tests pass fine on Linux and Mac, please see our Github Actions logs here: https://github.com/ralna/spral/actions/runs/6110469212/job/16583533866 https://github.com/ralna/spral/actions/runs/6110469213/job/16583533210

jfowkes commented 11 months ago

@bharswami please try with the latest version from the master branch where we have fixed the ldlt_app issues #140

bharswami commented 11 months ago

Thanks. Will check it out.

amontoison commented 11 months ago

@bharswami I recompiled the latest version from the master branch for Windows. Can you check if the tests are working? They are in the bin folder. SPRAL.v2023.9.23.x86_64-w64-mingw32-libgfortran5.tar.gz

bharswami commented 11 months ago

I replaced ldlt_app.cpp in my PC. All tests passed except this one. Don't know if its a bug

ASSERT_EQ(mv1, mv2) failed
(test_maxloc_torture<double, 128>(10000)) [fail]
jfowkes commented 11 months ago

Thanks @bharswami, we'll see if we can reproduce the maxloc_torture test failure in the block_ldlt kernel.

bharswami commented 11 months ago

Has anybody been able to sort this bug out?

jfowkes commented 11 months ago

We have been unable to reproduce this so far, once we have the meson build system up and running (#141) we will do some thorough testing on Windows machines.

jfowkes commented 11 months ago

@bharswami we have been able to reproduce the LDLT test failure you were seeing.

You can see from the logs for Windows here: https://github.com/ralna/spral/suites/16773345740/artifacts/957859756

Test failure at ../tests/ssids/kernels/block_ldlt.cxx:221
ASSERT_EQ(rloc1, rloc2) failed
[(test_maxloc_torture<double, 128>(10000))[fail]]
jfowkes commented 11 months ago

The test that's failing here is a relatively simple one, it compares the output of a simple find_maxloc function vs the LDLT internal one using AVX instructions:

/* Call both simple and avx versions and check they get the same answer */ 
T mv1, mv2;
int rloc1, cloc1, rloc2, cloc2;

find_maxloc_simple<T, BLOCK_SIZE>(from, a, lda, mv1, rloc1, cloc1);
block_ldlt_internal::find_maxloc<T, BLOCK_SIZE>(from, a, lda, mv2, rloc2, cloc2);

// Compare them
ASSERT_LE(fabs(mv1), 1.0); // If this is wrong, find_maxloc_simple is wrong
ASSERT_EQ(mv1, mv2);
ASSERT_EQ(rloc1, rloc2);
ASSERT_EQ(cloc1, cloc2);

return 0; // Success

The failure suggests that we probably handle the AVX call incorrectly on Windows.

jfowkes commented 11 months ago

@mjacobse do you also think that this test failure is potentially an AVX issue on Windows?

mjacobse commented 11 months ago

I don't think so. I think the block_ldlt_internal::find_maxloc function actually works as intended and that instead the issue is the test itself.

The test generates a test matrix with pseudo-random values using the old C rand function, which can generate only RAND_MAX distinct values. On Linux, RAND_MAX appears to be 2147483647 (https://godbolt.org/z/z9vrdo3ze), while on Windows it's only 32767 (https://godbolt.org/z/Ez7zWGa9e). So with a large enough BLOCK_SIZE, i.e. enough entries in the matrix, the chance of getting a duplicate entry (or even just with same absolute value) is quite high on Windows. Indeed, that's what I obverse when reproducing the failure locally.

The issue then is that in looking for the maximum absolute value the two functions find_maxloc_simple and block_ldlt_internal::find_maxloc essentially walk through the matrix in a different order. Both in some sense go column-wise, but with the SIMD code, block_ldlt_internal::find_maxloc does not look at one row at a time in sequence. Instead it looks at several rows simultaneously and stores the current maximum for each one seperately. So for example assuming two rows at a time, it should come down to walking through a 4x3 matrix (for simplicity rectangular) in this order:

0 2  4 
6 8 10
1 3  5
7 9 11

Due to that difference in order, the two functions may find different maximum entries earlier in case of duplicate entries. And as they stick with the first one they found, the resulting location will be different (even the value may be different if the sign of the value differs).

So basically the test is too brittle in assuming that both functions will find the same entry first in case of duplicate values. Options I can think of to fix this:

jfowkes commented 11 months ago

Many thanks @mjacobse, this poorly written test would also explain why we occasionally see a random test failure on the CI so I am against just simply switching to a better RNG.

I suggest that we implement both your first and last suggestions: there is no excuse to still be using the old C rand function so let's upgrade that to std::uniform_real_distribution and also check for duplicate entries in the test first and if any are found then drop the order behaviour tests (this should be a relatively rare occurrence but will prevent the test randomly failing as they seem to occasionally).

Would you have time to create a PR on this? If not I can do this.

mjacobse commented 11 months ago

I implemented the second part in #155. For the first part I wasn't quite sure what to do with the seed and if we should apply it for the other tests that use rand() as well so I left it as is for now.