Possible issues when using blaze expressions

knelli2 commented 5 hours ago

I ran into an issue in #6359 where the SolveXcts KerrSchild regression test was failing, but only on some CI builds (mostly the Release builds) every time I pushed. After some debugging I found that the following diff allowed the test to pass

--- a/src/Domain/CoordinateMaps/TimeDependent/ShapeMapTransitionFunctions/SphereTransition.cpp
+++ b/src/Domain/CoordinateMaps/TimeDependent/ShapeMapTransitionFunctions/SphereTransition.cpp
@@ -88,7 +88,7 @@ template <typename T>
 T SphereTransition::call_impl(const std::array<T, 3>& source_coords) const {
   T mag = magnitude(source_coords);
   check_magnitudes(mag, false);
-  return blaze::clamp(a_ * mag + b_, 0.0, 1.0);
+  return blaze::clamp(T{a_ * mag + b_}, 0.0, 1.0);
 }

 template <typename T>

My guess is this is somehow related to blaze expressions (which a_ * mag + b_ is) because explicitly constructing a T (a DataVector) before passing to the blaze::clamp fixes the issue. However, something odd is that with and without the change, the maps and the Jacobians were identical to roundoff.

@nikwit Had the same test fail in the same way when he edited the shape map in #6227 (but only a single CI build was failing that time). Some things that "fixed" the symptom were just increasing LMax in the test, inlining a function, or adding some temporary allocations in a function (https://github.com/nikwit/spectre/commit/6ecb7300cc05bcb01d4e95334616fcb83f1612c2). See the PR for more details.

@isaaclegred encountered a somewhat similar issue regarding using auto with blaze expressions. The commit that details a "fix" is here. The basic idea was

std::vector<ModalVector> vector{};
auto value = /* some math expression */;
vector.push_back(value);

Changing the auto to ModalVector fixed the issue. The test that would fail with auto wouldn't always fail the same way. Sometimes it was a segfault, sometimes an FPE, sometimes just a comparison of two numbers would fail. These lines are no longer in the code because a different approach was taken.

Possible ways to debug this if you have a test case that fails:

Run any of the sanitizers on the test
- ASAN
- Threadsan
- Address sanitizer
Implement bounds checking in Blaze (in case stuff happens off the end of the vector)

wthrowe commented 4 hours ago

std::vector<ModalVector> vector{};
auto value = /* some math expression */;
vector.push_back(value);

I think this one is unrelated, since it was a lifetime problem, and nothing in the first example is changing lifetimes. The omitted details are really important here. There were five changed lines:

auto case_minus_1 = change_of_basis[i - 1];
auto case_minus_2 = change_of_basis[i - 2];

For these, auto was already ModalVector. It would be better if either const ModalVector& or const auto& were used, since these are doing unnecessary copies.

auto to_add_1 = 2.0 * shift_right(case_minus_1);

This one is definitely broken. The auto becomes an expression template that stores a reference to the ModalVector returned by shift_right. That ModelVector goes out of scope at the end of the expression, leaving a dangling reference.

auto to_add_2 = -1.0 * case_minus_2;
auto total = to_add_1 + to_add_2;

These both resolve to expression templates, and while the blaze code is difficult to interpret, I believe they are both fine. The -1.0 appears to be stored by value in the first case. (I believe everything else for these is also stored by value, but since none of it goes out of scope before the end of the block that doesn't matter either way.)

Really, for this case it would be much better to just write the whole thing in one line, which would save 5 memory allocations and possibly even do fewer arithmetic operations. (It is possible to save a sixth allocation, but that requires some messier code.)

wthrowe commented 4 hours ago

I should also add that it would be possible to write an expression template library where this sort of thing worked without the user having to worry about lifetime details, but blaze is not that.

sxs-collaboration / spectre

Possible issues when using blaze expressions #6376