lanl / singularity-eos

Performance portable equations of state and mixed cell closures
https://lanl.github.io/singularity-eos/
BSD 3-Clause "New" or "Revised" License

Issues with CRTP `Evaluate` on GPUs #393

Open pdmullen opened 1 month ago

pdmullen commented 1 month ago

I am experiencing an issue in a downstream app wherein an Evaluate call (see https://github.com/lanl/singularity-eos/blob/dcd7a6efd5acbfa53903f08f02aca7abfa5c3f5f/singularity-eos/eos/eos_base.hpp#L135-L139) leads to data corruption on GPUs (Volta with nvhpc). So far I can only reproduce the issue when working with Spiner EOS tables.

Printing the data at the addresses held by explicit pointers, both inside and outside the Evaluate call, shows differing values.

@jonahm-LANL asked me to make an issue --- we will work on isolating the problem to see if this is a downstream app or singularity issue. If the latter, I am sure a reproducer is in order...

jhp-lanl commented 1 month ago

Which pointers? The functor? The class pointer? Or data pointers specific to spiner?

Yurlungur commented 1 month ago

I think @pdmullen is referring to the pointers to the data being accessed. The call pattern that fails looks something like this:

// on device
Real *some_device_pointer = &my_device_view(0);
auto my_functor = [=](auto eos) {
  Kokkos::parallel_for(inner_range, [=](int it) {
    // index the captured device pointer with the loop index
    eos.SomeCall(some_device_pointer[it]);
  });
};
eos.Evaluate(my_functor);

jhp-lanl commented 1 month ago

> I think @pdmullen is referring to the pointers to the data being accessed. The call pattern that fails looks something like this:
>
> // on device
> Real *some_device_pointer = &my_device_view(0);
> auto my_functor = [=](auto eos) {
>   Kokkos::parallel_for(inner_range, [=](int it) {
>     eos.SomeCall(some_device_pointer[it]);
>   });
> };
> eos.Evaluate(my_functor);

I assume that should be a KOKKOS_LAMBDA rather than a regular lambda?

Yurlungur commented 1 month ago

Nope, because that whole call is already inside a Kokkos kernel.

jhp-lanl commented 1 month ago

> Nope because that whole call is inside a kokkos kernel already.

Evaluate isn't a PORTABLE_FUNCTION. Does it need to be if it's being called on device?
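For context, portability macros of this kind typically expand to CUDA's `__host__ __device__` qualifiers when a CUDA compiler is in use and to nothing otherwise. The sketch below uses a hypothetical macro name, `SKETCH_PORTABLE_FUNCTION`, to contrast an annotated function with an unannotated `constexpr` one. It is an assumption about the general pattern, not the actual definition of `PORTABLE_FUNCTION`.

```cpp
#include <cassert>

// Hypothetical stand-in for a portability macro (not the real PORTABLE_FUNCTION).
// __CUDACC__ is defined when nvcc compiles the file.
#if defined(__CUDACC__)
#define SKETCH_PORTABLE_FUNCTION __host__ __device__
#else
#define SKETCH_PORTABLE_FUNCTION
#endif

// Explicitly annotated: callable from host and device code under CUDA.
SKETCH_PORTABLE_FUNCTION inline double annotated_square(double x) { return x * x; }

// Unannotated constexpr: whether device code may call this depends on the
// compiler and its flags (for nvcc, --expt-relaxed-constexpr permits it).
constexpr double constexpr_square(double x) { return x * x; }

static_assert(constexpr_square(3.0) == 9.0, "usable at compile time on the host");

// Usage: on the host both variants behave identically.
inline void check_squares() {
  assert(annotated_square(2.0) == 4.0);
  assert(constexpr_square(2.0) == 4.0);
}
```

The difference between the two only matters for device compilation, which is why the question is whether relying on `constexpr` alone is sufficient with a given toolchain.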

Yurlungur commented 1 month ago

That's why Evaluate is marked constexpr: that's supposed to allow us to call it on device. But it's possible something is going wrong there.
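To make the pattern under discussion concrete, here is a minimal host-only sketch of a CRTP base class with a `constexpr` `Evaluate` that hands a copy of the derived object to a functor. The class names `EosBaseSketch` and `IdealGasSketch`, the method `PressureFromDensity`, and the helper `evaluate_pressure` are all hypothetical, and the body of `Evaluate` is an assumption about the general shape of such a method, not a copy of the code in `eos_base.hpp`.

```cpp
#include <cassert>

// Hypothetical CRTP base: Derived is the concrete EOS type.
template <typename Derived>
class EosBaseSketch {
 public:
  // constexpr and otherwise unannotated. Hands a by-value copy of the
  // concrete object to the functor, so the functor sees the derived type.
  template <typename Functor>
  constexpr void Evaluate(Functor &f) const {
    Derived copy = *static_cast<const Derived *>(this);
    f(copy);
  }
};

// Hypothetical concrete EOS with a single parameter.
class IdealGasSketch : public EosBaseSketch<IdealGasSketch> {
 public:
  constexpr explicit IdealGasSketch(double gm1) : gm1_(gm1) {}
  constexpr double PressureFromDensity(double rho, double sie) const {
    return gm1_ * rho * sie;
  }

 private:
  double gm1_;
};

// Usage: the functor captures a raw data pointer by value, mirroring the
// call pattern in this thread, and writes its result through a reference.
inline double evaluate_pressure(const IdealGasSketch &eos, const double *rho,
                                double sie) {
  assert(rho != nullptr);
  double result = 0.0;
  auto functor = [=, &result](auto e) {
    result = e.PressureFromDensity(rho[0], sie);
  };
  eos.Evaluate(functor);
  return result;
}
```

On the host this behaves as expected. The open question in this issue is whether the same by-value captures survive intact when the call happens inside a device kernel.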

jhp-lanl commented 1 month ago

ahhhhhh gotcha