trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/
Other
1.19k stars 566 forks source link

Tpetra: Map<>::getLocalElement cannot be called in a host-side parallel region #13445

Open MalachiTimothyPhillips opened 1 week ago

MalachiTimothyPhillips commented 1 week ago

Bug Report

@trilinos/tpetra @japlews @vbrunini

Tpetra::Map<>::getLocalElement (which according to the doxygen should be thread safe) may call Tpetra::Map<>::lazyPushToHost.

Tpetra::Map<>::lazyPushToHost performs several operations (Kokkos::deep_copy, Kokkos::fence) that are "illegal" from within a parallel region.

With the recent Kokkos 4.4 promotion, this may now cause host-side parallel regions to hang.

See:

https://github.com/trilinos/Trilinos/blob/b895aa6850943911c4b9388249a00a4d372f601e/packages/tpetra/core/src/Tpetra_Map_decl.hpp#L631

and

https://github.com/trilinos/Trilinos/blob/b895aa6850943911c4b9388249a00a4d372f601e/packages/tpetra/core/src/Tpetra_Map_def.hpp#L1251

Steps to Reproduce

Call Tpetra::Map<>::lazyPushToHost from within a host-side Kokkos::parallel_for.

japlews commented 1 week ago

@MalachiTimothyPhillips thanks a lot for reporting this. FYI, we hit this with Kokkos 4.4 integrations in Sierra. I would also add to the comments that there are several functions in Tpetra::Map that call lazyPushToHost; not limited to just getLocalElement. It's easy enough to work around, so this is not super urgent, but would be nice to fix.

csiefer2 commented 4 days ago

@MalachiTimothyPhillips @japlews We generally recommend avoiding calling Tpetra functions (excepting those with KOKKOS_FUNCTION decorators) in parallel regions, because that code will never work on a GPU.

As you note, the documentation about thread safety of that function is no longer correct. The change we neglected to document is an intentional one --- the lazyPushToHost stuff was designed to avoid adding a bunch of D2H calls in the middle of an otherwise 100% GPU calculations. Previously every Map construction involved those D2H calls.

If you're not building a GPU backend, we might be able to clean up the fences. But if you are building a GPU backend with a host-parallel front end, then those fences really do need to be there.

japlews commented 3 days ago

@csiefer2 This sounds good. I only advocate for improving the comments or otherwise enforcing somehow that we don't do this in threaded loops where a hang could occur!