brian-kelley closed this issue 5 months ago
Thanks for opening the issue, Brian! Is there a trait that says associative reordering is okay? Although in a distributed scenario I wonder how it could not be okay.
@vmiheer It should be safe to assume that it's always OK, since without that assumption Kokkos couldn't do parallel reduce or scan. Do the MPI one-sided atomics give you a choice in that?
I was looking at the semantics described at https://docs.nvidia.com/nvshmem/archives/nvshmem-113/api/docs/gen/mem-model.html#differences-between-nvshmem-and-openshmem and was wondering. Although this is a question for later; for now I am going with MPI_Accumulate.
MPI RMA accumulate operations are only defined on operator/type pairs that are associative.
@vmiheer I see, so this is about how atomic operations are ordered. I wasn't familiar with this detail, but the nvshmem behavior is actually different from Kokkos core, where if you do two atomic fetch-adds from the same thread, they will always execute in order.
But KRS is supposed to be fully portable, so it only makes guarantees that all its backends make. And KRS doesn't add fences to all the nvshmem atomics (I assume this would be horrible for performance). So for the MPISpace backend, you don't have to worry about how atomics are ordered.
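For concreteness, here is a minimal sketch of the ordering question being discussed (the view name `counters` is hypothetical, not from KRS):

```cpp
#include <Kokkos_Core.hpp>

// Two atomic updates issued by the same thread to a host-accessible view.
void two_updates(Kokkos::View<int*, Kokkos::HostSpace> counters) {
  Kokkos::atomic_fetch_add(&counters(0), 1);  // update A
  Kokkos::atomic_fetch_add(&counters(1), 1);  // update B
  // Per the discussion above: in Kokkos core these execute in program
  // order from this thread, while under the nvshmem memory model linked
  // earlier a remote observer may see B before A unless a fence is
  // placed between the two calls.
}
```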
@janciesko Sorry, I didn't notice this was already an open issue!
The operator overloads for the proxy type MPIDataElement with the atomic trait have the same implementation as the non-atomic versions, for example at https://github.com/kokkos/kokkos-remote-spaces/blob/df4c46a1489909b257a74854c481af6932eb817e/src/impl/mpispace/Kokkos_MPISpace_Ops.hpp#L513
These operators should use the one-sided atomic functions: MPI_Accumulate, MPI_Get_accumulate, or MPI_Fetch_and_op.
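As a minimal sketch of what an atomic update on MPIDataElement could forward to, assuming a window `win`, a target rank, and a displacement `offset` (placeholder names, not taken from the KRS source):

```cpp
#include <mpi.h>

// Hypothetical sketch: an atomic fetch-add on a remote double held in an
// MPI window, roughly what an atomic operator+= on MPIDataElement could
// forward to. `win`, `target_rank`, and `offset` are placeholders.
double remote_atomic_fetch_add(MPI_Win win, int target_rank, MPI_Aint offset,
                               double val) {
  double old;
  // Atomically add `val` to the target element and return its prior value.
  MPI_Fetch_and_op(&val, &old, MPI_DOUBLE, target_rank, offset, MPI_SUM, win);
  // Complete the operation at the target before using `old`.
  MPI_Win_flush(target_rank, win);
  return old;
}

// If the old value is not needed, MPI_Accumulate expresses the same update.
void remote_atomic_add(MPI_Win win, int target_rank, MPI_Aint offset,
                       double val) {
  MPI_Accumulate(&val, 1, MPI_DOUBLE, target_rank, offset, 1, MPI_DOUBLE,
                 MPI_SUM, win);
  MPI_Win_flush(target_rank, win);
}
```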
MPI_Compare_and_swap in a loop could be used for non-builtin operations, as in Desul.
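A hedged sketch of such a CAS loop, again with placeholder names (`win`, `target_rank`, `offset`) and an arbitrary binary functor `Op`. Since MPI_Compare_and_swap only supports integer-like datatypes, the double value goes through its 64-bit integer representation, similar to what Desul does for device atomics:

```cpp
#include <mpi.h>
#include <cstdint>
#include <cstring>

// Hypothetical sketch of a compare-and-swap loop for an operation that has
// no builtin MPI_Op. Names are illustrative, not from the KRS source.
template <class Op>
double remote_atomic_op(MPI_Win win, int target_rank, MPI_Aint offset,
                        double val, Op op) {
  std::uint64_t expected_bits, desired_bits, observed_bits, dummy = 0;
  double expected, desired;

  // Atomically read the current remote value (MPI_NO_OP fetch).
  MPI_Fetch_and_op(&dummy, &expected_bits, MPI_UINT64_T, target_rank, offset,
                   MPI_NO_OP, win);
  MPI_Win_flush(target_rank, win);

  do {
    std::memcpy(&expected, &expected_bits, sizeof(double));
    desired = op(expected, val);
    std::memcpy(&desired_bits, &desired, sizeof(double));
    // Install `desired` only if the target still holds `expected`.
    MPI_Compare_and_swap(&desired_bits, &expected_bits, &observed_bits,
                         MPI_UINT64_T, target_rank, offset, win);
    MPI_Win_flush(target_rank, win);
    if (observed_bits == expected_bits) break;  // CAS succeeded
    expected_bits = observed_bits;  // retry with the value another rank wrote
  } while (true);

  return desired;
}
```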