brian-kelley closed this issue 5 months ago
Thanks for opening the issue, Brian! Is there a trait that says associative reordering is okay? Although in a distributed scenario I wonder how it could not be okay.
@vmiheer It should be safe to assume that it's always OK, since without that assumption Kokkos couldn't do parallel reduce or scan. Do the MPI one-sided atomics give you a choice in that?
I was looking at the semantics described at https://docs.nvidia.com/nvshmem/archives/nvshmem-113/api/docs/gen/mem-model.html#differences-between-nvshmem-and-openshmem and was wondering. Although this is a question for later; for now I am going with MPI_Accumulate.
MPI RMA accumulate operations are only defined on operator/type pairs that are associative.
@vmiheer I see, so this is about how atomic operations are ordered. I wasn't familiar with this detail, but the nvshmem behavior is actually different from Kokkos core, where if you do two atomic fetch-adds from the same thread, they will always execute in order.
But KRS is supposed to be fully portable, so it only makes guarantees that all its backends make. And KRS doesn't add fences to all the nvshmem atomics (I assume this would be horrible for performance). So for the MPISpace backend, you don't have to worry about how atomics are ordered.
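For concreteness, here is a minimal sketch of the ordering question being discussed (the view name `counters` is hypothetical, not from KRS):

```cpp
#include <Kokkos_Core.hpp>

// Two atomic updates issued by the same thread to a host-accessible view.
void two_updates(Kokkos::View<int*, Kokkos::HostSpace> counters) {
  Kokkos::atomic_fetch_add(&counters(0), 1);  // update A
  Kokkos::atomic_fetch_add(&counters(1), 1);  // update B
  // Per the discussion above: in Kokkos core these execute in program
  // order from this thread, while under the nvshmem memory model linked
  // earlier a remote observer may see B before A unless a fence is
  // placed between the two calls.
}
```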
@janciesko Sorry, I didn't notice this was already an open issue!
The operator overloads for the proxy type MPIDataElement with the atomic trait have the same implementation as the non-atomic versions, for example at https://github.com/kokkos/kokkos-remote-spaces/blob/df4c46a1489909b257a74854c481af6932eb817e/src/impl/mpispace/Kokkos_MPISpace_Ops.hpp#L513
These operators should use the one-sided atomic functions: MPI_Accumulate, MPI_Get_accumulate, or MPI_Fetch_and_op.
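As a minimal sketch of what an atomic update on MPIDataElement could forward to, assuming a window `win`, a target rank, and a displacement `offset` (placeholder names, not taken from the KRS source):

```cpp
#include <mpi.h>

// Hypothetical sketch: an atomic fetch-add on a remote double held in an
// MPI window, roughly what an atomic operator+= on MPIDataElement could
// forward to. `win`, `target_rank`, and `offset` are placeholders.
double remote_atomic_fetch_add(MPI_Win win, int target_rank, MPI_Aint offset,
                               double val) {
  double old;
  // Atomically add `val` to the target element and return its prior value.
  MPI_Fetch_and_op(&val, &old, MPI_DOUBLE, target_rank, offset, MPI_SUM, win);
  // Complete the operation at the target before using `old`.
  MPI_Win_flush(target_rank, win);
  return old;
}

// If the old value is not needed, MPI_Accumulate expresses the same update.
void remote_atomic_add(MPI_Win win, int target_rank, MPI_Aint offset,
                       double val) {
  MPI_Accumulate(&val, 1, MPI_DOUBLE, target_rank, offset, 1, MPI_DOUBLE,
                 MPI_SUM, win);
  MPI_Win_flush(target_rank, win);
}
```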
MPI_Compare_and_swap in a loop could be used for non-builtin operations, as in Desul.
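A hedged sketch of such a CAS loop, again with placeholder names (`win`, `target_rank`, `offset`) and an arbitrary binary functor `Op`. Since MPI_Compare_and_swap only supports integer-like datatypes, the double value goes through its 64-bit integer representation, similar to what Desul does for device atomics:

```cpp
#include <mpi.h>
#include <cstdint>
#include <cstring>

// Hypothetical sketch of a compare-and-swap loop for an operation that has
// no builtin MPI_Op. Names are illustrative, not from the KRS source.
template <class Op>
double remote_atomic_op(MPI_Win win, int target_rank, MPI_Aint offset,
                        double val, Op op) {
  std::uint64_t expected_bits, desired_bits, observed_bits, dummy = 0;
  double expected, desired;

  // Atomically read the current remote value (MPI_NO_OP fetch).
  MPI_Fetch_and_op(&dummy, &expected_bits, MPI_UINT64_T, target_rank, offset,
                   MPI_NO_OP, win);
  MPI_Win_flush(target_rank, win);

  do {
    std::memcpy(&expected, &expected_bits, sizeof(double));
    desired = op(expected, val);
    std::memcpy(&desired_bits, &desired, sizeof(double));
    // Install `desired` only if the target still holds `expected`.
    MPI_Compare_and_swap(&desired_bits, &expected_bits, &observed_bits,
                         MPI_UINT64_T, target_rank, offset, win);
    MPI_Win_flush(target_rank, win);
    if (observed_bits == expected_bits) break;  // CAS succeeded
    expected_bits = observed_bits;  // retry with the value another rank wrote
  } while (true);

  return desired;
}
```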