openshmem-org / specification

OpenSHMEM Application Programming Interface
http://www.openshmem.org

Improve Noncontiguous APIs #365

Open jdinan opened 4 years ago

jdinan commented 4 years ago

Issue

The current interleaved communication routines in OpenSHMEM (shmem_iput/shmem_iget) transfer single-element chunks that are a fixed stride apart (the source and destination can have different strides). This API does not capture many noncontiguous data transfer patterns; for example, it is inefficient for applications that transfer array sections of two- and higher-dimensional arrays.

Possible Solutions

Block Interleaved API

Extend the existing SHMEM interleaved APIs (e.g., shmem_iput) to include a block size. This would allow them to support 2D array slice transfers.

void shmem_ibput(TYPE *dest, const TYPE *source, ptrdiff_t dst, ptrdiff_t sst,
                 size_t blocksize, size_t nelems, int pe);
void shmem_iputmem(void *dest, const void *source, ptrdiff_t dst, ptrdiff_t sst,
                   size_t element_size, size_t nelems, int pe);

Strided APIs

Something similar to the ARMCI strided APIs could be used. This supports generic matrix slice transfers.

int ARMCI_PutS(void *src_ptr, int src_stride_ar[/*stride_levels*/],
               void *dst_ptr, int dst_stride_ar[/*stride_levels*/], 
               int count[/*stride_levels+1*/], int stride_levels, int proc);

Subarray APIs

Similar to the MPI subarray datatype. This supports generic matrix slice transfers.

void shmem_subarray_put(
        shmem_ctx_t ctx,
        TYPE *dest, size_t dest_ndim, size_t dest_dims[],
        size_t dest_start[], size_t dest_count[],
        TYPE *src, size_t src_ndim, size_t src_dims[],
        size_t src_start[], size_t src_count[],
        int pe);

The user specifies the full dimensions of the source and destination matrices and a pointer to each matrix's zeroth element. The start indices and extents (the start/count arrays) of the source and destination slices identify the subarrays to transfer.

This API has the advantage of being easy to use (versus strided APIs, which require reasoning about the linearization of the matrix). However, because data can be reshaped during the transfer, it also requires more work from implementations.

Datatype API

Similar to the MPI datatypes API. Introduce an API for datatype creation and put/get APIs that take source and destination datatypes. An additional API could be used to inform the target about the datatype ahead of time:

shmem_dtype_commit(shmem_dtype_t type, shmem_team_t team, shmem_dtype_hints_t hints);

naveen-rn commented 4 years ago

Related: Extending Strided Communication Interfaces in OpenSHMEM Towards Matrix Oriented Strides in OpenSHMEM

jdinan commented 4 years ago

From 7/2/2020 meeting, WG prefers the block interleaved API. Would like to see strong drivers for strided APIs.

jeffhammond commented 4 years ago

Regarding the more general strided APIs...

From https://github.com/jeffhammond/oshmpi/blob/master/docs/oug2014_resubmission-acm_4.pdf:

It is worth asking whether it is worthwhile to generalize the APUT operation for dimensions higher than two to support tensor operations (for some applications, see [7] and [15]). There are two arguments against this. First, operations on subarrays of dimension greater than two can be expressed in terms of a single APUT operation by combining the strides; for example, a three-dimensional subarray operation can be cast in terms of a two-dimension subarray computation if the stride over x and y are multiplied together (here we assume z is the contiguous dimension that is captured by blockelems). Regardless of the number of dimensions associated with the strides, the key efficiency gain with APUT is accomplished by operating on blocks of contiguous data rather than single elements, as is the case for IPUT. Second, the myriad of applications involving tensor operations include many cases where cartesian subarrays are not useful. For example, in the domain of quantum chemistry, most tensors have permutation (anti-)symmetry and thus cannot make use of operations designed for non-symmetric subarrays. Such is the complexity of tensor data in the NWChem [3] Tensor Contraction Engine [6] that block-sparse and permutation- (anti)symmetric tensors are mapped to one-dimensional global arrays with an application-defined hashing scheme.

jeffhammond commented 4 years ago

If you are going to add 2D array support, you might want to think about collectives as well.

jdinan commented 4 years ago

Discussion at RMA WG today:

Interest in pursuing the datatypes API. However, we would need a driver.

Possible drivers for noncontig APIs:

jdinan commented 4 years ago

@jeffhammond I don't understand the argument for dimensions higher than two using APUT. Are you calling it in a loop over the outer dimensions?

jeffhammond commented 1 year ago

I'm saying 2D is sufficient for cartesian arrays. 3D can be collapsed to 2D by multiplying the first two strides. And so forth. Or one can loop over 2D ops if somehow that doesn't work. The loop overhead isn't going to matter because a 2D operation is going to be relatively expensive.