j3-fortran / fortran_proposals

Proposals for the Fortran Standard Committee
175 stars 14 forks source link

intrinsics for scan (prefix sum) #273

Open brandongc opened 1 year ago

brandongc commented 1 year ago

A scan operation takes a sequence of n elements [a_0, a1, …, a{n-1}] and a binary associative operator op as input and produces a second sequence containing the sums of prefixes. ie

b(1) = 0
do i = 1, n-1
    a(i+1) = a(i) + b(i)
end do

Scans are important components of many use cases including sparse data structures, histograms, sorting. Parallel algorithms are available to exploit multiple levels of parallelism: SIMD, SIMT, Device, distributed. However efficient implementation of these algorithms is non-trivial and typically not portable between different levels or even devices from the same vendor.

An important generalization is the segmented scan which is the collective application of M scan operations to a single array with M segments.

Similar to #224, but instead of modification to do concurrent add intrinsic routines.

In HPF there are:

<op>_PREFIX(ARRAY, DIM, MASK, SEGMENT, EXCLUSIVE)
<op>_SUFFIX(ARRAY, DIM, MASK, SEGMENT, EXCLUSIVE)

HPF APIs

brandongc commented 1 year ago

Listing a few use cases

Converting counts to offsets

  do concurrent (i=1:n_tiles)
     nonzeros(i) = count_tile_nonzeros(i)
  end do

  offsets = exclusive_prefix_sum(nonzeros)

  do concurrent (i=1:n_tiles)
     A(offset(i)+1:offset(i)) = compute_tile_elements(i)
  end do

Sparse Linear Algebra

https://en.wikipedia.org/wiki/Prefix_sum#Applications

Prefix Sums and Their Applications - G. Blelloch

Lists and goes in detail for several of

pack intrinsic

PACK intrinsic can be implemented with a scan operation on the mask

Prior Art

w6ws commented 1 year ago

More prior art:

klausler commented 1 year ago

More prior art: classic Cray vector machines descended from later X-MPs had a "compressed index" instruction that returned a vector of the positions of the bits that were set in a mask, and (on a much later machine) an instruction that was a true exclusive prefix scan over the bits of a mask. Both were handy for vectorizing if/then/else constructs (in different ways).