Open brandongc opened 1 year ago
Listing a few use cases
do concurrent (i=1:n_tiles)
  nonzeros(i) = count_tile_nonzeros(i)
end do
offsets = exclusive_prefix_sum(nonzeros)
do concurrent (i=1:n_tiles)
  A(offsets(i)+1:offsets(i)+nonzeros(i)) = compute_tile_elements(i)
end do
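The snippet above relies on `exclusive_prefix_sum`, which is not an existing intrinsic. A minimal serial sketch of what such a routine might compute is shown below; the module name and the restriction to default integers are assumptions made for illustration only.

```fortran
module prefix_sum_sketch
  implicit none
contains
  ! Serial reference for the hypothetical exclusive_prefix_sum used above:
  ! offsets(i) = sum(counts(1:i-1)), so offsets(1) is always 0.
  pure function exclusive_prefix_sum(counts) result(offsets)
    integer, intent(in) :: counts(:)
    integer :: offsets(size(counts))
    integer :: i, running

    running = 0
    do i = 1, size(counts)
      offsets(i) = running            ! total of everything before element i
      running = running + counts(i)
    end do
  end function exclusive_prefix_sum
end module prefix_sum_sketch
```

With `nonzeros = [3, 0, 2, 5]` this gives `offsets = [0, 3, 3, 5]`, so, assuming tile `i` owns `A(offsets(i)+1 : offsets(i)+nonzeros(i))`, the tiles write to `A(1:3)`, an empty range, `A(4:5)` and `A(6:10)` without overlapping.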
Lists and goes in detail for several of
The PACK intrinsic can be implemented with a scan operation on the mask.
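To make the PACK remark concrete, here is a rough sketch for rank-1 integer arrays: an exclusive scan over the mask (as 0/1 flags) gives each kept element its destination index, followed by a scatter. `pack_via_scan` and the module name are hypothetical, and this is not meant to describe how any compiler actually implements PACK.

```fortran
module pack_scan_sketch
  implicit none
contains
  ! Sketch: emulate PACK(array, mask) for rank-1 integer arrays.
  pure function pack_via_scan(array, mask) result(packed)
    integer, intent(in) :: array(:)
    logical, intent(in) :: mask(:)      ! assumed to have the same size as array
    integer, allocatable :: packed(:)
    integer :: flags(size(mask)), dest(size(mask))
    integer :: i, running

    flags = merge(1, 0, mask)           ! 1 where an element is kept

    ! Exclusive scan of the flags: dest(i) = number of kept elements before i.
    running = 0
    do i = 1, size(flags)
      dest(i) = running
      running = running + flags(i)
    end do

    allocate(packed(running))           ! running now equals count(mask)

    ! Scatter: every kept element has a unique destination, so the
    ! iterations do not depend on each other.
    do i = 1, size(array)
      if (mask(i)) packed(dest(i) + 1) = array(i)
    end do
  end function pack_via_scan
end module pack_scan_sketch
```

Only the scan carries a dependence between iterations here, which is why a parallel scan would be the key building block for a parallel PACK.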
exclusive_scan, inclusive_scan)
More prior art: classic Cray vector machines descended from later X-MPs had a "compressed index" instruction that returned a vector of the positions of the bits that were set in a mask, and (on a much later machine) an instruction that was a true exclusive prefix scan over the bits of a mask. Both were handy for vectorizing if/then/else constructs (in different ways).
A scan operation takes a sequence of n elements [a_0, a_1, …, a_{n-1}] and a binary associative operator op as input and produces a second sequence containing the sums of prefixes, i.e. for an inclusive scan, [a_0, (a_0 op a_1), …, (a_0 op a_1 op … op a_{n-1})].
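As an illustration of the definition, the sketch below is a serial inclusive scan that takes the operator as a procedure argument. All names (`inclusive_scan`, `binary_op`, `add`, `larger`) are made up for this example and do not correspond to any proposed interface.

```fortran
module scan_definition_sketch
  implicit none

  ! Interface of the binary operator; it is assumed to be associative.
  abstract interface
    pure function binary_op(x, y) result(z)
      integer, intent(in) :: x, y
      integer :: z
    end function binary_op
  end interface

contains

  ! Inclusive scan: s(i) = a(1) op a(2) op ... op a(i).
  pure function inclusive_scan(a, op) result(s)
    integer, intent(in) :: a(:)
    procedure(binary_op) :: op
    integer :: s(size(a))
    integer :: i

    if (size(a) == 0) return          ! nothing to do for an empty sequence
    s(1) = a(1)
    do i = 2, size(a)
      s(i) = op(s(i-1), a(i))
    end do
  end function inclusive_scan

  pure function add(x, y) result(z)
    integer, intent(in) :: x, y
    integer :: z
    z = x + y
  end function add

  pure function larger(x, y) result(z)
    integer, intent(in) :: x, y
    integer :: z
    z = max(x, y)
  end function larger

end module scan_definition_sketch
```

For `a = [3, 1, 4, 1, 5]`, `inclusive_scan(a, add)` gives `[3, 4, 8, 9, 14]` and `inclusive_scan(a, larger)` gives the running maximum `[3, 3, 4, 4, 5]`. Associativity of `op` is what lets a parallel implementation regroup the operations.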
Scans are important components of many use cases, including sparse data structures, histograms, and sorting. Parallel algorithms are available to exploit multiple levels of parallelism: SIMD, SIMT, device, and distributed. However, efficient implementation of these algorithms is non-trivial and typically not portable between different levels, or even between devices from the same vendor.
An important generalization is the segmented scan, which is the collective application of M scan operations to a single array with M segments.
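One possible way to picture a segmented scan is sketched below, using a logical array of head flags to mark where each segment begins. The flag representation and the name `segmented_inclusive_sum` are assumptions for illustration; other encodings (segment lengths, segment ids) would work equally well.

```fortran
module segmented_scan_sketch
  implicit none
contains
  ! Inclusive segmented sum: the running total restarts wherever
  ! seg_start(i) is .true.; element 1 always begins a segment.
  pure function segmented_inclusive_sum(a, seg_start) result(s)
    integer, intent(in) :: a(:)
    logical, intent(in) :: seg_start(:)   ! assumed to have the same size as a
    integer :: s(size(a))
    integer :: i, running

    running = 0
    do i = 1, size(a)
      if (seg_start(i)) running = 0       ! a new segment begins here
      running = running + a(i)
      s(i) = running
    end do
  end function segmented_inclusive_sum
end module segmented_scan_sketch
```

With `a = [1, 2, 3, 4, 5, 6]` and segment starts at positions 1, 4 and 6 (M = 3), the result is `[1, 3, 6, 4, 9, 6]`: three independent scans applied to one array.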
Similar to #224, but instead of a modification to do concurrent, add intrinsic routines.
In HPF there are:
HPF APIs