Closed by #34
Final syntax (high-level exported wrapper TBD):
```nim
proc sumReduce(n: int): int =
  var waitableSum: Flowvar[int]

  parallelReduceImpl i in 0 .. n, stride = 1:
    reduce(waitableSum):
      prologue:
        var localSum = 0
      fold:
        localSum += i
      merge(remoteSum):
        localSum += sync(remoteSum)
      return localSum

  result = sync(waitableSum)

init(Weave)
let sum1M = sumReduce(1000000)
echo "Sum reduce(0..1000000): ", sum1M
doAssert sum1M == 500_000_500_000
exit(Weave)
```
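As a rough illustration of what the prologue/fold/merge contract computes, here is a sequential Python mimic. The chunking and function names are illustrative only, not Weave's actual implementation:

```python
def sum_reduce(n: int, num_chunks: int = 4) -> int:
    # Split the inclusive range 0 .. n into contiguous chunks,
    # as a work-stealing scheduler might split the iteration space.
    bounds = [(n + 1) * k // num_chunks for k in range(num_chunks + 1)]
    partials = []
    for lo, hi in zip(bounds, bounds[1:]):
        # prologue: a fresh per-task accumulator
        local_sum = 0
        # fold: consume this task's slice of the iteration space
        for i in range(lo, hi):
            local_sum += i
        partials.append(local_sum)
    # merge: combine the partial results of "remote" tasks
    total = 0
    for remote_sum in partials:
        total += remote_sum
    return total
```

In Weave the merge step happens lazily between a task and the tasks stolen from it, but the algebra is the same: the merge operator must be associative over the per-task accumulators.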
The reduction system should accommodate complex reductions, for example parallel variance, logsumexp, or parallel softmax cross-entropy.
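Parallel variance fits this shape because per-chunk moments can be folded with Welford's algorithm and merged with Chan et al.'s pairwise formula. A hedged Python sketch (function names are my own, not from Arraymancer or Weave):

```python
def fold_moments(chunk):
    # prologue/fold: accumulate (count, mean, M2) over one chunk
    # using Welford's numerically stable online update.
    n, mean, m2 = 0, 0.0, 0.0
    for x in chunk:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2

def merge_moments(a, b):
    # merge: Chan et al.'s pairwise combination of two moment triples.
    na, ma, m2a = a
    nb, mb, m2b = b
    n = na + nb
    delta = mb - ma
    mean = ma + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2
```

The population variance of the full data is then `m2 / n` after merging all chunk states, in any association order.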
Arraymancer examples
In Arraymancer this is how parallel variance is done (https://github.com/mratsim/Arraymancer/blob/9a56648850fc34fdd9da1f9c6874a87ddecc8932/src/tensor/aggregate.nim#L100-L130):
On a whole tensor
On an axis
With a 2-pass logsumexp
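For reference, the 2-pass logsumexp takes a global maximum first, then sums the shifted exponentials, which keeps the exponentials from overflowing. A small Python sketch of the textbook algorithm (not the Arraymancer code):

```python
import math

def logsumexp_2pass(xs):
    # pass 1: global max, for numerical stability
    m = max(xs)
    # pass 2: sum of exponentials shifted by the max
    return m + math.log(sum(math.exp(x - m) for x in xs))
```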
Similarly for softmax cross-entropy implemented via parallel one-pass logsumexp
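The one-pass variant keeps a running `(max, scaled sum)` pair instead, and that state is mergeable across chunks, which is what makes it a candidate for a parallel reduce. A hedged sketch (helper names are hypothetical):

```python
import math

def lse_fold(xs):
    # fold one chunk into a (running max, sum of exp(x - max)) state,
    # rescaling the sum whenever a new max is found.
    m, s = -math.inf, 0.0
    for x in xs:
        if x <= m:
            s += math.exp(x - m)
        else:
            s = s * math.exp(m - x) + 1.0
            m = x
    return m, s

def lse_merge(a, b):
    # merge two non-empty chunk states (assumes finite maxima):
    # both sums are rescaled to the larger max before adding.
    (m1, s1), (m2, s2) = a, b
    m = max(m1, m2)
    return m, s1 * math.exp(m1 - m) + s2 * math.exp(m2 - m)

def logsumexp_1pass(xs):
    m, s = lse_fold(xs)
    return m + math.log(s)
```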
Laser example
This is a thin wrapper over OpenMP and relies on OpenMP static scheduling.
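With static scheduling, the iteration range is split into equal contiguous chunks assigned to threads up front, rather than balanced dynamically. A minimal sketch of that partitioning (illustrative only, not the Laser or OpenMP code):

```python
def static_chunks(n, num_threads):
    # OpenMP-style static schedule over [0, n): thread t gets the
    # contiguous half-open chunk [n*t/p, n*(t+1)/p).
    return [(n * t // num_threads, n * (t + 1) // num_threads)
            for t in range(num_threads)]
```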
https://github.com/numforge/laser/blob/d1e6ae6106564bfb350d4e566261df97dbb578b3/examples/ex05_tensor_parallel_reduction.nim#L47-L59
Proposed syntax
Unfortunately, the Arraymancer syntax is a bit too magical, with `x` and `y` appearing out of nowhere.
It is also hard to tell whether the element is a `float32` or a `Tensor[float32]`, for example. Instead, a lightweight DSL could be written, resembling the following:
For a parallel sum
For a parallel variance
And the code transformation will generate a lazy tree of tasks that ends with: