mratsim opened 6 years ago
Implemented in https://github.com/nim-lang/Nim/pull/9356. Note that https://github.com/nim-lang/Nim/pull/9493 is an alternative: it lets each thread keep a local variable and merge its partial reduction in a `#pragma omp critical` section.
This would allow much better scaling on processors with more than 8 cores (the current limit of the reduction block).
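For illustration, here is a minimal Nim sketch of that pattern. It is not the compiler's actual output and the helper names are made up; it relies on the `{.emit.}`-before-`block` trick that Laser's and Arraymancer's OpenMP shims use to attach a pragma to the next scope. The accumulator is truly thread-private because it lives in the stack frame of a proc that each thread calls once:

```nim
# A minimal sketch of the nim-lang/Nim#9493 idea, NOT the compiler's
# actual output; helper names are made up. Compile with:
#   nim c --passC:-fopenmp --passL:-fopenmp sum_critical.nim

proc omp_get_thread_num(): cint {.importc, header: "omp.h".}
proc omp_get_num_threads(): cint {.importc, header: "omp.h".}

proc threadPartial(x: ptr UncheckedArray[float64], n: int, total: ptr float64) =
  # Called once per OpenMP thread: `partial` sits in each thread's own
  # stack frame, so it is private without any padding tricks.
  let
    tid = omp_get_thread_num().int
    nThreads = omp_get_num_threads().int
    chunk = (n + nThreads - 1) div nThreads
    lo = tid * chunk
    hi = min(lo + chunk, n)
  var partial = 0.0
  for i in lo ..< hi:
    partial += x[i]
  {.emit: "#pragma omp critical".}
  block:                # one serialized merge per thread,
    total[] += partial  # not one per element

proc ompCriticalSum(x: seq[float64]): float64 =
  if x.len == 0: return 0.0
  var total = 0.0
  let p = cast[ptr UncheckedArray[float64]](x[0].unsafeAddr)
  {.emit: "#pragma omp parallel".}
  block:
    threadPartial(p, x.len, total.addr)
  total
```

Because the merge happens once per thread instead of once per partial-reduction slot, the cost of the critical section stays negligible regardless of the core count.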
This is being implemented in Laser, Arraymancer's future backend: see its parallel reduction example.
Note that runtime allocation is probably better anyway, as it supports a high number of cores while keeping the binary small, similar to the sum kernel: https://github.com/numforge/laser/blob/e9bbb1055517c6e76649b67737d7b8807af22eb2/laser/hpc_kernels/sum.nim#L56-L60. The only issue is embedded targets.
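A rough sketch of the runtime-allocation idea, with illustrative names rather than the sum kernel's actual code: the partial-results buffer is sized from `omp_get_max_threads()` at runtime, so it scales to any core count without a compile-time maximum baked into the binary. The padding stride it uses is motivated in the next paragraph.

```nim
# Illustrative sketch, not the Laser sum kernel. Compile with:
#   nim c --passC:-fopenmp --passL:-fopenmp partial_sums.nim

proc omp_get_max_threads(): cint {.importc, header: "omp.h".}
proc omp_get_thread_num(): cint {.importc, header: "omp.h".}

const Padding = 8  # 8 float64 slots = 64 bytes: one cache line per thread

proc ompPaddedSum(x: seq[float64]): float64 =
  let nThreads = omp_get_max_threads().int
  # Sized at runtime from the actual thread count,
  # instead of a fixed compile-time limit.
  var partials = newSeq[float64](nThreads * Padding)
  for i in 0 || (x.len - 1):  # generates #pragma omp parallel for
    # Each thread accumulates into its own padded slot; the stride keeps
    # slots on distinct cache lines (see the false-sharing note below).
    # A real kernel would hoist the slot lookup and accumulate in a
    # register; this is kept naive for brevity.
    partials[omp_get_thread_num().int * Padding] += x[i]
  for t in 0 ..< nThreads:    # serial merge of the partial results
    result += partials[t * Padding]
```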
To avoid false sharing / bank conflicts / cache thrashing when multiple threads read and write data in the same cache line, an intermediate array is used, with the intermediate values padded so that each lands on a different cache line. (See #139)
Given that cache lines are typically 64 bytes, to reduce memory consumption elements of type T should be spaced by n, with

n = max(64 div sizeof(T), 1)

==> 8 for float64/int64, 16 for float32/int32, and 2 for a 32-byte uint256, for example.

Unfortunately, compile-time sizeof(T) only works for primitive types and, since https://github.com/nim-lang/Nim/pull/8445, for arrays of primitive types. Support for Tensors of custom objects is still pending https://github.com/nim-lang/Nim/pull/5664.
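For the types where compile-time sizeof already works, the spacing rule is a one-liner; a sketch, with a helper name that is illustrative rather than Arraymancer's API:

```nim
const CacheLineSize = 64  # bytes, on mainstream x86/ARM CPUs

template elemsPerCacheLine(T: typedesc): int =
  ## Spacing (in elements) so that two adjacent slots
  ## never land on the same cache line.
  max(CacheLineSize div sizeof(T), 1)

static:  # all checked at compile time, for primitive types
  doAssert elemsPerCacheLine(float64) == 8
  doAssert elemsPerCacheLine(float32) == 16
```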
In the meantime, the intermediate array elements are arbitrarily spaced by 16 (`maxItemsPerCacheLine`), which wastes a lot of space when T is a tensor (a couple hundred bytes in size):
https://github.com/mratsim/Arraymancer/blob/2a3c406261eeceffaac33bbde0eed691bae16103/src/tensor/backend/openmp.nim#L73-L92