OpenMP: Memory optimization of intermediate result #261

Open · mratsim opened this issue 6 years ago

mratsim commented 6 years ago

To avoid false sharing / bank conflicts / cache thrashing when multiple threads read and write data in the same cache line, an intermediate array is used, with the intermediate values padded so that each lands on its own cache line (see #139):

```
Tensor x: 1_000_000 floats
             |
             |
             v
Intermediate reduction: 4 floats (for a quad-core CPU)
             |
             |
             v
Final result: 1 float
```
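As an illustration, here is a minimal sketch of that scheme (hypothetical `sumRange`/`paddedSum` procs and a fixed 4-block split, not Arraymancer's actual kernel): each OpenMP iteration reduces one contiguous chunk and writes its partial result to a slot padded out to a full cache line.

```nim
# Sketch of a padded intermediate reduction (hypothetical names).
# Compile with: nim c --passC:-fopenmp --passL:-fopenmp example.nim
# Without -fopenmp the `||` loop simply runs serially.

proc sumRange(x: openArray[float32]; lo, hi: int): float32 =
  ## Serial partial sum over x[lo ..< hi]; locals live on the worker
  ## thread's own stack, so nothing here is shared between threads.
  for i in lo ..< hi:
    result += x[i]

proc paddedSum(x: openArray[float32]): float32 =
  const
    numBlocks = 4  # e.g. one block per core on a quad-core CPU
    stride = 16    # float32 elements per 64-byte cache line
  # Each partial result gets its own cache line, so two threads never
  # write to the same line (no false sharing).
  var partials: array[numBlocks * stride, float32]
  let chunk = (x.len + numBlocks - 1) div numBlocks
  for b in 0 || (numBlocks - 1):   # generates #pragma omp parallel for
    partials[b * stride] = sumRange(x, b * chunk, min((b + 1) * chunk, x.len))
  # Final (serial) reduction of the numBlocks padded partials.
  for b in 0 ..< numBlocks:
    result += partials[b * stride]

when isMainModule:
  var x = newSeq[float32](1_000_000)
  for i in 0 ..< x.len: x[i] = 1'f32
  echo paddedSum(x)  # 1000000.0
```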

Given that cache lines are 64 bytes, to reduce memory consumption, elements of type T should be spaced by n, with n = max(64 div sizeof(T), 1) ==> 8 for float64/int64, 16 for float32/int32, 2 for a 32-byte uint256, and 1 for any type of 64 bytes or more.
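A minimal compile-time sketch of that formula, assuming 64-byte cache lines and a hypothetical `paddingStride` helper:

```nim
const cacheLineSize = 64

proc paddingStride(T: typedesc): int =
  ## Elements of type T spaced by this stride land on distinct cache lines.
  max(cacheLineSize div sizeof(T), 1)

echo paddingStride(float64)  # 8
echo paddingStride(float32)  # 16
```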

Unfortunately, sizeof(T) only works for primitive types and, since https://github.com/nim-lang/Nim/pull/8445, for arrays of primitive types. However, it is still pending https://github.com/nim-lang/Nim/pull/5664 for Tensors of custom objects.

In the meantime, the intermediate array elements are arbitrarily spaced by 16 (maxItemsPerCacheLine), which wastes a lot of space when T is a tensor (a couple of hundred bytes in size).

https://github.com/mratsim/Arraymancer/blob/2a3c406261eeceffaac33bbde0eed691bae16103/src/tensor/backend/openmp.nim#L73-L92

mratsim commented 6 years ago

Compile-time sizeof was implemented in https://github.com/nim-lang/Nim/pull/9356. Note that https://github.com/nim-lang/Nim/pull/9493 is an alternative: it allows each thread to keep a local variable and merge the partial reductions in an `#pragma omp critical` section.

This would allow much better scaling on processors with more than 8 cores (the current limit on the number of reduction blocks).
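A hedged sketch of what that pattern could look like (hypothetical `threadWork`/`criticalSum` names, raw emit pragmas in the style of Laser's omp templates, not the #9493 implementation). The per-thread accumulator is kept in a callee's stack frame, because locals declared in the enclosing proc may be hoisted to the top of the generated C function, where OpenMP would treat them as shared:

```nim
# Requires: nim c --passC:-fopenmp --passL:-fopenmp example.nim
proc omp_get_thread_num(): cint {.importc, header: "omp.h".}
proc omp_get_num_threads(): cint {.importc, header: "omp.h".}

var total: float32  # shared accumulator, only updated in the critical section

proc threadWork(x: openArray[float32]) =
  # Called once per OpenMP thread. Every local below lives on this call's
  # stack frame and is therefore genuinely thread-private; x is only read.
  let
    nt = omp_get_num_threads().int
    tid = omp_get_thread_num().int
    chunk = (x.len + nt - 1) div nt
    lo = tid * chunk
    hi = min(lo + chunk, x.len)
  var localSum: float32
  for i in lo ..< hi:
    localSum += x[i]
  {.emit: "#pragma omp critical".}
  block:
    total += localSum  # partial sums merged one thread at a time

proc criticalSum(x: seq[float32]): float32 =
  total = 0
  {.emit: "#pragma omp parallel".}
  block:
    threadWork(x)
  result = total
```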

This is being implemented in Laser, Arraymancer's future backend: see the parallel reduction example there.

Note that runtime allocation is probably better anyway: it supports a high number of cores while keeping the binary small, similar to the sum kernel: https://github.com/numforge/laser/blob/e9bbb1055517c6e76649b67737d7b8807af22eb2/laser/hpc_kernels/sum.nim#L56-L60. The only issue is on embedded targets.
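A minimal sketch of that runtime-allocation idea (hypothetical `newPartials` helper, not the actual Laser kernel): size the padded scratch buffer from omp_get_max_threads() instead of a hard-coded compile-time thread cap.

```nim
proc omp_get_max_threads(): cint {.importc, header: "omp.h".}

const itemsPerCacheLine = 64 div sizeof(float32)  # 16 float32 per 64-byte line

proc newPartials(): seq[float32] =
  # One padded slot (a full cache line) per potential thread, sized at
  # run time: scales to any core count without growing the binary, but
  # assumes a heap allocator is available, hence the embedded caveat above.
  newSeq[float32](omp_get_max_threads().int * itemsPerCacheLine)
```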