This PR suggests an optimisation that reduces the number of computations in the kernel by processing frequencies in blocks (chunks). It gives a 25% speedup for BetatronExample on an AMD Radeon R9 M370X but almost none on an NVIDIA RTX. I'm not sure it's worth extending to all kernels, given the loss of readability.
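To make the chunking idea concrete, here is a minimal CPU-side sketch. It is a hypothetical illustration, not the code in this PR. It assumes a uniform frequency grid `omega0 + k * d_omega` and a kernel body that sums `a * exp(i*omega*t)` over time samples. The function names and the `chunk` parameter are made up for this example. Within a chunk, each time sample needs only two complex exponentials. The remaining frequencies in the chunk are reached by repeated multiplication with `exp(i*d_omega*t)`, which replaces most transcendental calls with complex multiplies.

```python
import cmath

def spectrum_naive(amps, times, omegas):
    """Reference version: one complex exponential per (frequency, time) pair."""
    out = []
    for w in omegas:
        acc = 0j
        for a, t in zip(amps, times):
            acc += a * cmath.exp(1j * w * t)
        out.append(acc)
    return out

def spectrum_chunked(amps, times, omega0, d_omega, n_freq, chunk=8):
    """Hypothetical chunked version (assumes a uniform grid omega0 + k*d_omega).

    For each chunk of frequencies and each time sample, only two exponentials
    are evaluated; the other frequencies in the chunk are obtained by
    multiplying the phase by exp(i*d_omega*t) step by step.
    """
    out = [0j] * n_freq
    for start in range(0, n_freq, chunk):
        size = min(chunk, n_freq - start)  # last chunk may be shorter
        w_start = omega0 + start * d_omega
        for a, t in zip(amps, times):
            phase = cmath.exp(1j * w_start * t)  # exp(i*w_start*t)
            step = cmath.exp(1j * d_omega * t)   # exp(i*d_omega*t)
            for k in range(size):
                out[start + k] += a * phase      # phase == exp(i*(w_start + k*d_omega)*t)
                phase *= step
    return out

amps = [1.0, 0.5, -0.25, 0.75]
times = [0.0, 0.1, 0.2, 0.3]
omegas = [1.0 + 0.5 * k for k in range(20)]
ref = spectrum_naive(amps, times, omegas)
fast = spectrum_chunked(amps, times, 1.0, 0.5, 20, chunk=8)
```

The trade-off is the one noted above: the inner recurrence adds a loop and bookkeeping that make the kernel harder to read. Rounding error also grows slowly with chunk length. How much time it saves depends on how expensive `exp`/`sin`/`cos` are compared with complex multiplies on the target GPU.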