Open — douglase opened this issue 4 years ago
For a 512x512x2000 complex128 test case, the default numexpr and numpy run times for the exp
function are ~8 sec and ~180 sec respectively, on our test machine (AMD EPYC 7642 48-Core Processor).
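A minimal sketch of that comparison (the array shape here is deliberately much smaller than the 512x512x2000 case so it runs quickly; the variable names and timings are illustrative, not from the issue):

```python
import time

import numpy as np
import numexpr as ne

# Much smaller than the issue's 512x512x2000 case, so it finishes fast.
shape = (64, 64, 200)
rng = np.random.default_rng(0)
a = (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)).astype(np.complex128)

t0 = time.perf_counter()
ref = np.exp(a)                  # plain numpy ufunc
t_np = time.perf_counter() - t0

t0 = time.perf_counter()
out = ne.evaluate("exp(a)")      # numexpr virtual machine, multithreaded
t_ne = time.perf_counter() - t0

print(f"numpy: {t_np:.4f} s, numexpr: {t_ne:.4f} s")
```

Both paths compute the same values; the gap in wall time is what the numbers above measure at full scale.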
Using Numba compiled for the GPU (an NVIDIA V100) is faster still, about 1.5 sec.
Numexpr, however, does not default to large thread counts.
If we increase the thread count to something closer to the number of cores available on our machine (90 in this test case), we gain approximately another factor of 10 in runtime, bringing the function down to ~0.2 sec, or almost 10^3 times faster than numpy on the same machine:
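A sketch of raising the thread count. One caveat I'm fairly sure of: numexpr reads the `NUMEXPR_MAX_THREADS` environment variable at import time, so exceeding its default cap requires setting that variable before `import numexpr`; the 90-thread figure is from the test above, and here the detected core count is used instead so the snippet runs anywhere:

```python
import numpy as np
import numexpr as ne

# numexpr caps its default thread pool; set NUMEXPR_MAX_THREADS in the
# environment *before* importing numexpr to raise the hard ceiling.
# The issue above used 90 threads on the EPYC box; here we just use
# however many cores numexpr detects on this machine.
prev = ne.set_num_threads(ne.detect_number_of_cores())  # returns old setting

a = np.linspace(0.0, 1.0, 1000) + 1j * np.linspace(0.0, 1.0, 1000)
out = ne.evaluate("exp(a)")

ne.set_num_threads(prev)  # restore the previous thread count
```

With the thread pool sized to the machine, the same `evaluate("exp(a)")` call is what produces the ~0.2 sec figure quoted above.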
Looks like the next low-hanging fruit is the calls to np.reduce().
The exponential function is the slowest part of the beamlet propagation. There are several dimensions we could optimize over: propagations per second, propagations per watt, development time, etc. For now I'll focus on run time.