Closed: mperrin closed this pull request 6 years ago.
@douglase here's the PR I promised for you to take a look at. In particular I'm hoping you can confirm that my various edits to your code didn't break anything, or change the performance negatively. (Looks good as far as I can tell, but I'm cautious until I hear it's all working on your hardware as well)
Yeah, I realized last night after submitting this that I had accidentally left a bunch of debugging print statements in there. Those should mostly just come out entirely.
I suspect the unnecessary IO is part of what's causing the slowdowns.
Oh, actually it's worse than just unnecessary IO, since those debug statements were also summing the total intensity before and after each FFT. This was left over from when I was trying to debug some normalization issues.
Plus I wasn't using your _fftshift wrapper properly.
As of a6b632861ad278931c849d343f18e6b743a74139, runtimes are looking good.
I was able to run the test suite a dozen times in a row with no memory errors, which is far more reliable than what I was seeing before. I'm almost entirely convinced the leak was indeed due to a failure to clean up GPU state when calculations error out partway through. Did you encounter anything like that before?
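For what it's worth, the failure mode I have in mind is the usual one where an exception skips the cleanup step. A minimal sketch of the exception-safe pattern looks like the following; the three helper functions are numpy stand-ins of mine so the sketch runs anywhere, not poppy or pycuda/pyopencl functions.

```python
import numpy as np

# Numpy stand-ins for the real GPU upload / transform / free calls.
# These are placeholders for illustration only.
def _to_device(arr):
    return np.asarray(arr, dtype=np.complex128)   # placeholder for a GPU upload

def _device_fft(buf):
    return np.fft.fft2(buf)                       # placeholder for the GPU FFT

def _release(buf):
    pass                                          # placeholder for freeing GPU buffers

def fft_2d_safe(field):
    """FFT a 2D field, releasing device resources even if the transform
    raises partway through."""
    buf = _to_device(field)
    try:
        return _device_fft(buf)
    finally:
        _release(buf)   # runs on both the success and the error path

result = fft_2d_safe(np.random.randn(128, 128))
```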
In any case I re-enabled the plan cache, using the version where it's a file-level global inside `accel_math`, so it should not interfere with Wavefront copying or pickling.
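To illustrate that design, here is a minimal sketch of a module-level plan cache keyed on array shape and dtype; the `_PLAN_CACHE` and `_make_plan` names are placeholders of mine, not poppy's actual implementation. Because the cache lives at module scope rather than on the Wavefront object, deepcopying or pickling a Wavefront never touches it.

```python
import numpy as np

# Module-level ("file global") cache, keyed on what actually determines a plan:
# the array shape and dtype. Nothing is stored on the Wavefront itself.
_PLAN_CACHE = {}

def _make_plan(shape, dtype):
    # Placeholder: a real implementation would build a cuFFT/clFFT plan here.
    return ("plan", shape, np.dtype(dtype).name)

def get_fft_plan(arr):
    key = (arr.shape, arr.dtype.str)
    if key not in _PLAN_CACHE:
        _PLAN_CACHE[key] = _make_plan(arr.shape, arr.dtype)
    return _PLAN_CACHE[key]

plan = get_fft_plan(np.zeros((1024, 1024), dtype=np.complex128))
```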
@douglase I'm going to go ahead and merge this PR now. I had hoped to get this poppy release done and out the door this week. At this point I'm not expecting to do any additional technical work on this in the near term; I need to concentrate efforts on writing up for SPIE etc.
Incidentally, on my iMac Pro I've ended up in an interesting and surprising situation where plain numpy appears to be outperforming both FFTW and OpenCL. I've tested and benchmarked like crazy since this seemed nuts to me, but it proves to be the case. Apparently there are major speedups in numpy's intrinsic FFT in the Intel-optimized, MKL-linked version of numpy which is now in Conda. This ends up being a very big deal on the iMac Pro since its CPU (Xeon W 2140) supports the AVX-512 SIMD extension, i.e. each core can operate on the equivalent of 4 complex128 values at once. On this particular hardware that ends up equaling or slightly beating the current result of FFTW (not as well optimized for the SIMD instructions as Intel's hand-tuned code?), and substantially outdoing even this fairly high-end AMD GPU (which is tuned much more for single-precision graphics performance than for double-precision GPGPU work; the chip has many fewer double-precision-capable cores than its total core count).
Surprising but true, and it goes to show that performance optimization has to be tuned for each set of hardware. The optimal paths are totally different for my MacBook Pro laptop, my iMac Pro desktop, and your EWS GPGPU compute instances... It makes for a more complicated story, but that appears to be the case.
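For anyone who wants to reproduce the comparison on their own hardware, something along the lines of the following sketch is how I'd check whether the MKL-linked numpy FFT wins; the array size and iteration count are arbitrary choices here, and threads/planner flags are left at pyFFTW's defaults for brevity.

```python
import timeit
import numpy as np

# A representative double-precision 2D array.
a = (np.random.randn(2048, 2048) + 1j * np.random.randn(2048, 2048)).astype(np.complex128)

t_numpy = timeit.timeit(lambda: np.fft.fft2(a), number=20) / 20
print(f"numpy (MKL if conda-linked): {t_numpy * 1e3:.1f} ms per FFT")

try:
    import pyfftw
    import pyfftw.interfaces.numpy_fft as fftw_fft
    pyfftw.interfaces.cache.enable()   # reuse FFTW plans between calls
    fftw_fft.fft2(a)                   # warm-up call so planning cost isn't timed
    t_fftw = timeit.timeit(lambda: fftw_fft.fft2(a), number=20) / 20
    print(f"pyFFTW:                      {t_fftw * 1e3:.1f} ms per FFT")
except ImportError:
    print("pyFFTW not installed; skipping that comparison")
```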
This branch both adds support for OpenCL and refactors the CUDA implementation somewhat.
FFT-related code is now consolidated in `accel_math.py`, inside wrapper functions, in particular one called `fft_2d`. This allows the algorithmic code in the other files to be written without having to concern itself with how a given FFT is implemented. In principle this extends and generalizes the `_fft` and `_inv_fft` methods on the FresnelWavefront class (which in fact just become stubs that pass through to `accel_math.fft_2d`). This change involved refactoring key algorithmic parts of both `fresnel.py` and `poppy_core.py`.
OpenCL FFT support uses gpyfft, installable via `conda install -c ljbo3 gpyfft`. A new setting `poppy.conf.use_cuda` allows enabling or disabling this code, on machines with OpenCL-compatible GPUs. There is also a new setting `poppy.conf.double_precision`, which defaults to True. Many things work in single-precision mode, but currently many test cases fail, presumably because they're not smart enough to adjust tolerances appropriately. Best to ignore this setting for now for most purposes.

All tests pass locally for me, using any combo of plain numpy, numexpr+FFTW, CUDA, or OpenCL. On the other hand, this starts to get very machine-dependent in test setups, so further testing is much appreciated.
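To make the dispatch idea concrete, here is a minimal sketch of what a backend-selecting 2D FFT wrapper of this kind can look like. The flag names below mirror the settings described above but are stand-ins of mine, and the body is an illustrative simplification rather than poppy's actual `accel_math.fft_2d`.

```python
import numpy as np

# Illustrative stand-ins for the configuration switches described above.
USE_CUDA = False
USE_OPENCL = False
USE_FFTW = False
DOUBLE_PRECISION = True

def fft_2d(wavefront, forward=True):
    """Dispatch a 2D FFT to whichever accelerated backend is enabled,
    falling back to plain numpy. Simplified sketch, not poppy's code."""
    dtype = np.complex128 if DOUBLE_PRECISION else np.complex64
    arr = np.asarray(wavefront, dtype=dtype)

    if USE_CUDA or USE_OPENCL:
        # A real implementation would hand off to cuFFT (via pycuda/scikit-cuda)
        # or clFFT (via gpyfft) here; omitted in this sketch.
        raise NotImplementedError("GPU paths omitted from this sketch")
    if USE_FFTW:
        import pyfftw.interfaces.numpy_fft as fft_module
    else:
        fft_module = np.fft

    return fft_module.fft2(arr) if forward else fft_module.ifft2(arr)

field = np.ones((512, 512))
spectrum = fft_2d(field)   # callers never need to know which backend ran
```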
Some areas for future work before merging this:
- Ideally the CUDA FFT plans would be cached for reuse (I tried this with a cache in `accel_math.py`), but for unclear reasons doing so led to a memory leak on the GPU (buffers not being freed somehow?), which eventually led to calculations totally failing due to an inability to malloc new buffers on the GPU. I couldn't debug where the leak was, but simply not caching the plans avoids it entirely. The time to create a new CUDA FFT plan each time appears to be < 1 ms, so it's not zero but is relatively negligible. I'm not totally happy with this, but reliability comes first.
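As a concrete illustration of the non-caching workaround, and assuming the pycuda + scikit-cuda interface (an assumption about the exact libraries in use, not a statement about poppy's code), building a fresh cuFFT plan inside each call looks roughly like this; the ~1 ms plan-construction cost is small compared to the transform itself at typical array sizes.

```python
# Sketch assuming pycuda + scikit-cuda are installed; not poppy's actual code.
import numpy as np
import pycuda.autoinit                 # noqa: F401  (creates a CUDA context)
import pycuda.gpuarray as gpuarray
from skcuda.fft import Plan, fft

def cuda_fft_2d(arr):
    """Forward 2D FFT on the GPU, creating (and discarding) a plan per call.

    No plan is cached between calls, so nothing persists on the GPU that
    could leak if a later calculation fails partway through.
    """
    x_gpu = gpuarray.to_gpu(np.ascontiguousarray(arr, dtype=np.complex128))
    y_gpu = gpuarray.empty(arr.shape, np.complex128)
    plan = Plan(arr.shape, np.complex128, np.complex128)   # ~1 ms to build
    fft(x_gpu, y_gpu, plan)
    return y_gpu.get()
```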