Unified GPU framework using either CUDA or OpenCL

mperrin commented 6 years ago

This branch both adds support for OpenCL and refactors the CUDA implementation somewhat.

Essentially all GPU and FFTW related functionality is now relocated into accel_math.py, inside wrapper functions, in particular one called fft_2d. This allows the algorithmic code in the other files to be written without having to concern itself with how a given FFT is implemented. In principle this extends and generales the _fft and _inv_fft methods on the FresnelWavefront class (which in fact just become stubs that pass through to accel_math.fft_2d. This change involved refactoring key algorithmic parts of both fresnel.py and poppy_core.py.
OpenCL-based implementation of GPU accelerated FFTs added. Dependencies are pyopencl, conda, gpyfft. The first two can be installed from conda-forge. The latter is in a user contributed conda package, so use conda install -c ljbo3 gpyfft. A new setting poppy.conf.use_cuda allows enabling or disabling this code, on machines with OpenCL-compatible GPUs.
Work in progress partial implementation of toggling between single and double precision floating point math, controlled by a new setting poppy.conf.double_precision which defaults to True. Many things work in single precision mode but currently many test cases fail, presumably since they're not smart enough to adjust tolerances appropriately. Best to ignore this setting for now for most purposes.

All tests pass locally for me, using any combo of plain numpy, numexpr+fftw, CUDA, or OpenCL. On the other hand this starts to get very machine-dependent in test setups so further testing is much appreciated.

Some areas for future work before merging this:

In refactoring the CUDA functionality out of the FresnelWavefront class, that lost the ability to cache CUDA FFT plans as part of FresnelWavefront objects. I tried caching them as a module-level variable in accel_math.py, but for unclear reasons doing so led to a memory leak on the GPU (buffers not being freed somehow?), which eventually led to calculations totally failing due to inability to malloc new buffers on the GPU. I couldn't debug where the leak was, but simply not caching the plans avoids it entirely. The time to create a new CUDA FFT plan each time appears < 1 ms, so it's not zero but is relatively negligible. I'm not totally happy with this but reliability comes first.
The single precision stuff is all use-at-your-own-risk. Needs more debugging and attention tothe unit test tolerances.
Needs docs!
Travis doesn't support CI for GPUs, so all the meaningful testing for this will need to be run locally. Need to investigate the available CUDA compute servers at STScI.
Benchmark functions would be nice to include. Probably a few relevant cases, perhaps (1) bare FFT (2) coronagraphic multi-plane OpticalSystem calc in Fraunhofer regime (3) Fresnel multi-plane calc.

mperrin commented 6 years ago

@douglase here's the PR I promised for you to take a look at. In particular I'm hoping you can confirm that my various edits to your code didn't break anything, or change the performance negatively. (Looks good as far as I can tell, but I'm cautious until I hear it's all working on your hardware as well)

coveralls commented 6 years ago

Coverage increased (+0.1%) to 63.672% when pulling db00838a65820f732278b6532fec20b7da80228d on opencl into 34ea05c8041594d3a5ab23befc16ba4b33fc87a3 on master.

mperrin commented 6 years ago

Yeah I realized last night after submitting this that I had accidentally left a bunch of debugging print statements in there unintentionally. Those should mostly just come out entirely.

I suspect the unnecessary IO is part of what's causing the slowdowns.

mperrin commented 6 years ago

oh, actually it's worse than just unnecessary IO, since those debug statements were also summing the total intensity before and after each FFT. This was leftover from when I was trying to debug some normalization issues.

mperrin commented 6 years ago

plus I wasn't using your _fftshift wrapper properly.

douglase commented 6 years ago

a6b632861ad278931c849d343f18e6b743a74139 runtimes are looking good

mperrin commented 6 years ago

I was able to run the test suite a dozen times in a row with no memory errors, which is far beyond the problems I was seeing before. I'm almost entirely convinced the leak was indeed due to failure to cleanup GPU state when calculations error out part way through. Did you encounter anything like that before?
In any case I re-enabled the plan cache, using the version where it's file global inside accel_math, so it should not interfere with Wavefront copying or pickling.

mperrin commented 6 years ago

@douglase I'm going to go ahead and merge this PR now. I had hoped to get this poppy release done and out the door this week. At this point I'm not expecting to do any additional technical work on this in the near term - need to concentrate efforts on writing up for SPIE etc.

Incidentally, on my iMac Pro I've ended up in an interesting and surprising situation where plain numpy appears to be outperforming both FFTW and OpenCL. Have tested and benchmarked like crazy since this seemed nuts to me, but it proves to be the case. Apparently there's major speedups in numpy's intrinsic FFT in the Intel-optimized MKL-linked version of numpy which is now in Conda. This ends up being a very big deal on the iMac Pro since its CPU (Xeon W 2140) supports the AVX-512 SIMD extension, i.e. each core can operate on the equivalent of 4 * complex128s at once. On this particular hardware that ends up equalling or slightly beating the current result of FFTW (not as well optimized for the SIMD instructions as Intel's hand-tuned code?), and substantially outdoing even this fairly high-end AMD GPU (which is tuned much more for single-precision graphics performance than for double-precision GPGPU work; the chip has many fewer double-precision-capables cores than its total number of cores.)

Surprising but true, and goes to show that performance optimization has to be tuned for each set of hardware. Totally different optimal paths for my MacBook Pro laptop, iMac Pro desktop, and your EWS-instance GPGPU compute instances... Makes for a more complicated story but appears to be the case.

mperrin / poppy

Unified GPU framework using either CUDA or OpenCL #250