vincefn / pyvkfft

Python interface to VkFFT
MIT License

pyvkfft - python interface to the CUDA and OpenCL backends of VkFFT (Vulkan Fast Fourier Transform library)

`VkFFT <https://github.com/DTolm/VkFFT>`_ is a GPU-accelerated Fast Fourier Transform library for Vulkan/CUDA/HIP/OpenCL.

pyvkfft offers a simple python interface to the CUDA and OpenCL backends of VkFFT, compatible with pyCUDA, CuPy and pyOpenCL.

The documentation can be found at https://pyvkfft.readthedocs.io

Installation

Install using ``pip install pyvkfft`` (works on macOS, Linux and Windows). See below for an installation using conda-forge, or for an installation from source.

Notes:

Requirements:

This package can be installed from source using ``pip install .``.

Note: ``python setup.py install`` is now disabled, to avoid messed-up environments where both methods have been used.

Installation using conda
^^^^^^^^^^^^^^^^^^^^^^^^

You can use ``conda`` (or the much faster `mamba <https://mamba.readthedocs.io>`_) to install pre-compiled binaries with CUDA and OpenCL support on linux-x86_64, linux-aarch64, linux-ppc64le, win-amd64, macos-x86_64, macos-arm64 platforms.

.. code-block:: shell

    conda config --add channels conda-forge
    conda install pyvkfft

Note regarding CUDA support: several package versions of pyvkfft are available, either with OpenCL support only, or compiled against the CUDA nvrtc library versions 11.2, 11.8 or 12.x. If you want CUDA support, install pyvkfft together with the ``cuda-version`` meta-package to select a specific CUDA version. For example:

.. code-block:: shell

    # Only install pyvkfft, select cuda nvrtc 11.2
    conda install pyvkfft cuda-version=11.2

    # Install pyvkfft, pyopencl, cupy with nvrtc version 12
    conda install pyvkfft pyopencl cupy cuda-version=12

The only constraint is that the installed CUDA driver must be more recent than the requested CUDA nvrtc version (type ``conda info`` or ``mamba info`` to see conda's detected ``__cuda`` variable).

See more information in `conda-forge's documentation <https://conda-forge.org/docs/maintainer/knowledge_base.html#cuda-builds>`_.

Once installed, you can use the ``pyvkfft-info`` script to see which languages, backends (pyopencl, pycuda, cupy) and GPU devices are available.

Installation from source (git)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: shell

    git clone --recursive https://github.com/vincefn/pyvkfft.git
    cd pyvkfft
    pip install .

As indicated above, you can use the environment variables ``VKFFT_BACKEND`` and ``VKFFT_MAX_FFT_DIMENSIONS`` during the ``pip install`` to select the backend or the maximum number of transformed dimensions.
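As a minimal sketch of how these variables could be passed to a source install from Python: the helper name ``build_install`` and the values ``"opencl"`` and ``8`` are illustrative assumptions, not values taken from this document. The command is only built here, not executed.

.. code-block:: python

    import os
    import sys


    def build_install(backend, max_dims=None, source="."):
        """Return (command, environment) for a pip install of pyvkfft.

        Hypothetical helper: it only prepares the call, it does not run it.
        """
        env = dict(os.environ)  # copy, so the current process is not modified
        env["VKFFT_BACKEND"] = backend
        if max_dims is not None:
            env["VKFFT_MAX_FFT_DIMENSIONS"] = str(max_dims)
        cmd = [sys.executable, "-m", "pip", "install", source]
        return cmd, env


    # Assumed example values; check the installation notes for accepted ones.
    cmd, env = build_install(backend="opencl", max_dims=8)

    # To actually run the install from the pyvkfft source directory:
    # import subprocess; subprocess.run(cmd, env=env, check=True)

Building the environment as a copy keeps the setting local to the install command, so it does not leak into the rest of the session.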

Examples

The simplest way to use pyvkfft is the ``pyvkfft.fft`` interface, which automatically creates the ``VkFFTApp`` (the FFT plans) according to the type of GPU arrays (pycuda, pyopencl or cupy), and also caches these apps:

.. code-block:: python

    import pycuda.autoprimaryctx
    import pycuda.gpuarray as cua
    from pyvkfft.fft import fftn
    import numpy as np

    d0 = cua.to_gpu(np.random.uniform(0,1,(200,200)).astype(np.complex64))
    # This will compute the fft to a new GPU array
    d1 = fftn(d0)

    # An in-place transform can also be done by specifying the destination
    d0 = fftn(d0, d0)

    # Or an out-of-place transform to an existing array (the destination array is always returned)
    d1 = fftn(d0, d1)

See the scripts and notebooks in the examples directory. An example notebook is also available on `Google Colab <https://colab.research.google.com/drive/1YJKtIwM3ZwyXnMZfgFVcpbX7H-h02Iej?usp=sharing>`_. Make sure to select a GPU for the runtime.

Features

Performance

See the benchmark notebook, which plots the OpenCL and CUDA backend throughput and compares it with cuFFT (using scikit-cuda) and clFFT (using gpyfft).

The ``pyvkfft-benchmark`` script is available to run simple or systematic tests, and can also compare with cuFFT and clFFT.

Example results for batched 2D, single precision FFT with array dimensions of batch x N x N using a V100:

.. image:: https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/benchmark-2DFFT-NVIDIA-Tesla_V100-Linux.png

Notes regarding this plot:

The general results are:

Another example on an A40 card (only with radix<=13 transforms):

.. image:: https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/benchmark-2DFFT-NVIDIA-Tesla_A40-Linux-radix13.png

On this card, cuFFT is significantly better, even though the radix-11 and radix-13 transforms supported by VkFFT give globally better results.
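To make the "radix<=13" restriction concrete: a transform size qualifies when all of its prime factors are at most 13. The following sketch lists such sizes; the function names are hypothetical and not part of pyvkfft.

.. code-block:: python

    def max_prime_factor(n):
        """Return the largest prime factor of n (n >= 2), by trial division."""
        if n < 2:
            raise ValueError("n must be >= 2")
        largest = 1
        p = 2
        while p * p <= n:
            while n % p == 0:
                largest = p
                n //= p
            p += 1
        if n > 1:  # whatever remains is a prime larger than any factor found
            largest = n
        return largest


    def is_radix_limited(n, max_radix=13):
        """True if n only factors into primes <= max_radix."""
        return max_prime_factor(n) <= max_radix


    # Sizes between 2 and 40 usable with radix<=13 transforms
    sizes = [n for n in range(2, 41) if is_radix_limited(n)]

With ``max_radix=7`` the same test selects the radix 2, 3, 5 and 7 sizes.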

Performance tuning
^^^^^^^^^^^^^^^^^^

Starting with VkFFT 1.3.0 and pyvkfft 2023.2, it is possible to tweak low-level parameters including coalesced memory or warp size, batch grouping, number of threads, etc.

Optimising those is difficult, so only do it out of curiosity or when trying to get some extra performance. Generally, the VkFFT defaults work quite well. Using the simple FFT API, you can activate auto-tuning by passing ``tuning=True`` to the transform functions (``fftn``, ``rfftn``, etc.). Only do this for iterative processes which really require fine-tuning!

Here is an example of the benchmark run on a V100 GPU while tuning the ``coalescedMemory`` parameter (default value=32):

.. image:: https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/benchmark-V100-cuda-2D-coalmem.png

As you can see, the optimal value varies with the 2D array size: below n=1536, using ``coalescedMemory=64`` gives the best results, 32 (default) is best between 1536 and 2048, and above that there is little difference between the values chosen.
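The V100 observations above can be summarised as a small lookup. This is only an illustration of those reported thresholds: the function name is hypothetical, and the values should not be expected to transfer to other GPUs.

.. code-block:: python

    def v100_coalesced_memory(n):
        """coalescedMemory value that performed best for an n x n 2D FFT
        in the V100 benchmark described above (illustrative only)."""
        if n < 1536:
            return 64
        # 32 (the default) is best from 1536 to 2048; above that the
        # choice made little difference, so the default is kept.
        return 32


    choices = {n: v100_coalesced_memory(n) for n in (512, 1536, 4096)}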

The same test on an A40 shows little difference. On an Apple M1 Pro, it is the ``aimThreads`` parameter that benefits from tuning: changing it from 128 (default) to 64 yields up to 50% faster transforms. YMMV!

Accuracy

See the accuracy notebook, which compares the accuracy of different FFT libraries (pyvkfft with different options and backends, scikit-cuda (cuFFT), pyfftw), using pyfftw long-double precision as a reference.
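As a rough sketch of how a transform can be scored against a reference, the example below computes a relative L2 error. The choice of this norm is an assumption (the metric used by the notebook is not stated here), and the naive DFT merely stands in for a high-precision reference.

.. code-block:: python

    import cmath
    import math


    def naive_dft(x):
        """O(n^2) forward DFT, used here as a stand-in reference."""
        n = len(x)
        return [
            sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)
        ]


    def relative_l2_error(result, reference):
        """||result - reference||_2 / ||reference||_2"""
        num = math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(result, reference)))
        den = math.sqrt(sum(abs(b) ** 2 for b in reference))
        return num / den


    signal = [1.0, 2.0, 0.0, -1.0]
    reference = naive_dft(signal)
    # Mimic a less accurate transform with a uniform relative perturbation
    perturbed = [v * (1 + 1e-6) for v in reference]
    err = relative_l2_error(perturbed, reference)

In practice ``result`` would be the GPU transform copied back to the host and ``reference`` the long-double result.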

Example results for 1D transforms (radix 2,3,5 and 7) using a Titan V:

.. image:: https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/accuracy-1DFFT-TITAN_V.png

Analysis:

You can easily test a transform using the ``pyvkfft-test`` command-line script, e.g.: ``pyvkfft-test --systematic --backend pycuda --nproc 8 --range 2 4500 --radix --ndim 2``

Use ``pyvkfft-test --help`` to list available options.

You can use the ``pyvkfft-test-suite`` script to run the comprehensive test suite which is used to evaluate pyvkfft before a new release. Several options are available to target specific (C2C, R2C...) transforms or even run a random subset of transform sizes for fast detection of issues.

TODO

Authors & acknowledgements