`VkFFT <https://github.com/DTolm/VkFFT>`_ is a GPU-accelerated Fast Fourier Transform library
for Vulkan/CUDA/HIP/OpenCL.
pyvkfft offers a simple python interface to the CUDA and OpenCL backends of VkFFT, compatible with pyCUDA, CuPy and pyOpenCL.
The documentation can be found at https://pyvkfft.readthedocs.io
Install using ``pip install pyvkfft`` (works on macOS, Linux and Windows).
See below for an installation using conda-forge, or for an installation from source.
Notes:

- ``pyopencl`` is automatically installed if opencl is available. However you should manually install either ``cupy``
  or ``pycuda``
  to use the cuda backend.
- You can select the backend to install (e.g. if you have ``nvcc``
  installed but cuda is not actually available) using e.g. ``VKFFT_BACKEND=opencl pip install pyvkfft``. By default the opencl
  backend is always installed, and the cuda one if nvcc is found.
- To transform arrays along more dimensions than the default maximum, install with e.g. ``VKFFT_MAX_FFT_DIMENSIONS=10 pip install pyvkfft`` (see the combined example below).
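Both variables are only read during the ``pip install`` step and can be combined; a sketch (values chosen arbitrarily):

.. code-block:: shell

    # Hypothetical example: build only the opencl backend and allow up to 10 transform dimensions
    VKFFT_BACKEND=opencl VKFFT_MAX_FFT_DIMENSIONS=10 pip install pyvkfft
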
Requirements:

- ``pyopencl`` and the opencl libraries/development tools for the opencl backend
- ``pycuda`` or ``cupy`` and the CUDA development tools (``nvcc``, ``nvrtc`` library) for the cuda backend
- ``numpy``
- alternatively, pre-compiled packages can be installed using ``conda``, as detailed below

Optional:

- ``scipy`` and ``pyfftw`` for more accurate tests (and to test DCT/DST)

This package can be installed from source using ``pip install .``.
Note: ``python setup.py install``
is now disabled, to avoid messed-up environments
where both methods have been used.
Installation using conda
^^^^^^^^^^^^^^^^^^^^^^^^

You can use ``conda``
(or the much faster `mamba <https://mamba.readthedocs.io>`_)
to install pre-compiled binaries with CUDA and OpenCL support
on the linux-x86_64, linux-aarch64, linux-ppc64le, win-amd64, macos-x86_64 and macos-arm64
platforms.
.. code-block:: shell

    conda config --add channels conda-forge
    conda install pyvkfft

Note regarding CUDA support: there are multiple package versions of
``pyvkfft``
available, with either only OpenCL support, or compiled using
the cuda nvrtc library versions 11.2, 11.8 or 12.x. If you want cuda support,
you can install ``pyvkfft``
while using the ``cuda-version``
meta-package to select
a specific cuda version. For example:
.. code-block:: shell

    conda install pyvkfft cuda-version=11.2
    conda install pyvkfft pyopencl cupy cuda-version=12

The only constraint is that the cuda driver must be more recent than the
requested cuda nvrtc version (type ``conda info``
or ``mamba info``
to see conda's detected ``__cuda``
virtual package).
See more information in `conda-forge's documentation <https://conda-forge.org/docs/maintainer/knowledge_base.html#cuda-builds>`_.
Once installed, you can use the ``pyvkfft-info``
script to see which
languages, backends (pyopencl, pycuda, cupy) and GPU devices are available.
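For example, from a shell after installation:

.. code-block:: shell

    pyvkfft-info
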
Installation from source (git)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: shell

    git clone --recursive https://github.com/vincefn/pyvkfft.git
    cd pyvkfft
    pip install .

As indicated above, you can use the environment variables
``VKFFT_BACKEND``
and ``VKFFT_MAX_FFT_DIMENSIONS``
during the ``pip``
install to select the backend or the maximum number of transformed
dimensions.
The simplest way to use pyvkfft is the ``pyvkfft.fft``
interface, which will
automatically create the VkFFTApp (the FFT plans) according to the type of GPU
arrays (pycuda, pyopencl or cupy), and also cache these apps:
.. code-block:: python

    import pycuda.autoprimaryctx
    import pycuda.gpuarray as cua
    from pyvkfft.fft import fftn
    import numpy as np

    d0 = cua.to_gpu(np.random.uniform(0, 1, (200, 200)).astype(np.complex64))
    d1 = fftn(d0)      # out-of-place transform to a new GPU array
    d0 = fftn(d0, d0)  # in-place transform
    d1 = fftn(d0, d1)  # out-of-place transform to an existing array (the destination is returned)

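The same interface works transparently with pyopencl arrays. A minimal sketch (the context/queue creation via ``create_some_context`` is only for illustration):

.. code-block:: python

    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cla
    from pyvkfft.fft import fftn, ifftn

    ctx = cl.create_some_context()  # pick any available OpenCL device
    cq = cl.CommandQueue(ctx)

    d0 = cla.to_device(cq, np.random.uniform(0, 1, (200, 200)).astype(np.complex64))
    d1 = fftn(d0)   # the VkFFTApp is created and cached automatically
    d2 = ifftn(d1)  # inverse transform
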
See the scripts and notebooks in the examples directory.
An example notebook is also available on `google colab <https://colab.research.google.com/drive/1YJKtIwM3ZwyXnMZfgFVcpbX7H-h02Iej?usp=sharing>`_.
Make sure to select a GPU for the runtime.
Other features include:

- transforms along more dimensions than the default maximum (set ``VKFFT_MAX_FFT_DIMENSIONS``
  when installing)
- transforms along arbitrary axes, e.g. ``axes=(-3,-1)``. For R2C transforms, the fast axis must be transformed.
- access either through the low-level interfaces (``pyvkfft.cuda``
  for pycuda/cupy or ``pyvkfft.opencl``
  for pyopencl),
  or through the ``pyvkfft.fft``
  interface with the ``fftn``, ``ifftn``, ``rfftn``
  and ``irfftn``
  functions, which automatically detect the type of GPU array and cache the
  corresponding VkFFTApp (see the example notebook pyvkfft-fft.ipynb and the sketch below)
- the ``pyvkfft-test``
  command-line script allows testing specific transforms against
  expected accuracy values, for all types of transforms.

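As a sketch of the low-level interface for the 2D in-place case (the keyword names below are assumptions to be checked against the API documentation):

.. code-block:: python

    import numpy as np
    import pycuda.autoprimaryctx
    import pycuda.gpuarray as cua
    from pyvkfft.cuda import VkFFTApp

    d = cua.to_gpu(np.random.uniform(0, 1, (200, 200)).astype(np.complex64))
    # Explicitly create the FFT plan (2D, in-place); keyword names are assumed, see the docs
    app = VkFFTApp(d.shape, d.dtype, ndim=2, inplace=True)
    d = app.fft(d)   # forward transform
    d = app.ifft(d)  # inverse transform
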
pyvkfft results are evaluated before any release with a comprehensive test
suite, comparing transform results for all types of transforms: single and double
precision, 1D, 2D and 3D, inplace and out-of-place, different norms, radix and
Bluestein, etc. The ``pyvkfft-test-suite``
script can be used to run the full suite,
which takes more than two days on an A40 GPU using up to 16 parallel processes, with
about 1.5 million unit tests.
Here are the test results for pyvkfft 2024.1:
- `A40 cuda test results <http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2024-01-21-a40cu/pyvkfft-test.html>`_
- `H100 opencl test results <http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2024-01-21-h100cl/pyvkfft-test.html>`_
- `Apple M1 OpenCL test results <http://ftp.esrf.fr/pub/scisoft/PyNX/pyvkfft-test/pyvkfft-test-2024-01-21-apple-m1/pyvkfft-test.html>`_

See the benchmark notebook, which allows plotting the OpenCL and CUDA backend throughput, as well as comparing with cuFFT (using scikit-cuda) and clFFT (using gpyfft).
The ``pyvkfft-benchmark``
script is available to make simple or systematic tests,
also allowing comparison with cuFFT and clFFT.
Example results for batched 2D, single precision FFT with array dimensions of batch x N x N using a V100:
Another example on an A40 card (only with radix<=13 transforms):
On this card cuFFT is significantly faster, even if the radix-11 and radix-13 transforms supported by VkFFT give globally better results.
Performance tuning
^^^^^^^^^^^^^^^^^^

Starting with VkFFT 1.3.0 and pyvkfft 2023.2, it is possible to tweak low-level parameters including the coalesced memory or warp size, batch grouping, number of threads, etc.
Optimising those is difficult, so only do it out of curiosity or when trying to get some
extra performance. Generally, VkFFT defaults work quite well. Using the
simple FFT API, you can activate auto-tuning by passing ``tuning=True``
to the
transform functions (``fftn``, ``rfftn``, etc.). Only do this for iterative
processes which really require fine-tuning!
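A minimal sketch, assuming the ``tuning`` keyword described above (the array shape is arbitrary, and the tuned plan is cached like any other VkFFTApp):

.. code-block:: python

    import numpy as np
    import pycuda.autoprimaryctx
    import pycuda.gpuarray as cua
    from pyvkfft.fft import fftn

    d = cua.to_gpu(np.random.uniform(0, 1, (16, 1024, 1024)).astype(np.complex64))
    # The first call runs the auto-tuning (slower); later calls reuse the cached, tuned plan
    d = fftn(d, d, tuning=True)
    d = fftn(d, d, tuning=True)
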
Here is an example of the benchmark run on a V100 GPU by tuning the
``coalescedMemory``
parameter (default value=32):

.. image:: https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/benchmark-V100-cuda-2D-coalmem.png
As you can see, the optimal value varies with the 2D array size: below
n=1536, using ``coalescedMemory=64``
gives the best results, ``32``
(default)
is best between 1536 and 2048, and above that there is little difference
between the chosen values.
The same test on an A40 shows little difference. On an Apple M1 Pro,
it is the ``aimThreads``
parameter which is best tuned from 128 (default)
down to 64, yielding up to 50% faster transforms. YMMV!
See the accuracy notebook, which allows comparing the accuracy of different FFT libraries (pyvkfft with different options and backends, scikit-cuda (cuFFT), pyfftw), using pyfftw long-double precision as a reference.
Example results for 1D transforms (radix 2, 3, 5 and 7) using a Titan V:

.. image:: https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/accuracy-1DFFT-TITAN_V.png
You can easily test a transform using the ``pyvkfft-test``
command-line script, e.g.:

.. code-block:: shell

    pyvkfft-test --systematic --backend pycuda --nproc 8 --range 2 4500 --radix --ndim 2

Use ``pyvkfft-test --help``
to list available options.
You can use the ``pyvkfft-test-suite``
script to run the comprehensive
test suite which is used to evaluate pyvkfft before a new release. Several
options are available to target specific (C2C, R2C...) transforms, or to
run a random subset of transform sizes for fast detection of issues.
Only the CUDA and OpenCL backends of VkFFT are interfaced; access to the other backends (Vulkan, HIP) is not provided.

Thanks to the `VkFFT <https://github.com/DTolm/VkFFT>`_ author.