vincefn / pyvkfft

Python interface to VkFFT

Zero-padding support? #12

Open tridao opened 2 years ago

tridao commented 2 years ago

Hi,

VkFFT says they support zero padding: "Native zero padding to model open systems (up to 2x faster than simply padding input array with zeros). Can specify the range of sequences filled with zeros and the direction where zero padding is applied (read or write stage)"

Is this available through the Python interface? Our use case requires us to pad the input, and the padding can take as long as the FFT itself, so native zero-padding could indeed speed things up by up to 2x.

vincefn commented 2 years ago

I have not tried this, so it is not implemented in Python yet. @DTolm, can you clarify one point: if I read the manual correctly, the zero-padding option avoids reading the padded areas and the corresponding calculations, but the buffer size still has to be the full (padded) array. Is that correct?

DTolm commented 2 years ago

@vincefn yes, this is correct.

In some cases, it is actually possible to configure the buffer to be smaller: one-upload sizes, zero-padding from some N/2 to N (or something similar to N/2, but it must go to the end of the sequence) and custom strides. Though I wouldn't go this far in the beginning.

vincefn commented 2 years ago

@tridao you mentioned that performing the padding requires as much time as the FFT, so is it still worthwhile, given that you have to pad the array anyway? What kind of array dimensions are you working with?

tridao commented 2 years ago

I'm thinking of the out-of-place case, where we can keep the input buffer as is and just allocate a larger buffer for the output. Allocation is probably much faster than zero-filling the buffer. For example, let's say I have an input of size 512 and I want to do a 1D C2C FFT of size 1024. Can I keep the input buffer of size 512 * ComplexSize, allocate an output buffer of size 1024 * ComplexSize, and call VkFFT?

I'm working with input dimension (batch_size, 512) and I'd like to do 1D FFT of size 1024. Input batch size is around 16k.
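For context, the manual route being discussed looks roughly like the sketch below (a minimal illustration assuming pyopencl and pyvkfft's simple pyvkfft.fft interface, with the sizes from this comment; ndim=1 requests a batched 1D transform over the last axis). The host-side zero-fill of the padded buffer is the step that native zero-padding would avoid:

import numpy as np
import pyopencl as cl
import pyopencl.array as cla
from pyvkfft.fft import fftn

ctx = cl.create_some_context()
cq = cl.CommandQueue(ctx)

batch, n_in, n_fft = 16384, 512, 1024
rng = np.random.default_rng()

# 16k batches of 512 complex64 samples (the un-padded input)
a = (rng.standard_normal((batch, n_in))
     + 1j * rng.standard_normal((batch, n_in))).astype(np.complex64)

# manual zero-padding: a (batch, 1024) buffer with the input in front, zeros behind
a_pad = np.zeros((batch, n_fft), dtype=np.complex64)
a_pad[:, :n_in] = a

d = cla.to_device(cq, a_pad)
d_fft = fftn(d, ndim=1)  # batched 1D C2C over the last axis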

DTolm commented 2 years ago

This one is possible in VkFFT (this is what I described above as an example). An in-place version where the padded area is not zero-initialized (only assumed to be) is possible as well.

psobolewskiPhD commented 2 years ago

I assume the OP is asking with regard to avoiding circular convolution or something similar, but on a related note, does pyvkfft (and/or VkFFT) do zero-padding to the next fast length, à la https://docs.scipy.org/doc/scipy/reference/generated/scipy.fft.next_fast_len.html#scipy.fft.next_fast_len ?

DTolm commented 2 years ago

The concept of fast lengths also applies in VkFFT, as sequences that are decomposable as a product of primes up to 13 will be faster. Zero-padding is a separate optimization, and it is more flexible as the user can specify to which size VkFFT should pad. Taking the example from your link: yes, VkFFT can pad from 93059 to 93312 with zeros. It can also pad from 93059 to 131072. It can also assume that in a 93059-long sequence, elements from 5 to 90594 (or any other range) are zeros.
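To make the fast-length point concrete, here is a small host-side sketch (scipy and numpy only; next_13_smooth is an illustrative helper written for this thread, not part of VkFFT or pyvkfft) that picks a padded size with small prime factors, as in the 93059 -> 93312 example above:

import numpy as np
from scipy.fft import next_fast_len

def next_13_smooth(n):
    # smallest m >= n whose prime factors are all <= 13, matching the
    # "primes up to 13" radix set mentioned above (illustrative helper)
    m = n
    while True:
        r = m
        for p in (2, 3, 5, 7, 11, 13):
            while r % p == 0:
                r //= p
        if r == 1:
            return m
        m += 1

target = 93059
print(next_fast_len(target, real=True))  # 93312 = 2**7 * 3**6, as in the scipy docs
print(next_13_smooth(target))            # a (possibly smaller) 13-smooth length

# the padding itself is just np.pad, e.g. from 93059 up to 93312:
a = np.random.default_rng().standard_normal(target).astype(np.float32)
a_padded = np.pad(a, (0, 93312 - target))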

psobolewskiPhD commented 2 years ago

I wanted to clarify this so on a lark I tried to run the example from scipy:

import numpy as np
import pyopencl as cl
import pyopencl.array as cla
import pyopencl.tools as cltools
import pyvkfft.opencl
from pyvkfft.fft import irfftn, rfftn
from scipy import fft

# OpenCL context and command queue (cq), used by all the device arrays below
ctx = cl.create_some_context()
cq = cl.CommandQueue(ctx)

rng = np.random.default_rng()
min_len = 93059
a = rng.standard_normal(min_len)

mempool = cltools.MemoryPool(cltools.ImmediateAllocator(cq))

dr = cla.to_device(cq, a.astype(np.float32))
%timeit dc = rfftn(dr)

This gives an error: RuntimeError: VkFFT error 3003: VKFFT_ERROR_UNSUPPORTED_FFT_LENGTH_R2C R2C (93059,) float32 norm=1 [opencl]

If I change min_len to 93060 then everything is fine.

Using scipy, there is a marked difference between 93060 and next_fast_len 93312: about 50%. But with pyvkfft, using:

z = np.pad(a, [126, 126])
dr2 = cla.to_device(cq, z.astype(np.float32))

vs. the original 93060 makes no difference, and it's 30x faster than scipy with next_fast_len. So I guess that, as long as the size isn't prime, padding to fast lengths is automatic on the backend?

DTolm commented 2 years ago

This gives an error:

VkFFT doesn't support multi-upload R2C/C2R for odd FFT sizes yet. So, this is why you are getting this error.

Using scipy, there is a marked difference between 93060 and next_fast_len 93312: about 50%.

There is a performance difference between 93060 and 93312 in VkFFT as well (one is a Bluestein sequence, the other is a nice combination of 2s and 3s). For an Nvidia 3070, FP32 R2C with a batch of 100:

VkFFT System: 93060x1x1 Batch: 100 Buffer: 35 MB avg_time_per_step: 2.350 ms std_error: 0.005 num_iter: 100 benchmark: 15471 scaled bandwidth: 59.0 real bandwidth: 118.0
VkFFT System: 93312x1x1 Batch: 100 Buffer: 35 MB avg_time_per_step: 0.744 ms std_error: 0.000 num_iter: 100 benchmark: 48964 scaled bandwidth: 186.8 real bandwidth: 373.6

psobolewskiPhD commented 2 years ago

There is a performance difference between 93060 and 93312 in VkFFT as well (one is a Bluestein sequence, the other is a nice combination of 2s and 3s). For an Nvidia 3070, FP32 R2C with a batch of 100:

Yup, I apologize, I messed up. Using the IPython %timeit magic may result in unpredictable behavior, I think, because the command queue can crash. I've reformulated my test function based on the benchmarking code in this repo:

def goout(ar, q=cq, repeat=10):
    # rfftn/irfftn round trip, repeated `repeat` times; q.finish() ensures all
    # queued work is done so the timing is not skewed by asynchronous execution
    for i in range(repeat):
        q.finish()
        d = rfftn(ar)
        d = irfftn(d)
        q.finish()
    return d

Now running the scipy example with pyvkfft shows a difference. Here's the setup:

rng = np.random.default_rng()
min_len = 93060
a = rng.standard_normal(min_len)

import pyopencl.tools as cltools
mempool = cltools.MemoryPool(cltools.ImmediateAllocator(cq))

dr = cla.to_device(cq, a.astype(np.float32))
z = np.pad(a, [126, 126])  # pads 93060 -> 93312, i.e. to fft.next_fast_len(min_len, real=True)
dr2 = cla.to_device(cq, z.astype(np.float32))

Running %timeit e = goout(dr, cq, 10) yields: 14.4 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Running %timeit f = goout(dr2, cq, 10) yields: 7.78 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

So clearly zero-padding to the next fast length is worth it speed-wise.

Edit: Another question: what about 2D or 3D FFTs? 1) If a 2D array is not square, then each dimension should be padded to its next fast length, right? 2) For 3D, the z dimension should also be padded to the next fast length, with 2D zero-layers?

Edit2: to provide context, I'm playing with 3D image stack deconvolution. I know I need to pad to avoid circular convolutions.

DTolm commented 2 years ago

Edit: Another question: what about 2D or 3D FFTs?

As they use the same logic and the same algorithms (like Bluestein's for big primes), the same next-fast-length logic holds true there as well.
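Following up on the 2D/3D question: per-axis padding can be done on the host in the same way. A minimal numpy/scipy sketch (the stack shape here is an arbitrary example, and padding only at the end of each axis is just one possible choice):

import numpy as np
from scipy.fft import next_fast_len

# example 3D image stack with awkward dimensions (illustrative shape)
stack = np.random.default_rng().standard_normal((101, 487, 513)).astype(np.float32)

# pad every axis to its own next fast length (real-valued transform)
fast_shape = [next_fast_len(s, real=True) for s in stack.shape]
pad = [(0, f - s) for f, s in zip(fast_shape, stack.shape)]
stack_padded = np.pad(stack, pad)  # zero layers/rows/columns appended on each axis

print(stack.shape, '->', stack_padded.shape)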

tangjinchuan commented 2 years ago

Dear @DTolm, I would like to ask a question I faced regarding the padding. Let A be an all-ones matrix of height 8 and width 2, stored in row-major order as 16 elements in total. I would like to do the FFT of each column, i.e. with a batch size of 2. However, this time the FFT size is 9 instead of 8 for each column (which is quite common in Matlab/Octave/FFTW). Do I need to form a new matrix B of size 9x2 first, then copy A to the top of B (leaving the last row uninitialised), and finally do the FFT with the zero-padding settings fft_zeropad_left[0]=8 and fft_zeropad_right[0]= any value above 8?

Currently, I tried to use A directly instead of B, and had no luck with the result. Is this a bad case of misalignment? Thank you very much! Best wishes, Jinchuan Tang
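For a host-side reference of the matrix-B route described above, here is a minimal NumPy sketch (it only shows the expected per-column result, not how to configure VkFFT):

import numpy as np

H, W, N = 8, 2, 9                      # A is 8x2 (row-major), FFT length 9 per column
A = np.ones((H, W), dtype=np.complex64)

# explicit matrix B: 9x2, with A in the top rows and zeros in the last row
B = np.zeros((N, W), dtype=np.complex64)
B[:H, :] = A
ref = np.fft.fft(B, axis=0)            # length-9 FFT of each column of B

# numpy can also pad implicitly, like Octave/Matlab fft(A, N):
ref2 = np.fft.fft(A, n=N, axis=0)
assert np.allclose(ref, ref2)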

DTolm commented 2 years ago

Dear Jinchuan Tang,

So, am I correct that you want to do batching in the inner-most dimension (the one where elements are consecutive)? Currently, VkFFT only supports this sort of batching with a hack: you initialize a 2D FFT and omit the first dimension with omitDimension[0]=1. Then you do zero-padding on the second dimension as you described. This should work for your A case. Feel free to contact me if it doesn't work.

I will add proper support for this layout in one of the next updates.

Best regards, Dmitrii Tolmachev

tangjinchuan commented 2 years ago

Dear Dmitrii Tolmachev, Thank you very much for your quick response. I am very looking forward to next updates. It would be great if VKFFT can also support (i)fft2/(i)fft3 regarding the layout so that can be mapped efficiently to support the functions similar to Octave/Matlab where, for fft(A,N): " If N is larger than the dimension along which the FFT is calculated, then X is resized and padded with zeros." and, for fft2(A, M,N): "The optional arguments M and N may be used specify the number of rows and columns of A to use. If either of these is larger than the size of A, A is resized and padded with zeros." If this is too much to ask, I guess I will use the route with matrix B as mentioned earlier. I tried the following but no luck (100% due to my bad understanding on how to use 2D VKFFT): VkFFTConfiguration configuration = {};//All members that are not initialized explicitly are zero-initialized. VkFFTApplication app = {}; configuration.FFTdim = 2; //FFT dimension, 1D, 2D or 3D configuration.omitDimension[0]=1;// configuration.size[1] = N; //FFT size configuration.performR2C = 1; configuration.doublePrecision=0; //out-of-place - we need to specify that input buffer is separate from the main buffer configuration.isInputFormatted = 1;// //configuration.isOutputFormatted = 1; configuration.numberBatches = batchSize; //configuration.inputBufferStride[0] = dim1();//configuration.size[0];// default if no specified; configuration.inputBufferStride[1] = dim1(); configuration.bufferStride[0] = cH; configuration.performZeropadding[1] = 1;// configuration.fft_zeropad_left[1] = 8;// //configuration.fft_zeropad_right[0] =10;// configuration.fft_zeropad_right[1] = 10;// Best wishes, Jinchuan Tang