psteinb closed this 7 years ago
First of all, thanks for reviewing this PR. Let me separate my replies by topic:
"Why powerof2 performs better than others": I am still debating with myself whether we should dive into the explanations too deeply. My biggest concern is time: the deadline is Friday, just two days ahead. On the other hand, this is what a benchmark can be used for as well, namely to study the performance profile of the entire library/application as an integrated system.
I like the idea of using nvprof to try to find the source quickly. However, without seeing the source code and profiling it in detail, we will not find any decisive answers. On top of that, the function naming might give us a hint in favor of my one-off hypothesis (the name suggests a vector length of 27, which clearly won't fit a cache line) or evidence against it (the template parameters contain a lot of 32s, which might hint at padded processing to match cache line lengths on the GPU). It really depends on the implementation, so I don't want to tilt at windmills here.
The link to the MIT talk: a very good read, thanks!
Concentrating on Cooley-Tukey is fine. The 2005 FFTW paper, however, clearly says that FFTW uses non-Cooley-Tukey implementations to a large extent:
"Finally, we should mention that there are many FFTs entirely distinct from Cooley-Tukey. Three notable such algorithms are the prime-factor algorithm for gcd(n_1, n_2) = 1 [27, page 619], along with Rader's [28] and Bluestein's [27], [29] algorithms for prime n. FFTW implements the first two in its codelet generator for hard-coded n (Section VI) and the latter two for general prime n."
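For readers unfamiliar with Bluestein's approach mentioned in that quote, here is a minimal NumPy sketch (my own illustration, not FFTW's actual implementation) of how a DFT of arbitrary, e.g. prime, length can be rewritten as a convolution that is then evaluated with power-of-two FFTs:

```python
import numpy as np

def bluestein_fft(x):
    """DFT of arbitrary length n via Bluestein's chirp-z trick.

    Uses nk = (n^2 + k^2 - (k - n)^2) / 2 to turn the DFT sum into a
    convolution, computed circularly at a padded power-of-two length.
    """
    n = len(x)
    k = np.arange(n)
    w = np.exp(-1j * np.pi * k * k / n)   # chirp: w_k = e^{-i pi k^2 / n}
    a = x * w
    # pad to a power of two >= 2n - 1 so the circular convolution is exact
    m = 1 << (2 * n - 1).bit_length()
    b = np.zeros(m, dtype=complex)
    b[:n] = np.conj(w)                    # b_j = e^{+i pi j^2 / n}
    b[-(n - 1):] = np.conj(w[1:][::-1])   # wrap negative indices around
    conv = np.fft.ifft(np.fft.fft(a, m) * np.fft.fft(b))
    return w * conv[:n]
```

For a prime length such as 7, `bluestein_fft(x)` agrees with `np.fft.fft(x)` even though only power-of-two transforms were used internally, which is exactly why prime sizes cost noticeably more than power-of-two sizes.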
If I find the time today, I'll look into that.
I updated this PR with a new stab at the interpretation of the plots.
Thanks, sounds good. I'll check it as part of my proofreading now, but you can continue writing, or let me know when to merge. I will come up with a separate PR after the merge and proofreading.
Ok, then I'll merge now and continue writing. Thanks a bunch!
Thanks for your work again, this is another important issue. I was just thinking about the explanation of why powerof2 performs better than the others. From that paragraph it sounds as if it were merely a caching or memory-hierarchy issue. In the motivation section the shapes are defined and a very coarse explanation is given of how FFTs usually work. Maybe a hint at the fundamental differences would be helpful, i.e. that different algorithms are applied for radix-2 versus radix-r shapes (I have focused on the Cooley-Tukey implementation). One source is a talk you can find here (slides 5-6). Another source could be an nvprof output:
This most likely refers to split-radix implementation mentioned in the slides before.
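To make the radix-2 vs. radix-r distinction concrete, here is a textbook radix-2 Cooley-Tukey recursion in NumPy (an illustration only; cuFFT's and clFFT's actual kernels are far more elaborate). The even/odd split it relies on only exists when the length is a power of two, which is one fundamental reason powerof2 shapes get the fast path:

```python
import numpy as np

def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return np.asarray(x, dtype=complex)
    # split into even- and odd-indexed halves and recurse on each
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    # twiddle factors combine the two half-size transforms
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])
```

Non-power-of-two sizes need mixed-radix or entirely different algorithms (Rader, Bluestein), with different constant factors, which would explain performance differences beyond caching alone.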
I also wanted to profile clFFT, but noticed: "The support for command-line profiler using the environment variable COMPUTE_PROFILE has been dropped in the CUDA 8.0 release." :(