mpicbg-scicomp / gearshifft_publication

Publication Manuscript for results obtained with gearshifft
Other
0 stars 1 forks source link

Non power of 2 transforms #36

Closed psteinb closed 7 years ago

tdd11235813 commented 7 years ago

thanks for your work again, this is another important issue. I just thought about the explanation why powerof2 performs better than others. From that paragraph it sounds like it would be just due to a caching issue or memory hierarchy issue. In the motivation section the shapes are defined as well as a very coarse explanation is given how FFTs usually work. Maybe a hint to the fundamental differences is helpful, i.e. that different algorithms applied, between radix-2 and radix-r shapes (I have focused on Cooley Tukey implementation). One source is just a talk you can find here (slides 5-6). Another source could be an nvprof output:

// extent=2^5=32
void vectorRadix2_kernel<float>(vectorRadix2_st<float>) // used kernel to solve FFT and iFFT
// extent=3^3=27
void spRadix0027B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=6, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>)
void spRadix0027B::kernel1Mem<unsigned int, float, fftDirection_t=1, unsigned int=32, unsigned int=6, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>)

This most likely refers to split-radix implementation mentioned in the slides before.

I also wanted to profile clfft, but noticed: "The support for command-line profiler using the environment variable COMPUTE_PROFILE has been dropped in the CUDA 8.0 release." :(

psteinb commented 7 years ago

first of all, thanks for reviewing this PR. Let me separate my replies based on topic:

Finally, we should mention that there are many FFTs entirely distinct from Cooley-Tukey. Three notable such algorithms are the prime-factor algorithm for gcd(n_1 , n_2 ) = 1 [27,page 619], along with Rader’s [28] and Bluestein’s [27], [29] algorithms for prime n. FFTW implements the first two in its codelet generator for hard-coded n (Section VI) and the latter two for general prime n.

If I find the time today, I'll look into that.

psteinb commented 7 years ago

I updated this PR with a new stab at the interpretation of the plots

tdd11235813 commented 7 years ago

thanks, sounds good, I'll check it within my proof reading part now, but you can continue with writing, or let me know when to merge. I will come up with a separate PR after the merge and proof reading.

psteinb commented 7 years ago

ok, then I'll merge now and continue writing. thanks a bunch -