Closed fredRos closed 7 years ago
A useful list of tricks for reducing OS-related noise to get accurate timings: https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/linuxtips.md
The timing output by Amplifier seems unreliable. With 1 thread, code in libfftw3 takes 1.7 s; with 8 threads, it's 0.001 s. Doesn't make sense! To create a bar chart for the paper, I should output the time elapsed in individual operations for both the GPU and the CPU. On the CPU, I would use omp_get_wtime(). On the GPU, use the CUDA events already in the code, wrapped in macros to avoid copying code as much as possible, or this C++ struct: https://github.com/aramadia/udacity-cs344/blob/master/Unit2%20Code%20Snippets/gputimer.h
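A minimal sketch of the macro idea on the CPU side, so that the timing code is written once and compiles away when not needed. The names ScopedTimer, TIME_SCOPE and SKETCH_TIMING are hypothetical and not taken from the code base, and std::chrono::steady_clock stands in for omp_get_wtime() so the example builds without OpenMP. A GPU variant would record CUDA events in the constructor and destructor instead.

```cpp
#include <chrono>
#include <cstdio>
#include <string>
#include <utility>

// Hypothetical helper: starts a wall clock on construction and
// prints the elapsed time for its label on destruction.
struct ScopedTimer {
    std::string label;
    std::chrono::steady_clock::time_point start;

    explicit ScopedTimer(std::string l)
        : label(std::move(l)), start(std::chrono::steady_clock::now()) {}

    // Seconds elapsed since construction.
    double elapsed() const {
        return std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - start)
            .count();
    }

    ~ScopedTimer() { std::printf("%s: %.6f s\n", label.c_str(), elapsed()); }
};

// Hypothetical switch: with SKETCH_TIMING defined, TIME_SCOPE times the
// enclosing scope; otherwise it expands to a no-op.
#ifdef SKETCH_TIMING
#define TIME_SCOPE(label) ScopedTimer scoped_timer_instance(label)
#else
#define TIME_SCOPE(label) ((void)0)
#endif

// Example operation to be timed: sums the integers 0..n-1.
double sum_up_to(int n) {
    TIME_SCOPE("sum_up_to");
    double total = 0.0;
    for (int i = 0; i < n; ++i) total += i;
    return total;
}
```

Putting TIME_SCOPE("fft") at the top of each operation would then give one line of output per operation, which is what the bar chart needs.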
We now have fine-grained timing output, enabled by cmake -DGALARIO_TIMING=1. Timing of memory allocations and transfers is still highly variable on the GPU.
Not really a problem but posting the results here for reference
Test problem
tools
use Intel amplifier_xe to identify basic hotspots on
serial
So the major hotspot is applying the phase. Understandable because the FFT scales as O(n log(n)) and the phase as O(n^2). The default version works in the Cartesian plane because that's what FFTW expects. It seemed like applying a phase in polar coordinates would be easier, but it turns out it takes longer due to the conversion, so we should stick with what we have.
comparison: polar takes longer!
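To make the comparison concrete, here is a rough sketch of the two ways to rotate one complex value by a phase theta. The function names are hypothetical and this is not the actual kernel. It only illustrates why the polar route costs more: the data arrive in Cartesian form, so the polar version has to convert to modulus and argument and back again.

```cpp
#include <cmath>
#include <complex>

// Cartesian: multiply (re, im) by (cos theta, sin theta) directly.
// Cost per element: one sin, one cos, four multiplications, two additions.
std::complex<double> apply_phase_cartesian(std::complex<double> z, double theta) {
    const double c = std::cos(theta);
    const double s = std::sin(theta);
    return {z.real() * c - z.imag() * s, z.real() * s + z.imag() * c};
}

// Polar: the phase shift itself is a single addition, but the input must
// first be converted to (r, phi) and the result converted back, which
// adds a hypot and an atan2 on top of the sin and cos.
std::complex<double> apply_phase_polar(std::complex<double> z, double theta) {
    const double r = std::abs(z);    // modulus
    const double phi = std::arg(z);  // argument via atan2
    return std::polar(r, phi + theta);
}
```

Both functions return the same value up to rounding, so the choice between them is purely about cost.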