mpicbg-scicomp / gearshifft

Benchmark Suite for Heterogeneous FFT Implementations
Apache License 2.0

Does fftw plan reusage make sense? #57

Open tdd11235813 opened 7 years ago

tdd11235813 commented 7 years ago

FFTW_MEASURE means that FFTW overwrites the input and output buffers during the planning stage, so the buffers can only be filled with data (via memcpy) after planning. Plan reuse means keeping only one plan at a time. For non-FFTW_MEASURE plans (estimate, wisdom) I think it is NOT worthwhile to reuse FFTW plans, as they do not allocate temporary buffers (are we sure?). But it might be worthwhile in terms of memcpy: we could skip the memcpy as long as no padding is required. The input data coming from BenchmarkExecutor is aligned, but not padded with respect to the FFT, so the memcpy would only be required where padding is needed (in-place real-to-complex). Have to look at the results w.r.t. upload and download times.

psteinb commented 7 years ago

Any news on this? I was wondering whether this idea is relevant for using FFTW or for interpreting gearshifft's results.

tdd11235813 commented 7 years ago

To finally give an answer to this, I plotted upload time vs. total time to get the ratio.

(figure: rshiny-fftw-upload-total-ratio)

Upload refers to the memcpy operation, and the timer measured a ~40% contribution to the total solution time in the worst case. But does this really come from the memcpy?

(figure: rshiny-fftw-download-total-ratio)

Download is the same memcpy operation, just in the other direction. It is smooth and fast, with no significant times there. So the long upload time might come from cache warm-up. The memcpy can be avoided when the transform is not an in-place real transform and FFTW is not run with FFTW_MEASURE. Avoiding it would reduce the total runtime by up to 40%, assuming cache warm-up is the only factor responsible for the upload time.

The rshiny tool is going to get an update to examine such statistics. At the moment I do not plan to change the FFTW backend in gearshifft to avoid the memcpy in the aforementioned cases.

psteinb commented 7 years ago

Thanks for the update; interesting findings, I believe.

Are these results from multi-threaded or single-threaded runs? I am asking because it need not be warm-up only; in a multi-threaded scenario it could also be cache-line thrashing.

tdd11235813 commented 7 years ago

True. This is multi-threaded; the single-threaded benchmark is still running on Taurus. Let's see what we get there.