Open rainwoodman opened 9 years ago
This is a brief version of an r2c/c2r benchmark program.
There are still issues, but I can already observe that with a 384x384x384 mesh PFFT is about 10% slower than FFTW when running on a single rank.
I am not quite sure how to fix the large FFTW errors. It also needs a command-line flag to add PADDED support.
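For context on the PADDED remark: FFTW's documented layout for in-place r2c transforms pads the last dimension of the real array to 2*(n/2+1) elements so that the complex output fits in the same buffer. The sketch below only computes those shapes for the serial FFTW layout. The helper names are hypothetical, the distributed (MPI) local sizes are not modelled, and whether padding is what causes the large FFTW errors reported here has not been established.

```python
# Hypothetical helpers illustrating the in-place r2c padding rule
# (serial FFTW layout only; MPI-distributed local sizes are not modelled).

def complex_shape(n):
    """Shape of the complex output of an r2c transform of logical shape n."""
    *head, last = n
    return (*head, last // 2 + 1)

def padded_real_shape(n):
    """Shape of the real array needed for an in-place r2c transform."""
    *head, last = n
    # Two reals per complex value along the last (halved) dimension.
    return (*head, 2 * (last // 2 + 1))

n = (384, 384, 384)
print("logical real shape:", n)
print("padded real shape: ", padded_real_shape(n))
print("complex shape:     ", complex_shape(n))
```

For the 384x384x384 mesh used here, the padded real array has a last dimension of 386 rather than 384, so a benchmark that fills or compares the buffer assuming 384 contiguous reals per row would be reading the padding as data.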
[yfeng1@waterfall tests]$ ./bench_r2c -pfft_n 384 384 384 -pfft_cmp_fftw -pfft_inplace -pfft_patience 0 -pfft_destroy_input
******************************************************************************************************
* Computation of loops=1 parallel forward and backward FFTs (change with -pfft_loops *)
* for n[0] x n[1] x n[2] = 384 x 384 x 384 Fourier coefficients (change with -pfft_n * * *)
* on np[0] x np[1] x np[2] = 1 x 1 x 1 processes (change with -pfft_np * * *)
* with:
* - non-transposed data layout (change with -pfft_transposed)
* - non-verbose output (change with -pfft_verbose)
* - in-place transforms (change with -pfft_inplace)
* - disabled decomposition comparison (change with -pfft_cmp_decomp)
* - enabled FFTW comparison (change with -pfft_cmp_fftw)
* - disabled comparison of all planner flags (change with -pfft_cmp_flags)
* - disabled output of internal PFFT timer (change with -pfft_timer)
* - pfft_flags = PFFT_ESTIMATE | PFFT_NO_TUNE | PFFT_DESTROY_INPUT
*   (change with [-pfft_patience 0|1|2|3] [-pfft_tune] [-pfft_destroy_input])
*******************************************************************************************************
!!! Warning: inplace transforms do not support DESTROY_INPUT flag !!!

* PFFT runtimes (1d data decomposition):
Flags: PFFT_NO_TUNE, PFFT_ESTIMATE, PFFT_DESTROY_INPUT,
tune_forw = 2.58e-03; tune_back = 2.56e-03, exec_forw/loops = 1.34e+00, exec_back/loops = 1.35e+00
error = 6.44e-14

* FFTW_MPI runtimes (1d data decomposition):
Flags: FFTW_ESTIMATE, FFTW_PRESERVE_INPUT
tune_forw = 2.89e-03; tune_back = 1.21e-04, exec_forw/loops = 1.21e+00, exec_back/loops = 1.21e+00
error = 9.48e+02
Flags: FFTW_MEASURE, FFTW_PRESERVE_INPUT
tune_forw = 1.34e+01; tune_back = 1.13e-04, exec_forw/loops = 9.63e-01, exec_back/loops = 9.61e-01
error = 9.48e+02

* serial FFTW runtimes (no data decomposition at all):
Flags: FFTW_ESTIMATE, FFTW_PRESERVE_INPUT
tune_forw = 1.26e-04; tune_back = 7.99e-05, exec_forw/loops = 9.62e-01, exec_back/loops = 9.62e-01
error = 9.48e+02
Flags: FFTW_MEASURE, FFTW_PRESERVE_INPUT
tune_forw = 1.29e-04; tune_back = 8.11e-05, exec_forw/loops = 9.61e-01, exec_back/loops = 9.64e-01
error = 9.48e+02
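To make the "about 10%" figure concrete, here is a small sketch that computes the relative slowdown of the PFFT forward transform against each FFTW variant, using only the exec_forw/loops values printed in the run above. The function name and the dictionary labels are illustrative. The numbers come from a single run with loops=1, so they should be read as indicative only.

```python
# Relative slowdown of PFFT vs. each FFTW variant, using the
# exec_forw/loops values copied from the benchmark output above.

def slowdown_percent(t, t_ref):
    """How much slower t is than t_ref, in percent."""
    return 100.0 * (t / t_ref - 1.0)

pfft_forw = 1.34  # PFFT, PFFT_ESTIMATE | PFFT_NO_TUNE

fftw_forw = {
    "FFTW_MPI, FFTW_ESTIMATE": 1.21,
    "FFTW_MPI, FFTW_MEASURE": 0.963,
    "serial FFTW, FFTW_ESTIMATE": 0.962,
    "serial FFTW, FFTW_MEASURE": 0.961,
}

for name, t_ref in fftw_forw.items():
    print(f"{name}: PFFT is {slowdown_percent(pfft_forw, t_ref):.1f}% slower")
```

Against FFTW_MPI with FFTW_ESTIMATE the gap is roughly 11%, which matches the figure quoted in the description. Against FFTW_MPI with FFTW_MEASURE and against serial FFTW it is closer to 39%, although PFFT was run with PFFT_ESTIMATE here, so that comparison is not like for like.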