Open musoke opened 3 years ago
Thanks for noticing this issue.
After doing some tests, it seems that the performance issue is not related to the FFTW flags, which are properly passed, but to the overhead caused by data copies and by local data transpositions (basically permutedims!).
Here is an updated example showing this:
using BenchmarkTools
using FFTW
using LinearAlgebra
using MPI
using PencilFFTs
using Random
using TimerOutputs
using PencilArrays
TimerOutputs.enable_debug_timings(PencilFFTs)
TimerOutputs.enable_debug_timings(PencilArrays)
TimerOutputs.enable_debug_timings(Transpositions)
function fft_inplace!(a, b, fft_plan)
    mul!(b, fft_plan, a)
end

function runtests()
    suite = BenchmarkGroup()

    if !MPI.Initialized()
        MPI.Init()
    end

    n = 128
    dims = (n, n, n)

    timers = Dict{String,TimerOutput}()

    suite["PencilFFTs"] = BenchmarkGroup()
    for (flag_name, flags) in [
            ("FFTW.ESTIMATE", FFTW.ESTIMATE),
            ("FFTW.MEASURE", FFTW.MEASURE),
            # ("FFTW.PATIENT", FFTW.PATIENT),
        ]
        @info "PencilFFTs -- $flag_name"
        comm = MPI.COMM_WORLD
        Nproc = MPI.Comm_size(comm)
        # Let MPI_Dims_create choose the decomposition.
        proc_dims = let pdims = zeros(Int, 2)
            MPI.Dims_create!(Nproc, pdims)
            pdims[1], pdims[2]
        end
        transform = Transforms.FFT()
        fft_plan = PencilFFTPlan(
            dims, transform, proc_dims, comm;
            fftw_flags = flags,
            permute_dims = Val(true),
        )
        timers[flag_name] = timer(fft_plan)
        # Allocate and initialise input and output data.
        a_pen = allocate_input(fft_plan)
        b_pen = allocate_output(fft_plan)
        randn!(a_pen)
        randn!(b_pen)
        suite["PencilFFTs"][flag_name] =
            @benchmarkable fft_inplace!($a_pen, $b_pen, $fft_plan) seconds=1.0 samples=100
    end

    suite["FFTW"] = BenchmarkGroup()
    for (flag_name, flags) in [
            ("FFTW.ESTIMATE", FFTW.ESTIMATE),
            ("FFTW.MEASURE", FFTW.MEASURE),
            # ("FFTW.PATIENT", FFTW.PATIENT),
        ]
        @info "FFTW -- $flag_name"
        a_arr = Array{ComplexF64}(undef, dims)
        b_arr = Array{ComplexF64}(undef, dims)
        rand!(a_arr)
        rand!(b_arr)
        # FFTW.set_num_threads(1)
        fft_plan = plan_fft(a_arr; flags=flags)
        suite["FFTW"][flag_name] =
            @benchmarkable fft_inplace!($a_arr, $b_arr, $fft_plan) seconds=1.0 samples=100
    end

    @show tune!(suite)

    for to in values(timers)
        reset_timer!(to)
    end

    results = run(suite)
    @show minimum(results)
    println(timers["FFTW.ESTIMATE"])

    nothing
end

runtests()
On my machine, the BenchmarkTools output gives the following:
"PencilFFTs" => 2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "FFTW.ESTIMATE" => TrialEstimate(63.527 ms)
  "FFTW.MEASURE" => TrialEstimate(60.602 ms)
"FFTW" => 2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "FFTW.ESTIMATE" => TrialEstimate(25.541 ms)
  "FFTW.MEASURE" => TrialEstimate(24.178 ms)
which suggests that the FFTW flags are not the issue. This is confirmed by the TimerOutputs timings:
───────────────────────────────────────────────────────────────────────────────
Time Allocations
────────────────────── ───────────────────────
Tot / % measured: 7.83s / 13.3% 451KiB / 34.9%
Section ncalls time %tot avg alloc %tot avg
───────────────────────────────────────────────────────────────────────────────
PencilFFTs mul! 11 1.04s 100% 94.8ms 157KiB 100% 14.3KiB
transpose! 22 564ms 54.1% 25.6ms 20.0KiB 12.7% 930B
unpack data 22 336ms 32.2% 15.3ms 8.22KiB 5.23% 383B
copy_permuted! 22 336ms 32.2% 15.3ms 7.39KiB 4.70% 344B
pack data 22 228ms 21.9% 10.4ms 1.52KiB 0.96% 70.5B
copy_range! 22 228ms 21.8% 10.4ms 0.00B 0.00% 0.00B
FFT 33 478ms 45.8% 14.5ms 0.00B 0.00% 0.00B
MPI.Waitall! 22 172μs 0.02% 7.83μs 4.47KiB 2.84% 208B
───────────────────────────────────────────────────────────────────────────────
Half of the time is spent copying data around and permuting dimensions via permutedims!. It should be possible to avoid the unnecessary overhead when running on a single process...
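For intuition, the cost of a local transposition can be measured in isolation. The following is a minimal sketch in plain Julia, not the actual PencilFFTs code path; the array size and permutation are only chosen to mimic the spirit of the copy_range! / copy_permuted! steps above:

```julia
using BenchmarkTools

# Compare a contiguous copy with a strided dimension permutation of the
# same data, roughly analogous to the pack/unpack steps in transpose!.
n = 128
src = rand(ComplexF64, n, n, n)
dst = similar(src)
dst_perm = Array{ComplexF64}(undef, n, n, n)  # target for the permuted copy

@btime copyto!($dst, $src)                       # straight memory copy
@btime permutedims!($dst_perm, $src, (2, 3, 1))  # cache-unfriendly permutation
```

On typical hardware the permuted copy is several times slower than the contiguous one, which is consistent with the large transpose! share in the timings above.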
Ok, this makes sense, thanks for the explanation!
I was misled by the FFTW.ESTIMATE timings coincidentally being the same.
On Tue, 22 Jun 2021, 05:02 Juan Ignacio Polanco, @.***> wrote:
I have found some odd benchmarks when using PencilFFTs with a single process. It's unclear to me whether this is expected behavior.
The following script compares FFTW and PencilFFTs with three different FFTW flags.
Typical output for me is
With FFTW.ESTIMATE, FFTW and PencilFFTs are on par. However, while using FFTW.MEASURE or FFTW.PATIENT makes FFTW itself twice as fast, PencilFFTs is unaffected.
At first it seemed that this might be a problem with how the flags are passed to FFTW, but I don't see an obvious bug there. The correct flags do seem to be passed through to _make_1d_fft_plan.
Perhaps FFTW does some clever optimization of 3D FFTs that isn't possible when the transform is done one axis at a time?
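That hypothesis is easy to probe in serial. The sketch below uses only plain FFTW.jl (it is an illustrative analogue, not the PencilFFTs code path) to compare a single fused 3D plan against three 1D plans applied one dimension at a time:

```julia
using FFTW
using LinearAlgebra  # for mul!
using Random

n = 128
a = Array{ComplexF64}(undef, n, n, n)

p3d = plan_fft(a; flags = FFTW.MEASURE)
# One 1D plan per dimension, as in a pencil decomposition.
p1 = plan_fft(a, 1; flags = FFTW.MEASURE)
p2 = plan_fft(a, 2; flags = FFTW.MEASURE)
p3 = plan_fft(a, 3; flags = FFTW.MEASURE)

rand!(a)  # fill after planning: FFTW.MEASURE may overwrite the array
b = similar(a)

mul!(b, p3d, a)           # fused 3D transform
c = p3 * (p2 * (p1 * a))  # same transform, one dimension at a time
c ≈ b                     # results agree; the timings need not
```

Timing the two expressions with BenchmarkTools would show how much of the gap is attributable to FFTW fusing the three passes versus the extra memory traffic of per-dimension transforms.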