jipolanco / PencilFFTs.jl

Fast Fourier transforms of MPI-distributed Julia arrays
https://jipolanco.github.io/PencilFFTs.jl/dev/
MIT License

Performance overhead when running on a single process #35

Open · opened 3 years ago by musoke

musoke commented 3 years ago

I have found some odd benchmark results when using PencilFFTs with a single process. It's unclear to me whether this is expected behavior.

The following script compares FFTW and PencilFFTs with three different FFTW flags.

using BenchmarkTools
using FFTW
using LinearAlgebra
using MPI
using PencilFFTs
using Random

function fft_inplace!(a, b, fft_plan)
    mul!(b, fft_plan, a)
end

function runtests()
    suite = BenchmarkGroup()

    if !MPI.Initialized()
        MPI.Init()
    end

    n = 128
    dims = (n, n, n)

    suite["PencilFFTs"] = BenchmarkGroup()
    for (flag_name, flags) in [
                           ("FFTW.ESTIMATE", FFTW.ESTIMATE),
                           ("FFTW.MEASURE", FFTW.MEASURE),
                           ("FFTW.PATIENT", FFTW.PATIENT),
                          ]
        @info flag_name

        # Get the MPI communicator and number of processes (a single process in this test).
        comm = MPI.COMM_WORLD
        Nproc = MPI.Comm_size(comm)

        # Let MPI_Dims_create choose the decomposition.
        proc_dims = let pdims = zeros(Int, 2)
            MPI.Dims_create!(Nproc, pdims)
            pdims[1], pdims[2]
        end

        transform = Transforms.FFT()

        fft_plan = PencilFFTPlan(dims, transform, proc_dims, comm; fftw_flags=flags)

        # Allocate and initialise input data, and apply transform.
        a_pen = allocate_input(fft_plan)
        b_pen = allocate_output(fft_plan)
        randn!(a_pen)
        randn!(b_pen)

        suite["PencilFFTs"][flag_name] = @benchmarkable fft_inplace!($a_pen, $b_pen, $fft_plan)
    end

    suite["FFTW"] = BenchmarkGroup()
    for (flag_name, flags) in [
                           ("FFTW.ESTIMATE", FFTW.ESTIMATE),
                           ("FFTW.MEASURE", FFTW.MEASURE),
                           ("FFTW.PATIENT", FFTW.PATIENT),
                          ]
        a_arr = Array{ComplexF64}(undef, dims)
        b_arr = Array{ComplexF64}(undef, dims)
        rand!(a_arr)
        rand!(b_arr)

        fft_plan = plan_fft(a_arr; flags=flags)

        suite["FFTW"][flag_name] = @benchmarkable fft_inplace!($a_arr, $b_arr, $fft_plan)
    end

    @show tune!(suite)

    results = run(suite)

    @show res_min = minimum(results)
end

runtests()

Typical output for me is

tune!(suite) = 2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "PencilFFTs" => 3-element BenchmarkTools.BenchmarkGroup:
      tags: []
      "FFTW.PATIENT" => Benchmark(evals=1, seconds=5.0, samples=10000)
      "FFTW.ESTIMATE" => Benchmark(evals=1, seconds=5.0, samples=10000)
      "FFTW.MEASURE" => Benchmark(evals=1, seconds=5.0, samples=10000)
  "FFTW" => 3-element BenchmarkTools.BenchmarkGroup:
      tags: []
      "FFTW.PATIENT" => Benchmark(evals=1, seconds=5.0, samples=10000)
      "FFTW.ESTIMATE" => Benchmark(evals=1, seconds=5.0, samples=10000)
      "FFTW.MEASURE" => Benchmark(evals=1, seconds=5.0, samples=10000)
res_min = minimum(results) = 2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "PencilFFTs" => 3-element BenchmarkTools.BenchmarkGroup:
      tags: []
      "FFTW.PATIENT" => TrialEstimate(68.955 ms)
      "FFTW.ESTIMATE" => TrialEstimate(69.221 ms)
      "FFTW.MEASURE" => TrialEstimate(69.830 ms)
  "FFTW" => 3-element BenchmarkTools.BenchmarkGroup:
      tags: []
      "FFTW.PATIENT" => TrialEstimate(37.748 ms)
      "FFTW.ESTIMATE" => TrialEstimate(75.944 ms)
      "FFTW.MEASURE" => TrialEstimate(38.887 ms)

With FFTW.ESTIMATE, FFTW and PencilFFTs are on par. However, while using FFTW.MEASURE or FFTW.PATIENT makes FFTW roughly twice as fast, PencilFFTs is unaffected.

At first it seemed that this might be a problem with how the flags are passed to FFTW, but I don't see an obvious bug there. The correct flags do seem to be passed through to _make_1d_fft_plan.

Perhaps FFTW does some clever optimization of 3D FFTs that isn't possible when the transform is done one axis at a time?
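
One rough way to test this hypothesis is to compare a single 3D FFTW plan against per-axis 1D plans applied in sequence, which is closer to what a pencil decomposition has to do. The following is a minimal sketch (not part of the original report) using only FFTW:

using BenchmarkTools
using FFTW
using LinearAlgebra

n = 128
a = rand(ComplexF64, n, n, n)
b = similar(a)  # output buffer
c = similar(a)  # extra scratch for the axis-by-axis version

# A single plan for the full 3D transform, and one 1D plan per axis.
p3d = plan_fft(a; flags=FFTW.MEASURE)
p1d = ntuple(d -> plan_fft(a, d; flags=FFTW.MEASURE), 3)

@btime mul!($b, $p3d, $a)      # full 3D plan

@btime begin                   # same transform, one axis at a time
    mul!($b, $(p1d[1]), $a)    # FFT along dimension 1
    mul!($c, $(p1d[2]), $b)    # ... then dimension 2
    mul!($b, $(p1d[3]), $c)    # ... then dimension 3 (result ends up in b)
end

Since the FFT is separable, both versions compute the same transform; any timing gap would then reflect how FFTW handles a fused 3D plan versus separate strided 1D passes, with no PencilFFTs machinery involved.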

jipolanco commented 3 years ago

Thanks for noticing this issue.

After doing some tests, it seems that the performance issue is not related to the FFTW flags, which are properly passed, but to the overhead caused by data copies and by local data transpositions (basically permutedims!).

Here is an updated example showing this:

using BenchmarkTools
using FFTW
using LinearAlgebra
using MPI
using PencilFFTs
using Random
using TimerOutputs
using PencilArrays

TimerOutputs.enable_debug_timings(PencilFFTs)
TimerOutputs.enable_debug_timings(PencilArrays)
TimerOutputs.enable_debug_timings(Transpositions)

function fft_inplace!(a, b, fft_plan)
    mul!(b, fft_plan, a)
end

function runtests()
    suite = BenchmarkGroup()

    if !MPI.Initialized()
        MPI.Init()
    end

    n = 128
    dims = (n, n, n)

    timers = Dict{String,TimerOutput}()

    suite["PencilFFTs"] = BenchmarkGroup()
    for (flag_name, flags) in [
                           ("FFTW.ESTIMATE", FFTW.ESTIMATE),
                           ("FFTW.MEASURE", FFTW.MEASURE),
                           # ("FFTW.PATIENT", FFTW.PATIENT),
                          ]
        @info "PencilFFTs -- $flag_name"

        # Get the MPI communicator and number of processes (a single process in this test).
        comm = MPI.COMM_WORLD
        Nproc = MPI.Comm_size(comm)

        # Let MPI_Dims_create choose the decomposition.
        proc_dims = let pdims = zeros(Int, 2)
            MPI.Dims_create!(Nproc, pdims)
            pdims[1], pdims[2]
        end

        transform = Transforms.FFT()

        fft_plan = PencilFFTPlan(
            dims, transform, proc_dims, comm;
            fftw_flags = flags,
            permute_dims = Val(true),
        )

        timers[flag_name] = timer(fft_plan)

        # Allocate and initialise input data, and apply transform.
        a_pen = allocate_input(fft_plan)
        b_pen = allocate_output(fft_plan)
        randn!(a_pen)
        randn!(b_pen)

        suite["PencilFFTs"][flag_name] = @benchmarkable fft_inplace!($a_pen, $b_pen, $fft_plan) seconds=1.0 samples=100
    end

    suite["FFTW"] = BenchmarkGroup()
    for (flag_name, flags) in [
                           ("FFTW.ESTIMATE", FFTW.ESTIMATE),
                           ("FFTW.MEASURE", FFTW.MEASURE),
                           # ("FFTW.PATIENT", FFTW.PATIENT),
                          ]
        @info "FFTW -- $flag_name"
        a_arr = Array{ComplexF64}(undef, dims)
        b_arr = Array{ComplexF64}(undef, dims)
        rand!(a_arr)
        rand!(b_arr)

        # FFTW.set_num_threads(1)
        fft_plan = plan_fft(a_arr; flags=flags)

        suite["FFTW"][flag_name] = @benchmarkable fft_inplace!($a_arr, $b_arr, $fft_plan) seconds=1.0 samples=100
    end

    @show tune!(suite)

    for to in values(timers)
        reset_timer!(to)
    end

    results = run(suite)

    @show minimum(results)

    println(timers["FFTW.ESTIMATE"])

    nothing
end

runtests()

On my machine, the BenchmarkTools output gives the following:

  "PencilFFTs" => 2-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "FFTW.ESTIMATE" => TrialEstimate(63.527 ms)
          "FFTW.MEASURE" => TrialEstimate(60.602 ms)
  "FFTW" => 2-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "FFTW.ESTIMATE" => TrialEstimate(25.541 ms)
          "FFTW.MEASURE" => TrialEstimate(24.178 ms)

which suggests that the FFTW flags are not the issue. This is confirmed by the TimerOutputs timings:

 ───────────────────────────────────────────────────────────────────────────────
                                        Time                   Allocations      
                                ──────────────────────   ───────────────────────
        Tot / % measured:            7.83s / 13.3%            451KiB / 34.9%    

 Section                ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────────────
 PencilFFTs mul!            11    1.04s   100%  94.8ms    157KiB  100%   14.3KiB
   transpose!               22    564ms  54.1%  25.6ms   20.0KiB  12.7%     930B
     unpack data            22    336ms  32.2%  15.3ms   8.22KiB  5.23%     383B
       copy_permuted!       22    336ms  32.2%  15.3ms   7.39KiB  4.70%     344B
     pack data              22    228ms  21.9%  10.4ms   1.52KiB  0.96%    70.5B
       copy_range!          22    228ms  21.8%  10.4ms     0.00B  0.00%    0.00B
   FFT                      33    478ms  45.8%  14.5ms     0.00B  0.00%    0.00B
   MPI.Waitall!             22    172μs  0.02%  7.83μs   4.47KiB  2.84%     208B
 ───────────────────────────────────────────────────────────────────────────────

Half of the time is spent copying data around and permuting dimensions via permutedims!. It should be possible to avoid the unnecessary overhead when running on a single process...
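
For a rough sense of scale (a sketch on my side, not measured in the thread), the cost of one local transposition alone can be estimated by timing a bare permutedims! on arrays of the same size, which is essentially what the copy_permuted! step reduces to on a single process:

using BenchmarkTools

n = 128
src = rand(ComplexF64, n, n, n)
dst = similar(src)

# One in-memory transposition of a 128³ complex array; the table above
# attributes a per-call cost of this order to transpose!/copy_permuted!.
@btime permutedims!($dst, $src, (2, 1, 3))

Since the plan above is built with permute_dims = Val(true), comparing against permute_dims = Val(false) might also help separate the cost of the permutation itself from that of the plain copies.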

musoke commented 3 years ago

Ok, this makes sense, thanks for the explanation!

I was misled by the times with FFTW.ESTIMATE coincidentally being the same.
