Test against cudnn? - Githubissues

I'm curious what GPU performance you get against something like the cudnn wrappers of NNlibCUDA.jl. Those would be more appropriate comparisons than CuArray broadcasting.

Thanks, Chris, for your suggestion and your interest in ParallelStencil. We would be very happy to do some comparison against the cudnn wrappers. Do you have an example using cudnn that is similar to one of our ParallelStencil examples (e.g. heat diffusion, shallow ice...) which could serve as a good basis in order to create a maximally equivalent code for comparison?

BTW: Comparing against CuArray broadcasting makes also sense for us as because it is the standard way for doing array computations in Julia and it enables writing code syntactically very similar to codes written with ParallelStencil.

BTW: Comparing against CuArray broadcasting makes also sense for us as because it is the standard way for doing array computations in Julia and it enables writing code syntactically very similar to codes written with ParallelStencil.

I see that, but by far and away the most common way to do stencil calculations with CuArrays (and CUDA in general) is cudnn, so it's the baseline benchmark someone would look for. I don't have a nice testing code handy, but the docs are here:

https://fluxml.ai/Flux.jl/stable/models/nnlib/#NNlib.conv

You'd create the Laplacian weight matrix and send it to that.

We do not have any experience with cudnn - so, we will most certainly not be able to create an example using cudnn in its most performant way as needed for a meaningful comparison. Would you be willing to create such an example and we would do then the corresponding ParallelStencil example to compare against? If so, how about the shallow ice example - it is nonlinear and meaningful, but in the same time very simple and concise. Moreover, we should just compare the time for one iteration, else too many factors come into play (and we don't want to compare apples and oranges).

IIRC it's just: w = [0 1 0; 1 -4 1; 0 1 0], and then conv(A,w) would be the 2D laplacian. Then the 3D is just using a 3D weight vector. @chriselrod also has some stencil tooling in LoopVectorization.jl worth mentioning. @chriselrod did you test against cudnn yet?

Moreover, we should just compare the time for one iteration, else too many factors come into play (and we don't want to compare apples and oranges).

Yes, just comparing one iteration makes sense. Also, you might need to go a little lower into the wrapper to find the in-place conv operation.

IIRC it's just: w = [0 1 0; 1 -4 1; 0 1 0], and then conv(A,w) would be the 2D laplacian. Then the 3D is just using a 3D weight vector. @chriselrod also has some stencil tooling in LoopVectorization.jl worth mentioning. @chriselrod did you test against cudnn yet?

LoopVectorization is CPU-only, but the cross-over point would be interesting. I guess the way to test this would be JuliaHub?

I'll definitely have benchmarks for training a few different (small) neural networks with NNlibCPU vs NNlibGPU, to get an idea of where the crossing-over point is for at least a couple different CPUs and GPUs.

Yeah I was wondering on the CPU doing tests against LoopVectorization (since NNPACK is pretty bad) and GPU doing the tests vs cudnn. I think just doing the 3D Laplacian in all cases on 128x128x128 sounds reasonable to me. I'm swapping out to a new desktop in the next week so I won't be able to run it for a bit though.

Then the 3D is just using a 3D weight vector.

Do you mean something other than?

julia> [0;0;0;;0;1;0;;0;0;0;;;0;1;0;;1;-6;1;;0;1;0;;;0;0;0;;0;1;0;;0;0;0] # Julia 1.7 syntax
3×3×3 Array{Int64, 3}:
[:, :, 1] =
 0  0  0
 0  1  0
 0  0  0

[:, :, 2] =
 0   1  0
 1  -6  1
 0   1  0

[:, :, 3] =
 0  0  0
 0  1  0
 0  0  0

(Or alternatively, the 27-point stencil? With the former "7-point" version, we may also want to just hard code it and loop.

For 128x128x128, we should probably also try a range of batch sizes. Although there wouldn't be a point if the GPU is already faster with a batch size of 1.

I don't have a usable GPU, so I'd probably have to run that part on arctic.

(Or alternatively, the 27-point stencil? With the former "7-point" version, we may also want to just hard code it and loop.

A 7-point stencil should be fine to start with. Maybe one could do both 2D and 3D. Domain resolution may be adapted upwards depending on what's needed to saturate the memory bandwidth (mainly on the GPU).

ParallelStencil implements Threads as CPU backend. So one could indeed compare LoopVectorzation.jl vs it, i.e. comparing Threads vs @avx instructions. Although, I saw there is some news like @avx thread=true which, if ready, could be tried out in ParallelStencil at some point.

I think just doing the 3D Laplacian in all cases

Yes, Laplacian (2D and 3D) should be fine. Would be relevant thought to have the tests done with variable (e.g. diffusion) coefficient though as closer to real applications.

Here are a couple benchmarks.

First, 3d laplace on 128x128x128x1x1 -> 126x126x126x1x1:

using LoopVectorization, CUDA, NNlib, NNlibCUDA, BenchmarkTools
function laplace_sparse!(
  out::AbstractArray{<:Any,3}, img::AbstractArray{<:Any,3}
)
  @tturbo for j₁ ∈ axes(out,1), j₂ ∈ axes(out,2), j₃ ∈ axes(out,3)
    s_0 =        img[j₁,     j₂ + 1, j₃ + 1]
    s_1 = s_0 +  img[j₁ + 2, j₂ + 1, j₃ + 1]
    s_2 = s_1 +  img[j₁ + 1, j₂,     j₃ + 1]
    s_3 = s_2 +  img[j₁ + 1, j₂ + 2, j₃ + 1]
    s_4 = s_3 +  img[j₁ + 1, j₂ + 1, j₃    ]
    s_5 = s_4 +  img[j₁ + 1, j₂ + 1, j₃ + 2]
    s_6 = s_5 - 6img[j₁ + 1, j₂ + 1, j₃ + 1]
    out[j₁, j₂, j₃] = s_6
  end
end

laplace_kern = reshape(cat(
  Float32[0 0 0; 0  1 0; 0 0 0],
  Float32[0 1 0; 1 -6 1; 0 1 0],
  Float32[0 0 0; 0  1 0; 0 0 0];
  dims=3
), (3,3,3,1,1));
img3d = rand(Float32, 128, 128, 128, 1, 1);
dcdlaplace = NNlib.DenseConvDims(img3d, laplace_kern, flipkernel = true);

img3d_squeezed = reshape(img3d, size(img3d)[1:3]);
out3d = Array{Float32}(undef, size(img3d_squeezed) .+ 1 .- size(laplace_kern)[1:3]);

cuimg3d = CuArray(img3d);
cuout3d = CuArray{Float32}(undef, size(img3d,1)+1-size(laplace_kern,1), size(img3d,2)+1-size(laplace_kern,2), size(img3d,3)+1-size(laplace_kern,3), size(laplace_kern,5), size(img3d,5));
culaplace_kern = CuArray(laplace_kern);

@time laplace_sparse!(out3d, img3d_squeezed);
CUDA.@time NNlib.conv!(cuout3d, cuimg3d, culaplace_kern, dcdlaplace);

reshape(Array(cuout3d), size(out3d)) ≈ out3d

@benchmark laplace_sparse!($out3d, $img3d_squeezed)
@benchmark CUDA.@sync NNlib.conv!($cuout3d, $cuimg3d, $culaplace_kern, $dcdlaplace)

Running on arctic3, I get:

julia> @benchmark laplace_sparse!($out3d, $img3d_squeezed)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  129.043 μs … 591.036 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     139.638 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   139.897 μs ±  10.116 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                 ▃▇█▇▄▂
  ▂▂▃▅▆▅▃▂▂▂▂▃▃▅▇██████▇▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▁▁▁▂▁▂▁▁▂▂▂▁▁▂ ▃
  129 μs           Histogram: frequency by time          166 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark CUDA.@sync NNlib.conv!($cuout3d, $cuimg3d, $culaplace_kern, $dcdlaplace)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  211.893 μs …  5.860 ms  ┊ GC (min … max): 0.00% … 88.05%
 Time  (median):     276.628 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   278.146 μs ± 56.924 μs  ┊ GC (mean ± σ):  0.19% ±  0.88%

  ▁ ▂                     ▁  ▁  ▁▁▂▅▄▅█▇▄▅▅▂▁▅▃▂     ▁▁        ▂
  █▅█▇▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁█▅▃██▄█████████████████▆▆▅▆██▆▅▄▄▃▁█ █
  212 μs        Histogram: log(frequency) by time       317 μs <

 Memory estimate: 3.52 KiB, allocs estimate: 138.

julia> @benchmark laplace_sparse!($out3d, $img3d_squeezed)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):   68.660 μs …  2.308 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      87.495 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   101.762 μs ± 36.557 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▇▆▄    ▂█                        ▁▅▃  ▃▁
  ▂▄████▇▅▄▃███▂▂▂▂▂▂▂▃▄▂▂▂▂▂▁▂▂▁▂▂▁▂▃███▃▄██▄▂▂▂▂▂▁▂▂▁▂▁▁▁▁▂▂ ▃
  68.7 μs         Histogram: frequency by time          167 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Not sure why the CPU time is so inconsistent from run to run, I get either, switching between results like the first vs the second from run to run.

Running the CPU part on my 10980XE yields much more consistent results:

julia> @benchmark laplace_sparse!($out3d, $img3d_squeezed)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  36.103 μs … 195.001 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     38.425 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   38.488 μs ±   1.829 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                      ▁▂▂▃▄▅██▆▇▇▆▆▅▅▅▄▃▂▁▁
  ▁▁▁▁▂▂▂▂▃▃▃▃▄▄▅▅▆▆▆▇██████████████████████▇▆▆▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁
  36.1 μs            Histogram: frequency by time            41.1 μs (top 1%)

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark laplace_sparse!($out3d, $img3d_squeezed)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  36.041 μs … 197.193 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     38.378 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   38.405 μs ±   1.809 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                         ▁▁▂▄▃▄▅█████▇▇▇▇▅▅▄▃▃▂▁▁
  ▁▁▁▁▁▂▂▂▃▃▃▃▃▄▄▄▄▅▅▅▆█▇████████████████████████▇▇▆▅▅▄▃▃▃▃▃▂▂▂▂▁▂▂▁
  36 μs              Histogram: frequency by time            40.5 μs (top 1%)

 Memory estimate: 0 bytes, allocs estimate: 0.

The benchmark I'd been using for convolutions is a 2d, with a 5x5 kernel mapping 3 inputs to 6 outputs, using 260x260 -> 256x256 images:

julia> dcd
DenseConvDims: (260, 260, 3) * (5, 5) -> (256, 256, 6), stride: (1, 1), pad: (0, 0, 0, 0), dil: (1, 1), flip: true

Script:

using NNlib, Polyester, LoopVectorization, Static

function kernaxes(::DenseConvDims{2,K,C_in, C_out}) where {K,C_in, C_out} # LoopVectorization can take advantage of static size information
  K₁ =  StaticInt(1):StaticInt(K[1])
  K₂ =  StaticInt(1):StaticInt(K[2])
  Cᵢₙ =  StaticInt(1):StaticInt(C_in)
  Cₒᵤₜ = StaticInt(1):StaticInt(C_out)
  (K₁, K₂, Cᵢₙ, Cₒᵤₜ)
end
function convlayer!(
  out::AbstractArray{<:Any,4}, img, kern,
  dcd::DenseConvDims{2, <:Any, <:Any, <:Any, (1, 1), (0, 0, 0, 0), (1, 1), true}
)
  @batch for d ∈ axes(out,4)
    (K₁, K₂, Cᵢₙ, Cₒᵤₜ) = kernaxes(dcd)
    for o ∈ Cₒᵤₜ
      @turbo for j₁ ∈ axes(out,1), j₂ ∈ axes(out,2)
        s = zero(eltype(out))
        for k₁ ∈ K₁, k₂ ∈ K₂, i ∈ Cᵢₙ
          s += img[j₁ + k₁ - 1, j₂ + k₂ - 1, i, d] * kern[k₁, k₂, i, o]
        end
        out[j₁, j₂, o, d] = s
      end
    end
  end
end
img = rand(Float32, 260, 260, 3, 100);
kern = rand(Float32, 5, 5, 3, 6);
out1 = Array{Float32}(undef, size(img,1)+1-size(kern,1), size(img,2)+1-size(kern,2), size(kern,4), size(img,4));
out2 = similar(out1);
dcd = NNlib.DenseConvDims(img, kern, flipkernel = true);
@time NNlib.conv!(out1, img, kern, dcd);
@time convlayer!(out2, img, kern, dcd);
out1 ≈ out2

cuout = CuArray(out1);
cuimg = CuArray(img);
cukern = CuArray(kern);
CUDA.@time NNlib.conv!(cuout, cuimg, cukern, dcd);

Array(cuout) ≈ out2

@benchmark NNlib.conv!($out1, $img, $kern, $dcd)
@benchmark convlayer!($out1, $img, $kern, $dcd)
@benchmark CUDA.@sync NNlib.conv!($cuout, $cuimg, $cukern, $dcd)

Results on arctic3:

julia> @benchmark NNlib.conv!($out1, $img, $kern, $dcd)
BechmarkTools.Trial: 13 samples with 1 evaluations.
 Range (min … max):  364.162 ms … 476.280 ms  ┊ GC (min … max): 0.00% … 17.16%
 Time  (median):     384.447 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   409.685 ms ±  43.502 ms  ┊ GC (mean ± σ):  6.24% ±  8.39%

  ▁    ▁▁ ▁▁▁▁     ▁    ▁                              ▁  ▁   █▁
  █▁▁▁▁██▁████▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█▁▁▁█ ▁
  364 ms           Histogram: frequency by time          476 ms <

 Memory estimate: 375.01 MiB, allocs estimate: 117.

julia> @benchmark convlayer!($out1, $img, $kern, $dcd)
BechmarkTools.Trial: 474 samples with 1 evaluations.
 Range (min … max):   7.617 ms … 72.980 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      8.559 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.492 ms ±  4.004 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆      █▅  ▃                                           ▁▅▅▄
  █▆▆▄▁▄███▁██▅▅▄▁▁▁▄▁▁▁▁▁▁▁▁▁▄▄▁▁▄▄▁▁▄▁▄▄▄▇▅▁▁▄▁▁▁▁▁▁▁▄▁████ ▇
  7.62 ms      Histogram: log(frequency) by time      14.5 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark CUDA.@sync NNlib.conv!($cuout, $cuimg, $cukern, $dcd)
BechmarkTools.Trial: 667 samples with 1 evaluations.
 Range (min … max):  5.050 ms …   9.009 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.497 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.491 ms ± 278.196 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                               █▇
  ▂▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▃▄▃▃██▂▆▃▁▂▁▂▁▂▂▂▂ ▂
  550 ms          Histogram: frequency by time         8.2 ms <

 Memory estimate: 64.44 KiB, allocs estimate: 4043.

CPU results on my 10980XE:

julia> @benchmark NNlib.conv!($out1, $img, $kern, $dcd)
BechmarkTools.Trial: 26 samples with 1 evaluations.
 Range (min … max):  189.316 ms … 208.082 ms  ┊ GC (min … max): 1.68% … 0.00%
 Time  (median):     197.510 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   197.727 ms ±   4.325 ms  ┊ GC (mean ± σ):  0.49% ± 0.73%

  ▁   ▁      ▁ ▁ ▁▁     ▁▁ ██ ▁ █ ▁ ▁▁ ▁▁ ▁▁▁         ▁    ▁        ▁
  █▁▁▁█▁▁▁▁▁▁█▁█▁██▁▁▁▁▁██▁██▁█▁█▁█▁██▁██▁███▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁█
  189 ms             Histogram: frequency by time              208 ms (top 1%)

 Memory estimate: 337.51 MiB, allocs estimate: 105.

julia> @benchmark convlayer!($out1, $img, $kern, $dcd)
BechmarkTools.Trial: 925 samples with 1 evaluations.
 Range (min … max):  5.295 ms …  5.903 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.398 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.395 ms ± 38.166 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                            ▁▃▂▃▁▃▂▁▃▂▄▄▃▅█▅█▃▃▇▁▂  ▂▂
  ▂▂▂▂▃▁▂▃▃▄▅▄▅▄▄▆▄▅██▆▅▇▆█████████████████████████▇███▅▅▆▆▅▄▄▄▃▃▂
  5.29 ms           Histogram: frequency by time           5.47 ms (top 1%)

 Memory estimate: 0 bytes, allocs estimate: 0.

At these sizes, the CPUs seem much faster than the GPU.

julia> CUDA.device()
CuDevice(0): Tesla T4

julia> CUDA.device()
CuDevice(0): Tesla T4

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz

Arctic seems to be a dual socket system, with 2x Xeon Silver 4114s (based on Sys.CPU_THREADS === 40). The Xeon silver can do one 512 bit FMA/cycle. It has up to 6 memory channels, so 12 total between the two sockets. The 10980XE is a single socket, 18 core CPU. Its clock speed is much higher, and it is capable of two 512 bit FMA/cycle. It has only 4 memory channels.

julia> versioninfo()
Julia Version 1.6.2-pre.2
Commit ff1827d117* (2021-05-02 02:37 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz

The Tesla T4 has 2560 "cores", and sells for over twice as much as the 10980XE.

You forgot ParallelStencil.jl haha. The code should be something like:

using ParallelStencil.FiniteDifferences3D
#(...)
@parallel function diffusion3D_step!(T2, T, Ci, lam)
    @inn(T2) = (lam*@inn(Ci)*(@d2_xi(T) + @d2_yi(T) + @d2_zi(T)));
    return
end

I'm omitting the dx^2,dy^2,dz^2 to make it match.

Yes, Laplacian (2D and 3D) should be fine. Would be relevant thought to have the tests done with variable (e.g. diffusion) coefficient though as closer to real applications.

Many stencil compilers like cudnn only allow constant coefficients. I assumed that would be the limitation here as well, but if that wasn't specialized on then cudnn would probably be a lot faster but only apply to more limited scenarios. That is probably worth mentioning then as a comparison in the docs.

using ParallelStencil, ParallelStencil.FiniteDifferences3D
@init_parallel_stencil(Threads, Float64, 3);
@parallel function diffusion3D_step!(T2, T)
    @inn(T2) = @d2_xi(T) + @d2_yi(T) + @d2_zi(T)
    return
end

out3d_pad = similar(img3d_squeezed);
@time @parallel diffusion3D_step!(out3d_pad, img3d_squeezed)

@time laplace_sparse!(out3d, img3d_squeezed); 

subaxes = map(x -> UnitRange(x) .+ 1, axes(out3d))
out3d ≈ view(out3d_pad, subaxes...)

On the 10980XE:

julia> @benchmark @parallel diffusion3D_step!($out3d_pad, $img3d_squeezed)
BechmarkTools.Trial: 6145 samples with 1 evaluations.
 Range (min … max):  797.289 μs … 1.032 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     809.144 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   809.137 μs ± 6.676 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                   ▂  ▁▃▁▂▂▂▂  ▃▆█▅▂
  ▁▁▁▁▁▂▃▃▂▁▂▂▂▂▃▅██▇▇███████▇███████████▇▅▅▄▃▃▂▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁
  797 μs            Histogram: frequency by time             823 μs (top 1%)

 Memory estimate: 9.08 KiB, allocs estimate: 110.

julia> @benchmark laplace_sparse!($out3d, $img3d_squeezed)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  35.854 μs … 189.335 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     37.263 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   37.368 μs ±   1.675 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                ▃▃▅▇▆▇██▇▇▇▅▆▅▅▄▄▅▃▄▃▃▁▁
  ▁▁▁▁▁▁▂▂▃▄▄▆████████████████████████████▇▇▆▅▅▄▄▃▃▃▂▃▂▂▂▂▂▂▁▂▂▁▁▂▁▁
  35.9 μs            Histogram: frequency by time            39.4 μs (top 1%)

 Memory estimate: 0 bytes, allocs estimate: 0.

So @tturbo is about 20x faster.

On the T4 GPU:

using ParallelStencil, ParallelStencil.FiniteDifferences3D
@init_parallel_stencil(CUDA, Float64, 3);

cuout3d_pad = CuArray(out3d_pad);
cuimg3d = CuArray(img3d_squeezed);
@time @parallel diffusion3D_step!(cuout3d_pad, cuimg3d)

out3d ≈ Array(view(cuout3d_pad, subaxes...))
@benchmark @parallel diffusion3D_step!($cuout3d_pad, $cuimg3d)

Results:

julia> @benchmark @parallel diffusion3D_step!($cuout3d_pad, $cuimg3d) # GPU
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  110.729 μs … 215.262 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     154.285 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   154.545 μs ±  10.290 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                 ▇▁▅█
  ▃▂▂▂▂▂▁▁▂▁▂▁▁▁▂▃▂▂▁▁▁▁▂▁▁▁▁▁▃▃▆████▇▆▄▃▃▂▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂ ▃
  111 μs           Histogram: frequency by time          189 μs <

 Memory estimate: 7.67 KiB, allocs estimate: 195.

julia> @benchmark laplace_sparse!($out3d, $img3d_squeezed) # @tturbo, CPU
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  69.342 μs …  4.140 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     89.844 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   94.171 μs ± 70.249 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▃▁     ▁█     ▄▇▂
  ▂▃▅███▅▃▃▄▃██▆▃▄▃▆███▅▃▂▂▂▃▃▃▂▂▂▁▁▁▂▂▂▁▁▂▂▂▂▂▃▆▆▃▂▂▂▅▆▅▃▂▂▂ ▃
  69.3 μs         Histogram: frequency by time         140 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

So about a 2x improvement over the NNlib version (which had unused batch and output dims), and much faster than ParallelStencil on the CPU, but much slower than @tturbo: >3x slower than the 10980XE and >1.5x slower than the dual-socket 4114.

Maybe @DhairyaLGandhi knows a switch on NNlib we should try here. But yeah, it looks promising on GPU but much less so on CPU.

Having it optionally apply @inbounds would probably help on the CPU. But the branches may prevent SIMD anyway:

if var"##ix, ParallelStencil#264" <= size(T2, 1) - 2 && (var"##iy, ParallelStencil#265" <= size(T2, 2) - 2 && var"##iz, ParallelStencil#266" <= size(T2, 3) - 2)
    #= /home/chriselrod/.julia/packages/ParallelStencil/0VULM/src/parallel.jl:103 =#
    T2[var"##ix, ParallelStencil#264" + 1, var"##iy, ParallelStencil#265" + 1, var"##iz, ParallelStencil#266" + 1] = ((T[(var"##ix, ParallelStencil#264" + 1) + 1, var"##iy, ParallelStencil#265" + 1, var"##iz, ParallelStencil#266" + 1] - T[var"##ix, ParallelStencil#264" + 1, var"##iy, ParallelStencil#265" + 1, var"##iz, ParallelStencil#266" + 1]) - (T[var"##ix, ParallelStencil#264" + 1, var"##iy, ParallelStencil#265" + 1, var"##iz, ParallelStencil#266" + 1] - T[(var"##ix, ParallelStencil#264" + 1) - 1, var"##iy, ParallelStencil#265" + 1, var"##iz, ParallelStencil#266" + 1])) + ((T[var"##ix, ParallelStencil#264" + 1, (var"##iy, ParallelStencil#265" + 1) + 1, var"##iz, ParallelStencil#266" + 1] - T[var"##ix, ParallelStencil#264" + 1, var"##iy, ParallelStencil#265" + 1, var"##iz, ParallelStencil#266" + 1]) - (T[var"##ix, ParallelStencil#264" + 1, var"##iy, ParallelStencil#265" + 1, var"##iz, ParallelStencil#266" + 1] - T[var"##ix, ParallelStencil#264" + 1, (var"##iy, ParallelStencil#265" + 1) - 1, var"##iz, ParallelStencil#266" + 1])) + ((T[var"##ix, ParallelStencil#264" + 1, var"##iy, ParallelStencil#265" + 1, (var"##iz, ParallelStencil#266" + 1) + 1] - T[var"##ix, ParallelStencil#264" + 1, var"##iy, ParallelStencil#265" + 1, var"##iz, ParallelStencil#266" + 1]) - (T[var"##ix, ParallelStencil#264" + 1, var"##iy, ParallelStencil#265" + 1, var"##iz, ParallelStencil#266" + 1] - T[var"##ix, ParallelStencil#264" + 1, var"##iy, ParallelStencil#265" + 1, (var"##iz, ParallelStencil#266" + 1) - 1]))
end

And yeah, I wouldn't be surprised if I misused CUDNN here / if there's a way to tell it about all those size(img,i) == 1 dimensions.

Thanks for reporting - interesting results. Looks like you did the ParallelStencil tests using Float64 while you have Float32 arrays for all other tests - or did I overlooked something?

One could also give a try to the @parallel_indices kernel formulation:

using BenchmarkTools, ParallelStencil, ParallelStencil.FiniteDifferences3D
@init_parallel_stencil(Threads, Float32, 3)
@parallel_indices (ix,iy,iz) function diffusion3D_step2!(T2::Data.Array, T::Data.Array)
    T2[ix,iy,iz] = T[ix+1,iy,iz] + T[ix-1,iy,iz] + 
                   T[ix,iy+1,iz] + T[ix,iy-1,iz] + 
                   T[ix,iy,iz+1] + T[ix,iy,iz-1] - 
                   6.0*T[ix,iy,iz]
    return
end
A = ones(Float32, 128, 128, 128)
B = similar(A)
@benchmark @parallel (2:size(A,1)-1, 2:size(A,2)-1, 2:size(A,3)-1) diffusion3D_step!($B, $A)

Having it optionally apply @inbounds would probably help on the CPU.

Yes, ParallelStencil does not include by default @inbounds on the kernel calculations. The codes are supposed to be launched using --check-bounds=no or to have @inbounds added.

For CPU execution, it could be interesting to see the perf one gets adding @tturbo instead of Threads.@threads in the "loop-replacing" macro e.g. here.

Thanks for reporting - interesting results. Looks like you did the ParallelStencil tests using Float64 while you have Float32 arrays for all other tests ?

Oops. Wasn't paying attention to the @init_parallel_stencil definition. Starting Julia with 18 threads and --checkbounds=no:

julia> @benchmark @parallel (2:size($A,1)-1, 2:size($A,2)-1, 2:size($A,3)-1) diffusion3D_step2!($B, $A)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):   88.396 μs …  4.841 ms  ┊ GC (min … max): 0.00% … 95.27%
 Time  (median):     132.150 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   119.220 μs ± 50.733 μs  ┊ GC (mean ± σ):  0.39% ±  0.95%

             ▃▅                                             ▂█▂
  ▂▂▂▂▃▃▄▅▄▃▄██▄▃▃▃▃▂▂▂▂▃▃▃▃▃▂▂▃▃▂▂▂▂▂▂▂▂▁▂▂▁▂▁▂▁▁▁▂▂▂▂▃▄▄▅▇███▅▄▄▄▃
  88.4 μs            Histogram: frequency by time             141 μs (top 1%)

 Memory estimate: 8.16 KiB, allocs estimate: 91.

julia> out3d = Array{Float32}(undef, size(A) .+ 1 .- 3);

julia> @benchmark laplace_sparse!($out3d, $A)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  35.796 μs … 200.787 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     37.115 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   37.193 μs ±   1.745 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                   ▂▃▄▄▆▇█▇██▇▆▇▅▅▄▂▁▁
  ▂▁▂▂▂▂▂▃▃▃▃▄▅▆▇▇█████████████████████▇▆▅▅▄▄▄▃▃▃▂▃▃▂▃▂▂▃▃▂▃▂▃▂▃▂▂▂▂
  35.8 μs            Histogram: frequency by time            39.1 μs (top 1%)

 Memory estimate: 0 bytes, allocs estimate: 0.

It's slower with 36 threads (the CPU has 18 cores/36 threads). Threads.@threads also tends to cause highly erratic timings, so that the median and mean are much slower than the minimum.

For CPU execution, it could be interesting to see the perf one gets adding

That'll require some more changes.

julia> @parallel_indices (ix,iy,iz) function diffusion3D_step2!(T2::Data.Array, T::Data.Array)
           T2[ix,iy,iz] = T[ix+1,iy,iz] + T[ix-1,iy,iz] +
                          T[ix,iy+1,iz] + T[ix,iy-1,iz] +
                          T[ix,iy,iz+1] + T[ix,iy,iz-1] -
                          6.0*T[ix,iy,iz]
           return
       end
ERROR: LoadError: Unrecognized loop range type: var"##ranges, ParallelStencil.ParallelKernel#257"[3].
Stacktrace:
 [1] register_single_loop!(ls::LoopVectorization.LoopSet, looprange::Expr)
   @ LoopVectorization ~/.julia/packages/LoopVectorization/o15wm/src/modeling/graphs.jl:930

LoopVectorization wants loops with simple indices.

CUDA performance:

julia> @benchmark @parallel (2:size($cuA,1)-1, 2:size($cuA,2)-1, 2:size($cuA,3)-1) diffusion3D_step2!($cuB, $cuA)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  87.434 μs … 473.461 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     90.375 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   93.133 μs ±   9.758 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▃▆█▅▂▃      ▁▃       ▂         ▂ ▁▁              ▂          ▁
  ████████▆▅▄▅▅██▇▆▄▄▄▁██▆▆▅▄▁▄▁▁▄█▇██▅▅▃▃▁▃▁▁▁▃▁▃▁▆█▆▅▄▁▃▁▁▁█ █
  87.4 μs       Histogram: log(frequency) by time       138 μs <

 Memory estimate: 1.55 KiB, allocs estimate: 83.

Cool.

It's slower with 36 threads (the CPU has 18 cores/36 threads).

Yeah, somehow multi-threading perf is most optimal for number of threads being matching cores rather than threads.

That'll require some more changes.

What if one tries something like this (here without ParallelStencil):

function diffusion3D_step3!(T2::Data.Array, T::Data.Array)
    @tturbo for iz=2:size(T,3)-1
        for iy=2:size(T,2)-1
            for ix=2:size(T,1)-1
                T2[ix,iy,iz] = T[ix+1,iy,iz] + T[ix-1,iy,iz] + 
                               T[ix,iy+1,iz] + T[ix,iy-1,iz] + 
                               T[ix,iy,iz+1] + T[ix,iy,iz-1] -
                               6.0*T[ix,iy,iz]
            end
        end
    end
    return
end

One could also make a global index and thus a single index loop...

CUDA performance:

Nice. Problem may be a bit small for GPU though... Also, 3D stencil on GPU may need more advance implementation (like 2.5 D blocking) to get full perf...

I'm looking into why it's slower at the moment:

julia> @benchmark diffusion3D_step3!($B,$A)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  46.454 μs … 227.213 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     47.727 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   47.884 μs ±   2.009 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▂▂▅▅▆█▇▇▆▅▃▃▁
  ▁▁▁▁▂▂▂▄▄▅▇███████████████▇▆▅▄▃▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▁▁▂▁▂▁▂▁▁▁
  46.5 μs            Histogram: frequency by time            50.8 μs (top 1%)

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark laplace_sparse!($out3d,$A)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  35.687 μs … 190.651 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     37.186 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   37.298 μs ±   1.728 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▁▂▄▄▇▆▇█▅▆▆▅▅▅▄▄▅▃▃▃▄▃▃▂▂▁
  ▁▁▁▁▂▂▂▃▄▅▆▇███████████████████████████▆▇▆▆▆▅▄▃▃▃▃▂▂▂▂▂▂▁▂▁▂▂▁▁▁▁▁
  35.7 μs            Histogram: frequency by time            39.6 μs (top 1%)

 Memory estimate: 0 bytes, allocs estimate: 0.

If you loop at the definition of laplace_sparse! above, they're basically the same, the only notable difference being that its out3d is smaller:

julia> out3d = Array{Float32}(undef, size(A) .+ 1 .- 3);

julia> B = similar(A);

Which seems to be the cause, i.e. if we make out3d a view of the larger B, suddenly laplace_sparse! slows down to diffusion3D_step3!'s performance:

julia> out3d_view = view(B, 2:127, 2:127, 2:127);

julia> @benchmark laplace_sparse!($out3d_view,$A)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  45.549 μs … 230.740 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     46.609 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   46.759 μs ±   1.989 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▁▂▄▄▇████▇▆▅▄▃▂▁
  ▁▁▂▁▂▃▄▆▇████████████████▇▇▆▅▄▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▁▁▂▁▁▂▁▁▁▁▁
  45.5 μs            Histogram: frequency by time            49.5 μs (top 1%)

 Memory estimate: 0 bytes, allocs estimate: 0.

My working theory is that it may have something to do with certain powers of 2 being magic slow numbers.

On the theory that the extra size is causing it to run out of cache,

The difference in size between out3d and B is less than 5%, (128 / 126)^3 = 1.0483789047659038.
The sizeof(A) + sizeof(B) is getting close to the total size of my CPU's L2 cache (1 MiB/core 18 cores). `sizeof(Float32) 2 * 128^3 / (2^20) == 16. While the total L2 cache size needed is going to be a bit greater than the combined size ofAandB(because cores need overlapping parts ofA`), it shouldn't be by such a large factor.

The bigger strike against the "too much memory" theory is if I shrink or increase the size of A, their performance matches, and we get better FLOPS than we did for the size(A) == (128,128,128) case, and performance is similar between all three @tturbos:

julia> A = ones(Float32, 126, 126, 126);

julia> B = similar(A);

julia> out3d = Array{Float32}(undef, size(A) .+ 1 .- 3);

julia> out3d_view = view(B, 2:size(B,1)-1, 2:size(B,2)-1, 2:size(B,3)-1);

julia> @benchmark diffusion3D_step3!($B,$A)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  26.384 μs … 168.274 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     27.583 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   27.679 μs ±   1.512 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  ▂▄▅▇▇█▆▆▆▃▃▂▁
  ▂▁▁▁▂▂▂▂▃▃▄▄▅▆▇███████████████▆▆▅▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂
  26.4 μs            Histogram: frequency by time            29.9 μs (top 1%)

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark laplace_sparse!($out3d_view,$A)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  26.807 μs … 190.863 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     28.499 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   28.685 μs ±   1.852 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  ▁▄▆▆▇█▇▅▆▅▃▃▃▁▁
  ▁▁▁▁▁▁▁▂▂▂▃▄▄▆██████████████████▇▆▆▄▅▄▃▃▃▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁
  26.8 μs            Histogram: frequency by time            31.4 μs (top 1%)

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark laplace_sparse!($out3d,$A)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  27.010 μs … 146.895 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     30.255 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   30.137 μs ±   1.531 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                   ▂▄▅▆████▆▅▃▂
  ▁▁▁▁▂▂▂▂▂▂▂▃▃▂▃▃▃▃▃▃▃▃▄▄▅▅▅▆▆▇▇███████████████▇▅▄▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁
  27 μs              Histogram: frequency by time            32.7 μs (top 1%)

 Memory estimate: 0 bytes, allocs estimate: 0.

bigger:

julia> A = ones(Float32, 130, 130, 130);

julia> B = similar(A);

julia> out3d = Array{Float32}(undef, size(A) .+ 1 .- 3);

julia> out3d_view = view(B, 2:size(B,1)-1, 2:size(B,2)-1, 2:size(B,3)-1);

julia> @benchmark diffusion3D_step3!($B,$A)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  36.041 μs … 194.086 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     37.261 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   37.362 μs ±   1.692 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

               ▂▂▅▇▇▇██▇█▇▅▄▂▁
  ▁▁▁▁▂▂▂▃▃▄▆▇████████████████▇▇▆▄▄▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▂▂▁▁▁▂▁▁▁▁▁
  36 μs              Histogram: frequency by time            39.9 μs (top 1%)

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark laplace_sparse!($out3d_view,$A)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  37.286 μs … 187.904 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     38.715 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   38.837 μs ±   1.659 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

               ▁▂▃▆▆█▇███▅▅▄▃▂▁
  ▁▁▁▁▁▂▂▃▄▅▅▆█████████████████▇▇▆▄▄▄▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▂▁▂▂▁▁▁▂▁▁▁▁▁▁
  37.3 μs            Histogram: frequency by time            41.7 μs (top 1%)

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark laplace_sparse!($out3d,$A)
BechmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  34.597 μs … 207.342 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     35.437 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   35.531 μs ±   1.807 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▂▃▅▇████▇█▅▄▃▂
  ▁▁▁▂▂▂▃▃▅▆▇█████████████████▇▆▅▄▄▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁
  34.6 μs            Histogram: frequency by time            37.4 μs (top 1%)

 Memory estimate: 0 bytes, allocs estimate: 0.

These arrays are larger, but runtimes are the same as for the size(A) == (128,128,128) case.

TLDR: diffusion3D_step3! looks slower than laplace_sparse!, but that seems to be an artifact of the specific size we're benchmarking. They perform the same at other sizes.

Always benchmark at multiple sizes, and not just at power-of-2s.

Also, note that these require the latest release of VectorizationBase and/or LoopVectorization. The 5 arguments to + caused vararg non-specializing heuristics to trigger, making diffusion_3D_step3! perform badly. The latest VB release fixed this by forcing specialization, while the latest LV release fixed it by breaking up the 5-arg call into a chain of 2-arg calls. Hence either being on the latest release fixes the problem.

My working theory is that it may have something to do with certain powers of 2 being magic slow numbers.

From Agner Fog's C++ optimization manual, pages 89-90:

It is useful to know how a cache is organized if you are making programs that have big data structures with non-sequential access and you want to prevent cache contention. You may skip this section if you are satisfied with more heuristic guidelines.

Most caches are organized into lines andsets. Let me explain this with an example. My example is a cache of 8 kb size with a line size of 64 bytes. Each line covers 64 consecutive bytes of memory. One kilobyte is 1024 bytes, so we can calculate that the number of lines is 8*1024/64 = 128. Theselines are organized as 32 sets 4 ways. This means that a particular memory address cannot be loaded into an arbitrary cache line. Only one of the 32 sets can be used, but any of the 4 lines in the set can be used.We can calculate which set of cache lines to use for a particular memory address by the formula: (set) = (memory address) / (line size) % (number of sets). Here, / means integer division with truncation, and % means modulo. For example, if we want to readfrom memory address a= 10000, then we 90have (set) = (10000 / 64) % 32 = 28. This means that amust be read into one of the four cache lines in set number 28. The calculation becomes easier if we use hexadecimal numbers because all the numbers are powers of 2. Using hexadecimal numbers, we have a= 0x2710 and (set) = (0x2710 / 0x40) % 0x20 = 0x1C. Reading or writing a variable from address 0x2710 will cause the cache to load the entire 64 or 0x40 bytes from address 0x2700 to 0x273F into one of the four cache lines from set 0x1C. If the program afterwards reads or writes to any other address in this range then the value is already in the cache so we donot have to wait for another memory access.

Assume that a program reads from address 0x2710 and later reads from addresses 0x2F00, 0x3700, 0x3F00 and 0x4700. These addresses all belong to set number 0x1C. There are only four cache lines in each set. If the cache always chooses the least recently used cache line then the line that covered the address range from 0x2700 to 0x273F will be evicted when weread from 0x4700. Reading again from address 0x2710 will cause a cache miss. But if the program had read from different addresses with different set values then the line containing the address range from 0x2700 to 0x273F would still be in the cache. The problem only occurs because the addresses are spaced a multiple of 0x800 apart. I will call this distance the critical stride. Variables whose distance in memory is a multiple of the critical stride will contend for the same cache lines. The critical stridecan be calculated as (critical stride) = (number of sets) (line size) = (total cache size) / (number of ways).

If a program contains many variables and objects that are scattered around in memory,then there is a risk that several variables happen to be spaced by a multiple of the critical stride and cause contentions in the data cache. The same can happen in the code cache if there are many functions scattered around in program memory. If several functions that are used in the same part of the program happen to be spaced by a multiple of the critical stride then this can cause contentions in the code cache. The subsequent sections describe various ways to avoid these problems.

More details about how caches work can be found in Wikipedia under CPU cache(en.wikipedia.org/wiki/L2_cache).

The details of cache organization for different processors are covered in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".

The 10980XE and Xeon 4114 both have 1 MiB of L2 cache, 16 ways. This means the L2 critical stride is

julia> 2^20 ÷ 16
65536

bytes. LoopVectorization is unrolling the third axis. Successive elements on the third axis, for a 128x128xN dense array of Float32s are...

julia> 128*128*sizeof(Float32)
65536

bytes apart.

Thanks for reporting and interesting to see that the performance is so much sensitive to the array sizes. From experience I know optimal array sizes are critical when doing operations on the GPU, but never experienced much on the CPU side. But all makes sense on CPU as well, as there are certainly optimal numbers that would minimise cache-misses and have optimal cache stride.

Always benchmark at multiple sizes, and not just at power-of-2s.

So in conclusion (correct me if I am wrong):

on the CPU, your LV VB example and ParallelStencil (PS) using @tturbo for the outer loop exhibit similar performance when executed on arrays of same sizes and using latest versions of VB, LV. Convolution approaches are less optimal.
on the GPU, PS performs better than cudnn (factor ~ 14 | 150µs vs 7ms)
Execution using PS shows now somehow lower perf for the GPU compared to the CPU. One could investigate what causes this. It may be an under occupancy of the GPU, not fully optimised implementation, or else. We may want to see what @omlins thinks about all this.

Thanks very much for having done all these investigations @chriselrod !

Thanks @chriselrod for sharing your benchmarks with us!

Here are a couple of thoughts with respect to the cross-over point that you have mentioned:

The problem fits in the L2 cache of the CPU, but not in the GPU caches (https://arxiv.org/pdf/1903.07486.pdf); thus, the benchmarks show in the very essence a comparison of the CPU's L2 cache access speed against the GPU's main memory (meaning GDDR memory) access speed. The benchmarks show in consequence correctly that the CPU is significantly faster in this case.
The result would be the opposite if the problem size was much bigger than the cache sizes, as then it would be in the essence a comparison of the CPU's main memory access speed against the GPU's main memory access speed. In this case the GPU would be much faster.
The cross-over point at which the GPU becomes faster than the CPU will certainly be at a problem size that does not fit anymore into the L2 cache size of the CPU, i.e. when problem size = x * CPU L2 cache size, where x>1. It will be interesting to see how big x will be for different hardware.

Execution using PS shows now somehow lower perf for the GPU compared to the CPU.

This is to be expected for this problem size that fits in the CPU L2 cache but not into the GPU caches - see my comment above. For significantly larger problems, PS on GPU will deliver the best performance.

omlins / ParallelStencil.jl

Test against cudnn? #30