Thread (CPU) Float32/Float64 performance comparison on miniapp acoustic2D

LaurentPlagne commented 1 year ago

Hi,

I am a bit surprised by the performance comparison of miniapp acoustic2D.jl when I switch from original Float64 to Float32 FP precision. I use this settings :

 # Numerics
    nx, ny    = 2047, 2047    # numerical grid resolution; should be a mulitple of 32-1 for optimal GPU perf
    nt        = 1000        # number of timesteps
    nout      = 1000          # plotting frequency

On my mac M1 max I obtain (julia 1.8.5 with 8 threads and no boundcheck)

Float64

Animation directory: ./viz2D_out/
Total steps=1000, time=4.920e+00 sec (@ T_eff = 40.00 GB/s) 
[ Info: Saved animation to /Users/laurentplagne/Projects/ParallelStencil.jl/miniapps/acoustic2D.gif

and with Float32

Animation directory: ./viz2D_out/
Total steps=1000, time=4.534e+00 sec (@ T_eff = 22.00 GB/s) 
[ Info: Saved animation to /Users/laurentplagne/Projects/ParallelStencil.jl/miniapps/acoustic2D.gif

So, basically no speed up at all, although I assumed that this case was strongly memory bound and would have expect Float32 to run twice as fast as Float64.

Last, the comment for the Teff computation seems to be deprecated

  A_eff    = (3*2)/1e9*nx*ny*sizeof(Data.Number)  # Effective main memory access per iteration [GB] (Lower bound of required memory access: H and dHdτ have to be read and written (dHdτ for damping): 4 whole-array memaccess; B has to be read: 1 whole-array memaccess)

Any hint ? P.S. Here is the slithly modified miniapp (with a variable FP)

const USE_GPU = false  # Use GPU? If this is set false, then no GPU needs to be available
using ParallelStencil
using ParallelStencil.FiniteDifferences2D

const FP = Float64

@static if USE_GPU
    @init_parallel_stencil(CUDA, FP, 2)
else
    @init_parallel_stencil(Threads, FP, 2)
end
using Plots, Printf, Statistics

@parallel function compute_V!(Vx::Data.Array, Vy::Data.Array, P::Data.Array, dt::Data.Number, ρ::Data.Number, dx::Data.Number, dy::Data.Number)
    @inn(Vx) = @inn(Vx) - dt/ρ*@d_xi(P)/dx
    @inn(Vy) = @inn(Vy) - dt/ρ*@d_yi(P)/dy
    return
end

@parallel function compute_P!(P::Data.Array, Vx::Data.Array, Vy::Data.Array, dt::Data.Number, k::Data.Number, dx::Data.Number, dy::Data.Number)
    @all(P) = @all(P) - dt*k*(@d_xa(Vx)/dx + @d_ya(Vy)/dy)
    return
end

##################################################
@views function acoustic2D()
    # Physics
    lx, ly    = 40.0, 40.0  # domain extends
    k         = 1.0         # bulk modulus
    ρ         = 1.0         # density
    t         = 0.0         # physical time
    # Numerics
    nx, ny    = 2047, 2047    # numerical grid resolution; should be a mulitple of 32-1 for optimal GPU perf
    nt        = 1000        # number of timesteps
    nout      = 1000          # plotting frequency
    # Derived numerics
    dx, dy    = lx/(nx-1), ly/(ny-1) # cell sizes
    # Array allocations
    P         = @zeros(nx  ,ny  )
    Vx        = @zeros(nx+1,ny  )
    Vy        = @zeros(nx  ,ny+1)
    # Initial conditions
    P        .= Data.Array([exp(-((ix-1)*dx-0.5*lx)^2 -((iy-1)*dy-0.5*ly)^2) for ix=1:size(P,1), iy=1:size(P,2)])
    dt        = min(dx,dy)/sqrt(k/ρ)/4.1
    # Preparation of visualisation
    ENV["GKSwstype"]="nul"; if isdir("viz2D_out")==false mkdir("viz2D_out") end; loadpath = "./viz2D_out/"; anim = Animation(loadpath,String[])
    println("Animation directory: $(anim.dir)")
    X, Y      = -lx/2:dx:lx/2, -ly/2:dy:ly/2
    # Time loop
    for it = 1:nt
        if (it==11)  global wtime0 = Base.time()  end
        @parallel compute_V!(Vx, Vy, P, FP(dt), FP(ρ), FP(dx), FP(dy))
        @parallel compute_P!(P, Vx, Vy, FP(dt), FP(k), FP(dx), FP(dy))
        t = t + dt
        # Visualisation
        if mod(it,nout)==0
            heatmap(X, Y, Array(P)', aspect_ratio=1, xlims=(X[1],X[end]), ylims=(Y[1],Y[end]), c=:viridis, title="Pressure"); frame(anim)
        end
    end
    # Performance
    wtime    = Base.time()-wtime0
    A_eff    = (3*2)/1e9*nx*ny*sizeof(Data.Number)  # Effective main memory access per iteration [GB] (Lower bound of required memory access: H and dHdτ have to be read and written (dHdτ for damping): 4 whole-array memaccess; B has to be read: 1 whole-array memaccess)
    wtime_it = wtime/(nt-10)                        # Execution time per iteration [s]
    T_eff    = A_eff/wtime_it                       # Effective memory throughput [GB/s]
    @printf("Total steps=%d, time=%1.3e sec (@ T_eff = %1.2f GB/s) \n", nt, wtime, round(T_eff, sigdigits=2))
    gif(anim, "acoustic2D.gif", fps = 15)
    return
end

acoustic2D()

luraess commented 1 year ago

This is a long-standing limitation and may certainly relate to https://github.com/JuliaGPU/CUDA.jl/pull/1895

LaurentPlagne commented 1 year ago

I do not use CUDA on this computer (apple silicon) so I am not sure to understand the relation. Anyway I will continue my work with a CUDA enabled computer so this question is not crucial.

luraess commented 1 year ago

I guess the non optimal Float32 performance is a general Julia challenge and may not be strictly relate to only a CUDA.jl thing.

LaurentPlagne commented 1 year ago

I noticed that src/FiniteDifferences.jluses Float64 literals like 8.0,4.0,1.0, 0.5, 0.25, 0.125. Is it possible that the slowdown whith Float32 Numbers is related to conversions happening in the kernels. In this case, it could be beneficial to make the literals more generic (e.g. 0.5 -> inv(eltype($A)(2))

LaurentPlagne commented 1 year ago

After inspecting the execution profile, it seems that these Float64 literals cause no problems. The time is consumed by the getindex calls which increases from Float64to Float32 (which makes no sense to me).

I have to investigate a little bit closer but it seems that the situation is OK (same Teff) for the GPU. I can't help speculating about complicated threading issues (alignment, false sharing/cache invalidation , NUMA effect).

luraess commented 1 year ago

I am not 100% sure but feel like this relates to the issues discussed in https://github.com/JuliaGPU/CUDA.jl/pull/1895 which have reach beyond GPU / CUDA programming. From the discussion there, indexing in Julia is double précision biased and thus it could well be the getindex triggers conversions which slow things down.

LaurentPlagne commented 1 year ago

OK, I do not understand how a CUDA issue could affect my run on CPU. I also have pb to understand how index type are related to FP value type.

I try to run with a single thread with Float32 and the T_eff is still half (4.1GB/s) of what I got with Float64 (8.4 GB/s). Then I tried to modify the src/FiniteDifferences.jl file removing FP64 litterals (see at the EOM). This did not increase the performance (on CPU) but did not decrease it neither. It should be checked on different target but if it is confirmed I think that it is a slight implementation improvement.

src/FiniteDifferences.jl without FP64 litterals

"""
Module FiniteDifferences1D

Provides macros for 1-D finite differences computations. The dimensions x refers to the 1st and only array dimension. The module was designed to enable a math-close notation and, as a consequence, it does not allow for a nested invocation of the provided macros as, e.g., `@d(@d(A))` (instead, one can simply write `@d2(A)`). Additional macros can be straightforwardly added in the caller module to cover eventual missing functionality. The module is intended for usage with the module ParallelStencil.

# Usage
    using ParallelStencil.FiniteDifferences1D

# Macros

###### Differences
- [`@d`](@ref)
- [`@d2`](@ref)

###### Selection
- [`@all`](@ref)
- [`@inn`](@ref)

###### Averages
- [`@av`](@ref)

###### Harmonic averages
- [`@harm`](@ref)

###### Others
- [`@maxloc`](@ref)
- [`@minloc`](@ref)

To see a description of a macro type `?<macroname>` (including the `@`).
"""
module FiniteDifferences1D
export @d, @d2
export @all, @inn
export @av
export @harm
export @maxloc, @minloc
export @within

@doc "`@d(A)`: Compute differences between adjacent elements of `A`." :(@d)
@doc "`@d2(A)`: Compute the 2nd order differences between adjacent elements of `A`." :(@d2)
@doc "`@all(A)`: Select all elements of `A`. Corresponds to `A[:]`." :(@all)
@doc "`@inn(A)`: Select the inner elements of `A`. Corresponds to `A[2:end-1]`" :(@inn)
@doc "`@av(A)`: Compute averages between adjacent elements of `A`." :(@av)
@doc "`@harm(A)`: Compute harmonic averages between adjacent elements of `A`." :(@harm)
@doc "`@maxloc(A)`: Compute the maximum between 2nd order adjacent elements of `A`, using a moving window of size 3." :(@maxloc)
@doc "`@minloc(A)`: Compute the minimum between 2nd order adjacent elements of `A`, using a moving window of size 3." :(@minloc)

import ..ParallelStencil: INDICES, WITHIN_DOC
const ix = INDICES[1]
const ixi = :($ix+1)

macro      d(A::Symbol)  esc(:( $A[$ix+1] - $A[$ix] )) end
macro     d2(A::Symbol)  esc(:( ($A[$ixi+1] - $A[$ixi])  -  ($A[$ixi] - $A[$ixi-1]) )) end
macro    all(A::Symbol)  esc(:( $A[$ix  ] )) end
macro    inn(A::Symbol)  esc(:( $A[$ixi ] )) end
macro     av(A::Symbol)  esc(:(($A[$ix] + $A[$ix+1] )/2 )) end
macro   harm(A::Symbol)  esc(:(1/(1/$A[$ix] + 1/$A[$ix+1])*2 )) end
macro maxloc(A::Symbol)  esc(:( max( max($A[$ixi-1], $A[$ixi+1]), $A[$ixi] ) )) end
macro minloc(A::Symbol)  esc(:( min( min($A[$ixi-1], $A[$ixi+1]), $A[$ixi] ) )) end

@doc WITHIN_DOC
macro within(macroname::String, A::Symbol)
    if     macroname == "@all"  esc(  :($ix<=size($A,1)  )  )
    elseif macroname == "@inn"  esc(  :($ix<=size($A,1)-2)  )
    else error("unkown macroname: $macroname. If you want to add your own assignement macros, overwrite the macro 'within(macroname::String, A::Symbol)'; to still use the exising macro within as well call ParallelStencil.FiniteDifferences{1|2|3}D.@within(macroname, A) at the end.")
    end
end

end # Module FiniteDifferences1D

"""
Module FiniteDifferences2D

Provides macros for 2-D finite differences computations. The dimensions x and y refer to the 1st and 2nd array dimensions, respectively. The module was designed to enable a math-close notation and, as a consequence, it does not allow for a nested invocation of the provided macros as, e.g., `@inn_y(@d_xa(A))` (instead, one can simply write `@d_xi(A)`). Additional macros can be straightforwardly added in the caller module to cover eventual missing functionality. The module is intended for usage with the module ParallelStencil.

# Usage
    using ParallelStencil.FiniteDifferences2D

# Macros

###### Differences
- [`@d_xa`](@ref)
- [`@d_ya`](@ref)
- [`@d_xi`](@ref)
- [`@d_yi`](@ref)
- [`@d2_xi`](@ref)
- [`@d2_yi`](@ref)

###### Selection
- [`@all`](@ref)
- [`@inn`](@ref)
- [`@inn_x`](@ref)
- [`@inn_y`](@ref)

###### Averages
- [`@av`](@ref)
- [`@av_xa`](@ref)
- [`@av_ya`](@ref)
- [`@av_xi`](@ref)
- [`@av_yi`](@ref)

###### Harmonic averages
- [`@harm`](@ref)
- [`@harm_xa`](@ref)
- [`@harm_ya`](@ref)
- [`@harm_xi`](@ref)
- [`@harm_yi`](@ref)

###### Others
- [`@maxloc`](@ref)
- [`@minloc`](@ref)

To see a description of a macro type `?<macroname>` (including the `@`).
"""
module FiniteDifferences2D
export @d_xa, @d_ya, @d_xi, @d_yi, @d2_xi, @d2_yi
export @all, @inn, @inn_x, @inn_y
export @av, @av_xa, @av_ya, @av_xi, @av_yi
export @harm, @harm_xa, @harm_ya, @harm_xi, @harm_yi
export @maxloc, @minloc
export @within

@doc "`@d_xa(A)`: Compute differences between adjacent elements of `A` along the dimension x." :(@d_xa)
@doc "`@d_ya(A)`: Compute differences between adjacent elements of `A` along the dimension y." :(@d_ya)
@doc "`@d_xi(A)`: Compute differences between adjacent elements of `A` along the dimension x and select the inner elements of `A` in the remaining dimension. Corresponds to `@inn_y(@d_xa(A))`." :(@d_xi)
@doc "`@d_yi(A)`: Compute differences between adjacent elements of `A` along the dimension y and select the inner elements of `A` in the remaining dimension. Corresponds to `@inn_x(@d_ya(A))`." :(@d_yi)
@doc "`@d2_xi(A)`: Compute the 2nd order differences between adjacent elements of `A` along the dimension x and select the inner elements of `A` in the remaining dimension. Corresponds to `@inn_y(@d2_xa(A))`." :(@d2_xi)
@doc "`@d2_yi(A)`: Compute the 2nd order differences between adjacent elements of `A` along the dimension y and select the inner elements of `A` in the remaining dimension. Corresponds to `@inn_x(@d2_ya(A))`." :(@d2_yi)
@doc "`@all(A)`: Select all elements of `A`. Corresponds to `A[:,:]`." :(@all)
@doc "`@inn(A)`: Select the inner elements of `A`. Corresponds to `A[2:end-1,2:end-1]`." :(@inn)
@doc "`@inn_x(A)`: Select the inner elements of `A` in dimension x. Corresponds to `A[2:end-1,:]`." :(@inn_x)
@doc "`@inn_y(A)`: Select the inner elements of `A` in dimension y. Corresponds to `A[:,2:end-1]`." :(@inn_y)
@doc "`@av(A)`: Compute averages between adjacent elements of `A` along the dimensions x and y." :(@av)
@doc "`@av_xa(A)`: Compute averages between adjacent elements of `A` along the dimension x." :(@av_xa)
@doc "`@av_ya(A)`: Compute averages between adjacent elements of `A` along the dimension y." :(@av_ya)
@doc "`@av_xi(A)`: Compute averages between adjacent elements of `A` along the dimension x and select the inner elements of `A` in the remaining dimension. Corresponds to `@inn_y(@av_xa(A))`." :(@av_xi)
@doc "`@av_yi(A)`: Compute averages between adjacent elements of `A` along the dimension y and select the inner elements of `A` in the remaining dimension. Corresponds to `@inn_x(@av_ya(A))`." :(@av_yi)
@doc "`@harm(A)`: Compute harmonic averages between adjacent elements of `A` along the dimensions x and y." :(@harm)
@doc "`@harm_xa(A)`: Compute harmonic averages between adjacent elements of `A` along the dimension x." :(@harm_xa)
@doc "`@harm_ya(A)`: Compute harmonic averages between adjacent elements of `A` along the dimension y." :(@harm_ya)
@doc "`@harm_xi(A)`: Compute harmonic averages between adjacent elements of `A` along the dimension x and select the inner elements of `A` in the remaining dimension. Corresponds to `@inn_y(@harm_xa(A))`." :(@harm_xi)
@doc "`@harm_yi(A)`: Compute harmonic averages between adjacent elements of `A` along the dimension y and select the inner elements of `A` in the remaining dimension. Corresponds to `@inn_x(@harm_ya(A))`." :(@harm_yi)
@doc "`@maxloc(A)`: Compute the maximum between 2nd order adjacent elements of `A`, using a moving window of size 3." :(@maxloc)
@doc "`@minloc(A)`: Compute the minimum between 2nd order adjacent elements of `A`, using a moving window of size 3." :(@minloc)

import ..ParallelStencil: INDICES, WITHIN_DOC
ix, iy = INDICES[1], INDICES[2]
ixi, iyi = :($ix+1), :($iy+1)

macro     d_xa(A::Symbol)  esc(:( $A[$ix+1,$iy  ] - $A[$ix  ,$iy ] )) end
macro     d_ya(A::Symbol)  esc(:( $A[$ix  ,$iy+1] - $A[$ix  ,$iy ] )) end
macro     d_xi(A::Symbol)  esc(:( $A[$ix+1,$iyi ] - $A[$ix  ,$iyi] )) end
macro     d_yi(A::Symbol)  esc(:( $A[$ixi ,$iy+1] - $A[$ixi ,$iy ] )) end
macro    d2_xi(A::Symbol)  esc(:( ($A[$ixi+1,$iyi  ] - $A[$ixi ,$iyi])  -  ($A[$ixi ,$iyi] - $A[$ixi-1,$iyi  ]) )) end
macro    d2_yi(A::Symbol)  esc(:( ($A[$ixi  ,$iyi+1] - $A[$ixi ,$iyi])  -  ($A[$ixi ,$iyi] - $A[$ixi  ,$iyi-1]) )) end
macro      all(A::Symbol)  esc(:( $A[$ix  ,$iy  ] )) end
macro      inn(A::Symbol)  esc(:( $A[$ixi ,$iyi ] )) end
macro    inn_x(A::Symbol)  esc(:( $A[$ixi ,$iy  ] )) end
macro    inn_y(A::Symbol)  esc(:( $A[$ix  ,$iyi ] )) end
macro       av(A::Symbol)  esc(:(($A[$ix  ,$iy  ] + $A[$ix+1,$iy  ] + $A[$ix,$iy+1] + $A[$ix+1,$iy+1])/4 )) end
macro    av_xa(A::Symbol)  esc(:(($A[$ix  ,$iy  ] + $A[$ix+1,$iy  ] )/2 )) end
macro    av_ya(A::Symbol)  esc(:(($A[$ix  ,$iy  ] + $A[$ix  ,$iy+1] )/2 )) end
macro    av_xi(A::Symbol)  esc(:(($A[$ix  ,$iyi ] + $A[$ix+1,$iyi ] )/2 )) end
macro    av_yi(A::Symbol)  esc(:(($A[$ixi ,$iy  ] + $A[$ixi ,$iy+1] )/2 )) end
macro     harm(A::Symbol)  esc(:(1/(1/$A[$ix  ,$iy  ] + 1/$A[$ix+1,$iy  ] + 1/$A[$ix,$iy+1] + 1/$A[$ix+1,$iy+1])*4 )) end
macro  harm_xa(A::Symbol)  esc(:(1/(1/$A[$ix  ,$iy  ] + 1/$A[$ix+1,$iy  ] )*2 )) end
macro  harm_ya(A::Symbol)  esc(:(1/(1/$A[$ix  ,$iy  ] + 1/$A[$ix  ,$iy+1] )*2 )) end
macro  harm_xi(A::Symbol)  esc(:(1/(1/$A[$ix  ,$iyi ] + 1/$A[$ix+1,$iyi ] )*2 )) end
macro  harm_yi(A::Symbol)  esc(:(1/(1/$A[$ixi ,$iy  ] + 1/$A[$ixi ,$iy+1] )*2 )) end
macro   maxloc(A::Symbol)  esc(:( max( max( max($A[$ixi-1,$iyi  ], $A[$ixi+1,$iyi  ])  , $A[$ixi  ,$iyi  ] ),
                                            max($A[$ixi  ,$iyi-1], $A[$ixi  ,$iyi+1]) ) )) end
macro   minloc(A::Symbol)  esc(:( min( min( min($A[$ixi-1,$iyi  ], $A[$ixi+1,$iyi  ])  , $A[$ixi  ,$iyi  ] ),
                                            min($A[$ixi  ,$iyi-1], $A[$ixi  ,$iyi+1]) ) )) end

@doc WITHIN_DOC
macro within(macroname::String, A::Symbol)
    if     macroname == "@all"    esc(  :($ix<=size($A,1)   && $iy<=size($A,2)  )  )
    elseif macroname == "@inn"    esc(  :($ix<=size($A,1)-2 && $iy<=size($A,2)-2)  )
    elseif macroname == "@inn_x"  esc(  :($ix<=size($A,1)-2 && $iy<=size($A,2)  )  )
    elseif macroname == "@inn_y"  esc(  :($ix<=size($A,1)   && $iy<=size($A,2)-2)  )
    else error("unkown macroname: $macroname. If you want to add your own assignement macros, overwrite the macro 'within(macroname::String, A::Symbol)'; to still use the exising macro within as well call ParallelStencil.FiniteDifferences{1|2|3}D.@within(macroname, A) at the end.")
    end
end

end # Module FiniteDifferences2D

"""
Module FiniteDifferences3D

Provides macros for 3-D finite differences computations. The dimensions x, y and z refer to the 1st, 2nd and 3rd array dimensions, respectively. The module was designed to enable a math-close notation and, as a consequence, it does not allow for a nested invocation of the provided macros as, e.g., `@inn_z(@inn_y(@d_xa(A)))` (instead, one can simply write `@d_xi(A)`). Additional macros can be straightforwardly added in the caller module to cover eventual missing functionality. The module is intended for usage with the module ParallelStencil.

# Usage
    using ParallelStencil.FiniteDifferences3D

# Macros

###### Differences
- [`@d_xa`](@ref)
- [`@d_ya`](@ref)
- [`@d_za`](@ref)
- [`@d_xi`](@ref)
- [`@d_yi`](@ref)
- [`@d_zi`](@ref)
- [`@d2_xi`](@ref)
- [`@d2_yi`](@ref)
- [`@d2_zi`](@ref)

###### Selection
- [`@all`](@ref)
- [`@inn`](@ref)
- [`@inn_x`](@ref)
- [`@inn_y`](@ref)
- [`@inn_z`](@ref)
- [`@inn_xy`](@ref)
- [`@inn_xz`](@ref)
- [`@inn_yz`](@ref)

###### Averages
- [`@av`](@ref)
- [`@av_xa`](@ref)
- [`@av_ya`](@ref)
- [`@av_za`](@ref)
- [`@av_xi`](@ref)
- [`@av_yi`](@ref)
- [`@av_zi`](@ref)
- [`@av_xya`](@ref)
- [`@av_xza`](@ref)
- [`@av_yza`](@ref)
- [`@av_xyi`](@ref)
- [`@av_xzi`](@ref)
- [`@av_yzi`](@ref)

###### Harmonic averages
- [`@harm`](@ref)
- [`@harm_xa`](@ref)
- [`@harm_ya`](@ref)
- [`@harm_za`](@ref)
- [`@harm_xi`](@ref)
- [`@harm_yi`](@ref)
- [`@harm_zi`](@ref)
- [`@harm_xya`](@ref)
- [`@harm_xza`](@ref)
- [`@harm_yza`](@ref)
- [`@harm_xyi`](@ref)
- [`@harm_xzi`](@ref)
- [`@harm_yzi`](@ref)

###### Others
- [`@maxloc`](@ref)
- [`@minloc`](@ref)

To see a description of a macro type `?<macroname>` (including the `@`).
"""
module FiniteDifferences3D
export @d_xa, @d_ya, @d_za, @d_xi, @d_yi, @d_zi, @d2_xi, @d2_yi, @d2_zi
export @all, @inn, @inn_x, @inn_y, @inn_z, @inn_xy, @inn_xz, @inn_yz
export @av, @av_xa, @av_ya, @av_za, @av_xi, @av_yi, @av_zi, @av_xya, @av_xza, @av_yza, @av_xyi, @av_xzi, @av_yzi #, @av_xya2, @av_xza2, @av_yza2
export @harm, @harm_xa, @harm_ya, @harm_za, @harm_xi, @harm_yi, @harm_zi, @harm_xya, @harm_xza, @harm_yza, @harm_xyi, @harm_xzi, @harm_yzi 
export @maxloc, @minloc
export @within

@doc "`@d_xa(A)`: Compute differences between adjacent elements of `A` along the dimension x." :(@d_xa)
@doc "`@d_ya(A)`: Compute differences between adjacent elements of `A` along the dimension y." :(@d_ya)
@doc "`@d_za(A)`: Compute differences between adjacent elements of `A` along the dimension z." :(@d_za)
@doc "`@d_xi(A)`: Compute differences between adjacent elements of `A` along the dimension x and select the inner elements of `A` in the remaining dimensions. Corresponds to `@inn_yz(@d_xa(A))`." :(@d_xi)
@doc "`@d_yi(A)`: Compute differences between adjacent elements of `A` along the dimension y and select the inner elements of `A` in the remaining dimensions. Corresponds to `@inn_xz(@d_ya(A))`." :(@d_yi)
@doc "`@d_zi(A)`: Compute differences between adjacent elements of `A` along the dimension z and select the inner elements of `A` in the remaining dimensions. Corresponds to `@inn_xy(@d_za(A))`." :(@d_zi)
@doc "`@d2_xi(A)`: Compute the 2nd order differences between adjacent elements of `A` along the dimension x and select the inner elements of `A` in the remaining dimensions. Corresponds to `@inn_yz(@d2_xa(A))`." :(@d2_xi)
@doc "`@d2_yi(A)`: Compute the 2nd order differences between adjacent elements of `A` along the dimension y and select the inner elements of `A` in the remaining dimensions. Corresponds to `@inn_xz(@d2_ya(A))`." :(@d2_yi)
@doc "`@d2_zi(A)`: Compute the 2nd order differences between adjacent elements of `A` along the dimension y and select the inner elements of `A` in the remaining dimensions. Corresponds to `@inn_xy(@d2_za(A))`." :(@d2_zi)
@doc "`@all(A)`: Select all elements of `A`. Corresponds to `A[:,:,:]`." :(@all)
@doc "`@inn(A)`: Select the inner elements of `A`. Corresponds to `A[2:end-1,2:end-1,2:end-1]`." :(@inn)
@doc "`@inn_x(A)`: Select the inner elements of `A` in dimension x. Corresponds to `A[2:end-1,:,:]`." :(@inn_x)
@doc "`@inn_y(A)`: Select the inner elements of `A` in dimension y. Corresponds to `A[:,2:end-1,:]`." :(@inn_y)
@doc "`@inn_z(A)`: Select the inner elements of `A` in dimension z. Corresponds to `A[:,:,2:end-1]`." :(@inn_z)
@doc "`@inn_xy(A)`: Select the inner elements of `A` in dimensions x and y. Corresponds to `A[2:end-1,2:end-1,:]`." :(@inn_xy)
@doc "`@inn_xz(A)`: Select the inner elements of `A` in dimensions x and z. Corresponds to `A[2:end-1,:,2:end-1]`." :(@inn_xz)
@doc "`@inn_yz(A)`: Select the inner elements of `A` in dimensions y and z. Corresponds to `A[:,2:end-1,2:end-1]`." :(@inn_yz)
@doc "`@av(A)`: Compute averages between adjacent elements of `A` along the dimensions x and y and z." :(@av)
@doc "`@av_xa(A)`: Compute averages between adjacent elements of `A` along the dimension x." :(@av_xa)
@doc "`@av_ya(A)`: Compute averages between adjacent elements of `A` along the dimension y." :(@av_ya)
@doc "`@av_za(A)`: Compute averages between adjacent elements of `A` along the dimension z." :(@av_za)
@doc "`@av_xi(A)`: Compute averages between adjacent elements of `A` along the dimension x and select the inner elements of `A` in the remaining dimensions. Corresponds to `@inn_yz(@av_xa(A))`." :(@av_xi)
@doc "`@av_yi(A)`: Compute averages between adjacent elements of `A` along the dimension y and select the inner elements of `A` in the remaining dimensions. Corresponds to `@inn_xz(@av_ya(A))`." :(@av_yi)
@doc "`@av_zi(A)`: Compute averages between adjacent elements of `A` along the dimension z and select the inner elements of `A` in the remaining dimensions. Corresponds to `@inn_xy(@av_za(A))`." :(@av_zi)
@doc "`@av_xya(A)`: Compute averages between adjacent elements of `A` along the dimensions x and y." :(@av_xya)
@doc "`@av_xza(A)`: Compute averages between adjacent elements of `A` along the dimensions x and z." :(@av_xza)
@doc "`@av_yza(A)`: Compute averages between adjacent elements of `A` along the dimensions y and z." :(@av_yza)
@doc "`@av_xyi(A)`: Compute averages between adjacent elements of `A` along the dimensions x and y and select the inner elements of `A` in the remaining dimension. Corresponds to `@inn_z(@av_xya(A))`." :(@av_xyi)
@doc "`@av_xzi(A)`: Compute averages between adjacent elements of `A` along the dimensions x and z and select the inner elements of `A` in the remaining dimension. Corresponds to `@inn_y(@av_xza(A))`." :(@av_xzi)
@doc "`@av_yzi(A)`: Compute averages between adjacent elements of `A` along the dimensions y and z and select the inner elements of `A` in the remaining dimension. Corresponds to `@inn_x(@av_yza(A))`." :(@av_yzi)
@doc "`@harm(A)`: Compute harmonic averages between adjacent elements of `A` along the dimensions x and y and z." :(@harm)
@doc "`@harm_xa(A)`: Compute harmonic averages between adjacent elements of `A` along the dimension x." :(@harm_xa)
@doc "`@harm_ya(A)`: Compute harmonic averages between adjacent elements of `A` along the dimension y." :(@harm_ya)
@doc "`@harm_za(A)`: Compute harmonic averages between adjacent elements of `A` along the dimension z." :(@harm_za)
@doc "`@harm_xi(A)`: Compute harmonic averages between adjacent elements of `A` along the dimension x and select the inner elements of `A` in the remaining dimensions. Corresponds to `@inn_yz(@harm_xa(A))`." :(@harm_xi)
@doc "`@harm_yi(A)`: Compute harmonic averages between adjacent elements of `A` along the dimension y and select the inner elements of `A` in the remaining dimensions. Corresponds to `@inn_xz(@harm_ya(A))`." :(@harm_yi)
@doc "`@harm_zi(A)`: Compute harmonic averages between adjacent elements of `A` along the dimension z and select the inner elements of `A` in the remaining dimensions. Corresponds to `@inn_xy(@harm_za(A))`." :(@harm_zi)
@doc "`@harm_xya(A)`: Compute harmonic averages between adjacent elements of `A` along the dimensions x and y." :(@harm_xya)
@doc "`@harm_xza(A)`: Compute harmonic averages between adjacent elements of `A` along the dimensions x and z." :(@harm_xza)
@doc "`@harm_yza(A)`: Compute harmonic averages between adjacent elements of `A` along the dimensions y and z." :(@harm_yza)
@doc "`@harm_xyi(A)`: Compute harmonic averages between adjacent elements of `A` along the dimensions x and y and select the inner elements of `A` in the remaining dimension. Corresponds to `@inn_z(@harm_xya(A))`." :(@harm_xyi)
@doc "`@harm_xzi(A)`: Compute harmonic averages between adjacent elements of `A` along the dimensions x and z and select the inner elements of `A` in the remaining dimension. Corresponds to `@inn_y(@harm_xza(A))`." :(@harm_xzi)
@doc "`@harm_yzi(A)`: Compute harmonic averages between adjacent elements of `A` along the dimensions y and z and select the inner elements of `A` in the remaining dimension. Corresponds to `@inn_x(@harm_yza(A))`." :(@harm_yzi)
@doc "`@maxloc(A)`: Compute the maximum between 2nd order adjacent elements of `A`, using a moving window of size 3." :(@maxloc)
@doc "`@minloc(A)`: Compute the minimum between 2nd order adjacent elements of `A`, using a moving window of size 3." :(@minloc)

import ..ParallelStencil: INDICES, WITHIN_DOC
ix, iy, iz = INDICES[1], INDICES[2], INDICES[3]
ixi, iyi, izi = :($ix+1), :($iy+1), :($iz+1)

macro     d_xa(A::Symbol)   esc(:( $A[$ix+1,$iy  ,$iz  ] - $A[$ix  ,$iy  ,$iz  ] )) end
macro     d_ya(A::Symbol)   esc(:( $A[$ix  ,$iy+1,$iz  ] - $A[$ix  ,$iy  ,$iz  ] )) end
macro     d_za(A::Symbol)   esc(:( $A[$ix  ,$iy  ,$iz+1] - $A[$ix  ,$iy  ,$iz  ] )) end
macro     d_xi(A::Symbol)   esc(:( $A[$ix+1,$iyi ,$izi ] - $A[$ix  ,$iyi ,$izi ] )) end
macro     d_yi(A::Symbol)   esc(:( $A[$ixi ,$iy+1,$izi ] - $A[$ixi ,$iy  ,$izi ] )) end
macro     d_zi(A::Symbol)   esc(:( $A[$ixi ,$iyi ,$iz+1] - $A[$ixi ,$iyi ,$iz  ] )) end
macro    d2_xi(A::Symbol)   esc(:( ($A[$ixi+1,$iyi  ,$izi  ] - $A[$ixi ,$iyi ,$izi ])  -  ($A[$ixi ,$iyi ,$izi ] - $A[$ixi-1,$iyi  ,$izi  ]) )) end
macro    d2_yi(A::Symbol)   esc(:( ($A[$ixi  ,$iyi+1,$izi  ] - $A[$ixi ,$iyi ,$izi ])  -  ($A[$ixi ,$iyi ,$izi ] - $A[$ixi  ,$iyi-1,$izi  ]) )) end
macro    d2_zi(A::Symbol)   esc(:( ($A[$ixi  ,$iyi  ,$izi+1] - $A[$ixi ,$iyi ,$izi ])  -  ($A[$ixi ,$iyi ,$izi ] - $A[$ixi  ,$iyi  ,$izi-1]) )) end
macro      all(A::Symbol)   esc(:( $A[$ix  ,$iy  ,$iz  ] )) end
macro      inn(A::Symbol)   esc(:( $A[$ixi ,$iyi ,$izi ] )) end
macro    inn_x(A::Symbol)   esc(:( $A[$ixi ,$iy  ,$iz  ] )) end
macro    inn_y(A::Symbol)   esc(:( $A[$ix  ,$iyi ,$iz  ] )) end
macro    inn_z(A::Symbol)   esc(:( $A[$ix  ,$iy  ,$izi ] )) end
macro   inn_xy(A::Symbol)   esc(:( $A[$ixi ,$iyi ,$iz  ] )) end
macro   inn_xz(A::Symbol)   esc(:( $A[$ixi ,$iy  ,$izi ] )) end
macro   inn_yz(A::Symbol)   esc(:( $A[$ix  ,$iyi ,$izi ] )) end
macro       av(A::Symbol)   esc(:(($A[$ix  ,$iy  ,$iz  ] + $A[$ix+1,$iy  ,$iz  ] +
                                   $A[$ix+1,$iy+1,$iz  ] + $A[$ix+1,$iy+1,$iz+1] +
                                   $A[$ix  ,$iy+1,$iz+1] + $A[$ix  ,$iy  ,$iz+1] +
                                   $A[$ix+1,$iy  ,$iz+1] + $A[$ix  ,$iy+1,$iz  ] )/8)) end
macro    av_xa(A::Symbol)   esc(:(($A[$ix  ,$iy  ,$iz  ] + $A[$ix+1,$iy  ,$iz  ] )/2 )) end
macro    av_ya(A::Symbol)   esc(:(($A[$ix  ,$iy  ,$iz  ] + $A[$ix  ,$iy+1,$iz  ] )/2 )) end
macro    av_za(A::Symbol)   esc(:(($A[$ix  ,$iy  ,$iz  ] + $A[$ix  ,$iy  ,$iz+1] )/2 )) end
macro    av_xi(A::Symbol)   esc(:(($A[$ix  ,$iyi ,$izi ] + $A[$ix+1,$iyi ,$izi ] )/2 )) end
macro    av_yi(A::Symbol)   esc(:(($A[$ixi ,$iy  ,$izi ] + $A[$ixi ,$iy+1,$izi ] )/2 )) end
macro    av_zi(A::Symbol)   esc(:(($A[$ixi ,$iyi ,$iz  ] + $A[$ixi ,$iyi ,$iz+1] )/2 )) end
macro   av_xya(A::Symbol)   esc(:(($A[$ix  ,$iy  ,$iz  ] + $A[$ix+1,$iy  ,$iz  ] +
                                   $A[$ix  ,$iy+1,$iz  ] + $A[$ix+1,$iy+1,$iz  ] )/4 )) end
macro   av_xza(A::Symbol)   esc(:(($A[$ix  ,$iy  ,$iz  ] + $A[$ix+1,$iy  ,$iz  ] +
                                   $A[$ix  ,$iy  ,$iz+1] + $A[$ix+1,$iy  ,$iz+1] )/4 )) end
macro   av_yza(A::Symbol)   esc(:(($A[$ix  ,$iy  ,$iz  ] + $A[$ix  ,$iy+1,$iz  ] +
                                   $A[$ix  ,$iy  ,$iz+1] + $A[$ix  ,$iy+1,$iz+1] )/4 )) end
macro   av_xyi(A::Symbol)   esc(:(($A[$ix  ,$iy  ,$izi ] + $A[$ix+1,$iy  ,$izi ] +
                                   $A[$ix  ,$iy+1,$izi ] + $A[$ix+1,$iy+1,$izi ] )/4 )) end
macro   av_xzi(A::Symbol)   esc(:(($A[$ix  ,$iyi ,$iz  ] + $A[$ix+1,$iyi ,$iz  ] +
                                   $A[$ix  ,$iyi ,$iz+1] + $A[$ix+1,$iyi ,$iz+1] )/4 )) end
macro   av_yzi(A::Symbol)   esc(:(($A[$ixi ,$iy  ,$iz  ] + $A[$ixi ,$iy+1,$iz  ] +
                                   $A[$ixi ,$iy  ,$iz+1] + $A[$ixi ,$iy+1,$iz+1] )/4 )) end
macro     harm(A::Symbol)  esc(:(1/(1/$A[$ix  ,$iy  ,$iz  ] + 1/$A[$ix+1,$iy  ,$iz  ] +
                                      1/$A[$ix+1,$iy+1,$iz  ] + 1/$A[$ix+1,$iy+1,$iz+1] +
                                      1/$A[$ix  ,$iy+1,$iz+1] + 1/$A[$ix  ,$iy  ,$iz+1] +
                                      1/$A[$ix+1,$iy  ,$iz+1] + 1/$A[$ix  ,$iy+1,$iz  ] )*8)) end
macro  harm_xa(A::Symbol)  esc(:(1/(1/$A[$ix  ,$iy  ,$iz  ] + 1/$A[$ix+1,$iy  ,$iz  ] )*2 )) end
macro  harm_ya(A::Symbol)  esc(:(1/(1/$A[$ix  ,$iy  ,$iz  ] + 1/$A[$ix  ,$iy+1,$iz  ] )*2 )) end
macro  harm_za(A::Symbol)  esc(:(1/(1/$A[$ix  ,$iy  ,$iz  ] + 1/$A[$ix  ,$iy  ,$iz+1] )*2 )) end
macro  harm_xi(A::Symbol)  esc(:(1/(1/$A[$ix  ,$iyi ,$izi ] + 1/$A[$ix+1,$iyi ,$izi ] )*2 )) end
macro  harm_yi(A::Symbol)  esc(:(1/(1/$A[$ixi ,$iy  ,$izi ] + 1/$A[$ixi ,$iy+1,$izi ] )*2 )) end
macro  harm_zi(A::Symbol)  esc(:(1/(1/$A[$ixi ,$iyi ,$iz  ] + 1/$A[$ixi ,$iyi ,$iz+1] )*2 )) end
macro harm_xya(A::Symbol)  esc(:(1/(1/$A[$ix  ,$iy  ,$iz  ] + 1/$A[$ix+1,$iy  ,$iz  ] +
                                      1/$A[$ix  ,$iy+1,$iz  ] + 1/$A[$ix+1,$iy+1,$iz  ] )*4 )) end
macro harm_xza(A::Symbol)  esc(:(1/(1/$A[$ix  ,$iy  ,$iz  ] + 1/$A[$ix+1,$iy  ,$iz  ] +
                                      1/$A[$ix  ,$iy  ,$iz+1] + 1/$A[$ix+1,$iy  ,$iz+1] )*4 )) end
macro harm_yza(A::Symbol)  esc(:(1/(1/$A[$ix  ,$iy  ,$iz  ] + 1/$A[$ix  ,$iy+1,$iz  ] +
                                      1/$A[$ix  ,$iy  ,$iz+1] + 1/$A[$ix  ,$iy+1,$iz+1] )*4 )) end
macro harm_xyi(A::Symbol)  esc(:(1/(1/$A[$ix  ,$iy  ,$izi ] + 1/$A[$ix+1,$iy  ,$izi ] +
                                      1/$A[$ix  ,$iy+1,$izi ] + 1/$A[$ix+1,$iy+1,$izi ] )*4 )) end
macro harm_xzi(A::Symbol)  esc(:(1/(1/$A[$ix  ,$iyi ,$iz  ] + 1/$A[$ix+1,$iyi ,$iz  ] +
                                      1/$A[$ix  ,$iyi ,$iz+1] + 1/$A[$ix+1,$iyi ,$iz+1] )*4 )) end
macro harm_yzi(A::Symbol)  esc(:(1/(1/$A[$ixi ,$iy  ,$iz  ] + 1/$A[$ixi ,$iy+1,$iz  ] +
                                      1/$A[$ixi ,$iy  ,$iz+1] + 1/$A[$ixi ,$iy+1,$iz+1] )*4 )) end
macro   maxloc(A::Symbol)  esc(:( max( max( max( max($A[$ixi-1,$iyi  ,$izi  ], $A[$ixi+1,$iyi  ,$izi  ])  , $A[$ixi  ,$iyi  ,$izi  ] ),
                                              max($A[$ixi  ,$iyi-1,$izi  ], $A[$ixi  ,$iyi+1,$izi  ]) ),
                                              max($A[$ixi  ,$iyi  ,$izi-1], $A[$ixi  ,$iyi  ,$izi+1]) ) )) end
macro   minloc(A::Symbol)  esc(:( min( min( min( min($A[$ixi-1,$iyi  ,$izi  ], $A[$ixi+1,$iyi  ,$izi  ])  , $A[$ixi  ,$iyi  ,$izi  ] ),
                                              min($A[$ixi  ,$iyi-1,$izi  ], $A[$ixi  ,$iyi+1,$izi  ]) ),
                                              min($A[$ixi  ,$iyi  ,$izi-1], $A[$ixi  ,$iyi  ,$izi+1]) ) )) end

@doc WITHIN_DOC
macro within(macroname::String, A::Symbol)
    if     macroname == "@all"     esc(  :($ix<=size($A,1)   && $iy<=size($A,2)   && $iz<=size($A,3)  )  )
    elseif macroname == "@inn"     esc(  :($ix<=size($A,1)-2 && $iy<=size($A,2)-2 && $iz<=size($A,3)-2)  )
    elseif macroname == "@inn_x"   esc(  :($ix<=size($A,1)-2 && $iy<=size($A,2)   && $iz<=size($A,3)  )  )
    elseif macroname == "@inn_y"   esc(  :($ix<=size($A,1)   && $iy<=size($A,2)-2 && $iz<=size($A,3)  )  )
    elseif macroname == "@inn_z"   esc(  :($ix<=size($A,1)   && $iy<=size($A,2)   && $iz<=size($A,3)-2)  )
    elseif macroname == "@inn_xy"  esc(  :($ix<=size($A,1)-2 && $iy<=size($A,2)-2 && $iz<=size($A,3)  )  )
    elseif macroname == "@inn_xz"  esc(  :($ix<=size($A,1)-2 && $iy<=size($A,2)   && $iz<=size($A,3)-2)  )
    elseif macroname == "@inn_yz"  esc(  :($ix<=size($A,1)   && $iy<=size($A,2)-2 && $iz<=size($A,3)-2)  )
    else error("unkown macroname: $macroname. If you want to add your own assignement macros, overwrite the macro 'within(macroname::String, A::Symbol)'; to still use the exising macro within as well call ParallelStencil.FiniteDifferences{1|2|3}D.@within(macroname, A) at the end.")
    end
end

end # Module FiniteDifferences3D

omlins commented 1 year ago

Hi @LaurentPlagne

I noticed that src/FiniteDifferences.jluses Float64 literals like 8.0,4.0,1.0, 0.5, 0.25, 0.125. Is it possible that the slowdown whith Float32 Numbers is related to conversions happening in the kernels. In this case, it could be beneficial to make the literals more generic (e.g. 0.5 -> inv(eltype($A)(2))

Literals are automatically casted at pre compilation time to the right precision by ParallelStencil. This is the reason why we have these literals in the FiniteDifferences module - easier to read and write than for example inv(eltype($A)(2). Thanks for the comment though!

Unfortunately, it is a long standing issue that single precision performance is significantly below double precision performance when using CUDA.jl. Whether it is also a long standing issue for CPU I am not sure, but it but it might well be. In any case, this is not an issue with ParallelStencil, but with the used backend packages for parallelization or julia in general. If you have a good MWE that shows the issue without ParallelStencil, then it would be good to open an issue at the corresponding repository.

omlins commented 1 year ago

The time is consumed by the getindex calls which increases from Float64to Float32 (which makes no sense to me).

@LaurentPlagne It would be great if you could report this finding in the corresponding repository!

LaurentPlagne commented 1 year ago

Thank you for the explanation. FWIW if you take a closer look to the last proposal for Finite differences.jl, is just to replace *0.5 (and friends) by /2 (and friends). So IMHO nicer to read ;)

omlins commented 1 year ago

Thank you for the explanation. FWIW if you take a closer look to the last proposal for Finite differences.jl, is just to replace *0.5 (and friends) by /2 (and friends). So IMHO nicer to read ;)

@LaurentPlagne True, however, ParallelStencil's magic can only replace floats with floats and in integers with integers; also, divisions are not converted..

LaurentPlagne commented 1 year ago

Interesting. For CPU I think that Julia's magic removes all these divisions (and conversions) so /2 is equivalent to *0.5 (but more generic). You say that is it not the case for GPU... good to know.

omlins / ParallelStencil.jl

Thread (CPU) Float32/Float64 performance comparison on miniapp acoustic2D #88