omlins / ParallelStencil.jl

Package for writing high-level code for parallel high-performance stencil computations that can be deployed on both GPUs and CPUs
BSD 3-Clause "New" or "Revised" License

Performance regression with nested functions on v0.13 using CPU #154

Closed. albert-de-montserrat closed this issue 3 months ago

albert-de-montserrat commented 3 months ago

We are seeing a lot of performance regressions in JustRelax.jl after switching from v0.12.1 to v0.13.0. For starters, our CI went from 12 min to nearly 4 h.

As a more concrete example, take the first time step of the heat diffusion solve (3D, with a 32^2 resolution), using 4 threads in both cases:

What changed in v0.13? Did Polyester actually become the default?

albert-de-montserrat commented 3 months ago

The regression was introduced by #152

albert-de-montserrat commented 3 months ago

This is an MWE that triggers the regression. It looks like #152 changed how indices are handled, and they are no longer captured well by closures:

using Chairmarks # provides the @b benchmarking macro used below
using ParallelStencil
using ParallelStencil.FiniteDifferences2D
@init_parallel_stencil(Threads, Float64, 2)

@parallel_indices (i,j) function foo1!(A::AbstractArray{T,2}, B::AbstractArray{T,2}) where T
    A[i, j] = B[i+1,j] - B[i,j]

    nothing
end

@parallel_indices (i,j) function foo2!(A::AbstractArray{T,2}, B::AbstractArray{T,2}) where T
    dx(B) = B[i+1,j] - B[i,j]

    A[i, j] = dx(B)
    nothing
end

n = 32
A = zeros(n, n)
B = zeros(n, n)

r = 1:n-1, 1:n-1

@b  @parallel $r foo1!($(A, B)...) # 2.489 μs (31 allocs: 4.031 KiB)
@b  @parallel $r foo2!($(A, B)...) # 19.000 μs (3906 allocs: 64.578 KiB)
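
The two kernel styles in the MWE can be compared outside of ParallelStencil. The following is a minimal sketch in plain Julia (ordinary loops, no `@parallel_indices`) that contrasts a helper capturing `i` and `j` as a closure with a helper that takes the indices as explicit arguments. The names `diff_closure!`, `diff_explicit!` and `dx_at` are hypothetical and not part of ParallelStencil. Whether the explicit-argument form avoids the regression under v0.13.0 is an assumption that has not been measured here.

```julia
# Variant 1: the helper is a local closure that captures the loop indices.
function diff_closure!(A, B)
    n1, n2 = size(A)
    for j in 1:n2-1, i in 1:n1-1
        dx_local(B) = B[i+1, j] - B[i, j]  # captures i and j from the enclosing scope
        A[i, j] = dx_local(B)
    end
    return nothing
end

# Variant 2: the helper receives the indices explicitly, so nothing is captured.
@inline dx_at(B, i, j) = B[i+1, j] - B[i, j]

function diff_explicit!(A, B)
    n1, n2 = size(A)
    for j in 1:n2-1, i in 1:n1-1
        A[i, j] = dx_at(B, i, j)
    end
    return nothing
end

n = 32
B = rand(n, n)
A_closure = zeros(n, n)
A_explicit = zeros(n, n)

diff_closure!(A_closure, B)
diff_explicit!(A_explicit, B)

# Both variants perform the same arithmetic, so the results must agree exactly.
@assert A_closure == A_explicit
```

Both variants compute the same finite difference over the same index range as `r` in the MWE, so only the way the helper obtains `i` and `j` differs.
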

omlins commented 3 months ago

@albert-de-montserrat: sorry for the delay, I have been at JuliaCon and on vacation. I will try to fix this ASAP.

omlins commented 3 months ago

Pull request #155 fixes the issue:

julia> @belapsed  @parallel $r foo1!($(A_ref, B)...) # 2.489 μs (31 allocs: 4.031 KiB)
2.6308888888888887e-6

julia> @belapsed  @parallel $r foo2!($(A, B)...) # 19.000 μs (3906 allocs: 64.578 KiB)
2.6406666666666666e-6

julia> A_ref == A
true

And with Polyester we get:

julia> @belapsed  @parallel $r foo1!($(A_ref, B)...) # 2.489 μs (31 allocs: 4.031 KiB)
5.748633879781421e-7

julia> @belapsed  @parallel $r foo2!($(A, B)...) # 19.000 μs (3906 allocs: 64.578 KiB)
5.563548387096774e-7

julia> A_ref == A
true

omlins commented 3 months ago

Solved in v0.13.2