omlins / ParallelStencil.jl

Package for writing high-level code for parallel high-performance stencil computations that can be deployed on both GPUs and CPUs
BSD 3-Clause "New" or "Revised" License

Passing `struct`s into `@parallel` functions #53

Open smartalecH opened 2 years ago

smartalecH commented 2 years ago

I have several @parallel functions that each take several arguments (of type Data.Array). To make the code cleaner, I (naively) tried passing a single struct containing all of these arguments, along the lines of the following sketch (the names and kernel body are illustrative):
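
```julia
using ParallelStencil
using ParallelStencil.FiniteDifferences2D
@init_parallel_stencil(Threads, Float64, 2)

# A struct bundling the kernel arguments (hypothetical field names).
struct Fields
    T::Data.Array
    T2::Data.Array
end

# Indexing the struct's fields inside the stencil macros is what fails:
@parallel function step!(F::Fields)
    @inn(F.T2) = @inn(F.T) + 0.1 * (@d2_xi(F.T) + @d2_yi(F.T))
    return
end
```

Unfortunately, I get the following error: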

ERROR: LoadError: MethodError: no method matching var"@within"(::LineNumberNode, ::Module, ::String, ::Expr)
Closest candidates are:
  var"@within"(::LineNumberNode, ::Module, ::String, ::Symbol) at ~/.julia/packages/ParallelStencil/3flwf/src/FiniteDifferences.jl:153

Fundamentally, I'm assuming this has something to do with how the struct lives in memory and isn't compatible with the way ParallelStencil distributes Data.Arrays. Is there a way around this? Perhaps there's a more "canonical" way to accomplish this?

Could you also explain the role of the @within macro, so that I can better understand the limitations of @parallel functions (and how to properly use them)?

Thanks for the great package!

luraess commented 2 years ago

Hi @smartalecH, the issue you see when passing a struct to a @parallel kernel relates to a compliance limitation of CUDA.jl when it comes to passing structs to GPU kernels. From the module documentation (callable from the Julia REPL): kernels declared using @parallel accept Data.Array and Data.Number arguments, and should contain only stencil computations to be performed with one of the submodules ParallelStencil.FiniteDifferences{1D|2D|3D}.
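
For reference, here is a minimal sketch of the supported pattern (a simple 2-D diffusion step with illustrative names), where each array and scalar is passed explicitly:

```julia
using ParallelStencil
using ParallelStencil.FiniteDifferences2D
@init_parallel_stencil(Threads, Float64, 2)

# Compliant: each Data.Array and Data.Number argument is passed explicitly.
@parallel function diffusion_step!(T2::Data.Array, T::Data.Array, lam::Data.Number, dt::Data.Number)
    @inn(T2) = @inn(T) + dt * lam * (@d2_xi(T) + @d2_yi(T))
    return
end

T  = @rand(64, 64)
T2 = @zeros(64, 64)
@parallel diffusion_step!(T2, T, 1.0, 0.1)
```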

We are currently developing proper support for different kinds of data types within ParallelStencil, notably for logical arrays of structs (or of small arrays), which can be laid out in memory either as arrays of structs or as structs of arrays, accounting for the fact that each of these allocation approaches has use cases where it performs best.

The following shows, e.g., how an array of small arrays can be created in the future using a purely declarative allocation:

julia> using ParallelStencil

julia> @init_parallel_stencil(package=CUDA, ndims=2)

julia> A = @rand(2, 2, numbertype=Float16, celldims=(2,3))
2×2 CUDA.CuArray{StaticArrays.SMatrix{2, 3, Float16, 6}, 2}:
 [0.877 0.7266 0.007324; 0.5176 0.3496 0.541]  …  [0.8716 0.7246 0.4375; 0.662 0.4355 0.2075]
 [0.5244 0.2056 0.1436; 0.891 0.0757 0.882]       [0.9214 0.5864 0.206; 0.2124 0.1855 0.4756]

julia> A[2,1]
2×3 StaticArrays.SMatrix{2, 3, Float16, 6} with indices SOneTo(2)×SOneTo(3):
 0.5244  0.2056  0.1436
 0.891   0.0757  0.882

julia> A[2,1][2]
Float16(0.891)

Logical arrays of structs will be allocatable in a similar way. Note that this is not yet merged into the master branch of ParallelStencil, but should be ready within a few weeks. In any case, we plan to present this at JuliaCon 2022.

Meanwhile, you will need to manually make sure to comply with the requirements of CUDA.jl. The following documentation explains one solution: https://cuda.juliagpu.org/stable/tutorials/custom_structs/. A MWE implementing it in ParallelStencil using a @parallel_indices function declaration can be found in this gist; a sketch of the approach follows.
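
In essence, the approach looks as follows (a sketch only, with hypothetical struct and field names; the linked gist is the reference implementation):

```julia
using CUDA, Adapt, ParallelStencil
@init_parallel_stencil(CUDA, Float64, 2)

# Parametric field types, so Adapt can swap CuArray for the device-side array type.
struct Fields{TA}
    A::TA
    B::TA
end

# Teach Adapt.jl to convert the struct at kernel launch
# (see the CUDA.jl custom_structs tutorial linked above).
Adapt.adapt_structure(to, f::Fields) = Fields(Adapt.adapt(to, f.A), Adapt.adapt(to, f.B))

# A @parallel_indices kernel indexes explicitly, so struct fields can be used directly.
@parallel_indices (ix, iy) function assign!(f::Fields)
    f.A[ix, iy] = 2.0 * f.B[ix, iy]
    return
end

nx, ny = 32, 32
f = Fields(@zeros(nx, ny), @rand(nx, ny))
@parallel (1:nx, 1:ny) assign!(f)
```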

Regarding the @within macro, the REPL-callable doc returns

help?> @within
  @within(macroname::String, A::Symbol)

  Return an expression that evaluates to true if the indices generated by @parallel (module ParallelStencil) point to elements in bounds
  of the selection of A by macroname.

  │ Warning
  │
  │  This macro is not intended for explicit manual usage. Calls to it are automatically added by @parallel where required.
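
To give an intuition (a rough illustration only, not the exact code that @parallel generates): a statement like @inn(T2) = ... gets wrapped in a bounds guard so that each thread only writes when its indices fall on the inner points of T2. Expanded by hand into a @parallel_indices kernel, the guarded update has roughly this shape:

```julia
using ParallelStencil
@init_parallel_stencil(Threads, Float64, 2)

# Roughly what @parallel generates from `@inn(T2) = @inn(T) + dt*lam*(@d2_xi(T) + @d2_yi(T))`:
# the stencil statement is wrapped in a bounds guard equivalent to @within("@inn", T2).
@parallel_indices (ix, iy) function diffusion_step_expanded!(T2, T, lam, dt)
    if ix <= size(T2, 1) - 2 && iy <= size(T2, 2) - 2          # ≈ @within("@inn", T2)
        T2[ix+1, iy+1] = T[ix+1, iy+1] + dt * lam * (
            (T[ix+2, iy+1] - 2.0*T[ix+1, iy+1] + T[ix,   iy+1]) +
            (T[ix+1, iy+2] - 2.0*T[ix+1, iy+1] + T[ix+1, iy  ]) )
    end
    return
end

T  = @rand(64, 64)
T2 = @zeros(64, 64)
@parallel (1:64, 1:64) diffusion_step_expanded!(T2, T, 1.0, 0.1)
```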

Hope this info helps.

smartalecH commented 2 years ago

Wow, thank you for the very comprehensive response!

relates to a compliance limitation of CUDA.jl

This makes sense, thanks.

We are currently developing proper support for different kind of data types within ParallelStencil

Interesting. Let me know if I can help at all. Happy to contribute.

In any case, we plan to present this at JuliaCon 2022.

Very cool. If you get a chance, please post a link to the corresponding paper (and youtube talk once available).

A MWE implementing it in ParallelStencil using @parallel_indices function declaration can be found in this gist.

Thanks for pulling this together. Very helpful!

Hope this info helps.

Very much so, thanks! I'll keep this issue open then at least until the change you mention is merged into master.

smartalecH commented 2 years ago

kernels declared using @parallel accept Data.Array and Data.Number arguments, and should contain only stencil computations to be performed with one of the submodules ParallelStencil.FiniteDifferences{1D|2D|3D}.

On a similar note, suppose I want to initialize a "geometry" array (e.g., solving the Poisson equation ∇⋅(c∇u) = f, where c describes a piecewise-constant "geometry"). The GeometryPrimitives package is great for this and allows me to describe rather sophisticated geometries.

Typically (in C/C++) I try to initialize such arrays using the same formalism as the code that operates on them, so that they get allocated nearest to the respective core (minimizing any broadcasting and communication between cores or processes). Ideally, I would also use a @parallel construct to allocate my c array, but that would require passing non-ParallelStencil data structures (in this case, a geometry "tree" with geometry objects) to that function, which obviously won't work with the current codebase, as stated above.

What do you guys typically do in these situations? Have you tried interfacing directly with external geometry libraries, or do you typically start with a native CPU array and broadcast when needed (as done here)?

Thanks for all of the help!

luraess commented 2 years ago

What do you guys typically do in these situations?

Up to now, our approach has been to do most of the initialisation using native CPU arrays as you describe, broadcasting the final result to the Data.Array that will actually be used in the performance-critical parts. Some initialisation could also be performed in a @parallel function, and should be done as such if it becomes a performance bottleneck. But most often one may want to prioritise flexibility over performance for initialisation; a minimal sketch of this workflow follows.
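
For example (the circular-inclusion geometry and all names are illustrative; with the CUDA backend, the final conversion performs the copy to the GPU):

```julia
using ParallelStencil
@init_parallel_stencil(Threads, Float64, 2)

nx, ny = 128, 128
dx, dy = 1.0/nx, 1.0/ny

# Flexible initialisation on the host with a plain Julia array ...
c_cpu = ones(nx, ny)
for iy in 1:ny, ix in 1:nx
    x, y = (ix - 0.5)*dx, (iy - 0.5)*dy
    if (x - 0.5)^2 + (y - 0.5)^2 < 0.2^2   # e.g., a piecewise-constant inclusion
        c_cpu[ix, iy] = 10.0
    end
end

# ... then convert the final result once to the backend's array type.
c = Data.Array(c_cpu)
```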

Ultimately, every user is free to pick their preferred workflow, being able to broadcast the final result to the desired type and backend.

Have you tried interfacing directly with external geometry libraries

Sometimes, but in the above-described fashion.