omlins / ParallelStencil.jl

Package for writing high-level code for parallel high-performance stencil computations that can be deployed on both GPUs and CPUs
BSD 3-Clause "New" or "Revised" License

Optimal Data Layout #13

Closed LaurentPlagne closed 3 years ago

LaurentPlagne commented 3 years ago

Congrats and many thanks for your super impressive and useful package!

I have a few questions:

Best,

Laurent

luraess commented 3 years ago

Hi Laurent,

> Congrats and many thanks for your super impressive and useful package!

Thanks for your enthusiastic feedback 😄

Regarding your questions:

> Are there docs, videos, or forum threads explaining the internal design choices?

Not yet. However, some early design stages were presented at JuliaCon 2019, and the package's main current features were presented at JuliaCon 2020.

> [...] about the internal data layout of fields: are they organized to adapt to the cache hierarchy and SIMD width of computing targets? Do you think it would make sense?

Currently, ParallelStencil.jl relies on the standard Julia Array and CuArray data layouts. We plan to implement several advanced optimisations, some of which will also involve changes to the data layout.
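
For readers less familiar with what the "standard" layout implies: a Julia Array is stored column-major, so the first index is contiguous in memory (CUDA.jl's CuArray follows the same index-to-memory convention). The snippet below is only a minimal sketch in plain Base Julia. It uses no ParallelStencil.jl or CUDA.jl calls, and the names `linear_index` and `sum_inner_first` are hypothetical helpers introduced for illustration.

```julia
# Plain Base Julia only; nothing here is ParallelStencil.jl API.
A = zeros(Float64, 4, 3, 2)   # a small 3-D field with nx = 4, ny = 3, nz = 2

# Column-major layout: the stride along the first index is 1 element.
@show strides(A)

# Hypothetical helper: linear (memory) index of A[ix, iy, iz] for a
# column-major array of size nx x ny x nz.
linear_index(ix, iy, iz, nx, ny) = ix + (iy - 1) * nx + (iz - 1) * nx * ny

# Hypothetical helper: a traversal that matches the layout. In a
# multi-range `for`, the LAST range is the innermost loop, so `ix`
# (the contiguous index) varies fastest here.
function sum_inner_first(A)
    s = zero(eltype(A))
    for iz in axes(A, 3), iy in axes(A, 2), ix in axes(A, 1)
        s += A[ix, iy, iz]
    end
    return s
end
```

In practice this means a stencil kernel should let neighbouring threads or inner-loop iterations walk along the first index, which is the access pattern the default layout favours.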

> The README states that time blocking should not be interesting for real applications. Could you elaborate? I thought that small blocks with halo may perform several time-steps before communication and may help to reduce the memory bandwidth pressure. Do you think it would make sense?

We are not aware of time blocking implementations in complex applications that deliver a significant speedup. That does not rule out some benefit from time blocking if you are tuning a specific application to its limits.
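
To make the idea under discussion concrete, here is a minimal 1-D sketch of time blocking with halos, assuming an explicit diffusion update with fixed boundary values. It is not how ParallelStencil.jl works internally. The function names (`diffuse_step`, `global_steps`, `blocked_steps`) and the parameters `c`, `k`, and `blocksize` are hypothetical, chosen only for this illustration.

```julia
# One explicit diffusion step; the first and last cells are kept fixed.
function diffuse_step(u, c)
    v = copy(u)
    for i in 2:length(u)-1
        v[i] = u[i] + c * (u[i-1] - 2 * u[i] + u[i+1])
    end
    return v
end

# Reference: k steps, each sweeping the whole array.
function global_steps(u, c, k)
    for _ in 1:k
        u = diffuse_step(u, c)
    end
    return u
end

# Time-blocked variant: each block is read once together with a halo of
# width k, advanced k steps locally, and only its core cells are kept.
# Halo cells become stale one layer per step, so after k steps the
# staleness has just reached, but not entered, the core.
function blocked_steps(u, c, k, blocksize)
    n = length(u)
    out = copy(u)
    for lo in 1:blocksize:n
        hi  = min(lo + blocksize - 1, n)
        hlo = max(lo - k, 1)      # halo clipped at the domain boundary
        hhi = min(hi + k, n)
        w = u[hlo:hhi]            # local working copy (block + halo)
        for _ in 1:k
            w = diffuse_step(w, c)
        end
        out[lo:hi] = w[(lo - hlo + 1):(hi - hlo + 1)]
    end
    return out
end
```

The blocked version reads each cell from main memory once per `k` steps instead of once per step, at the cost of redundantly recomputing the halo cells. Whether that trade pays off depends on the stencil width, the number of fields, and the hardware, which is consistent with the caution expressed above for complex applications.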

We are closing this issue for offline discussion.