Efficient cache-aware and cache-oblivious implementations

smartalecH commented 1 year ago

There's a great example, diffusion2D_shmem_novis.jl, which shows how to leverage the shared memory of a GPU (thereby taking advantage of spatial locality common with so many stencil codes).

It would be great if we could abstract this slightly such that we could implement cache-aware algorithms on the CPU using the same kernel code as the GPU. In essence, one divides the domain into chunks (or blocks, on a GPU) and then loops over these chunks assigning workers as needed. Unfortunately, one doesn't have direct access to the L1 cache within a CPU (otherwise it wouldn't be called a cache). However it's well known that by tiling the arrays such that they fit into cache, the compiler will keep the data in cache, allowing you to leverage spatial locality (as is often done with CPU-based tiled matrix multiplication).

Currently, the CPU loops over individual indices... so even if someone manually chunked the domain, there would be quite a lot of cache contention. Instead, it would be better to loop over tiles in this case. (there's a bit of extra working ensuring that the same kernel code could be used on a CPU and a GPU because of the extra worker hierarchy within GPUs).

Even better would be an abstract interface to a cache-oblivious implementation that "just works" on CPUs and GPUs (i.e. a recursive technique could be used to automatically detect cache sizes). Cache-oblivious algorithms are great for a number of reasons besides portability. Namely, they tend to leverage all levels of the cache (L2, L3, etc) better than cache-aware algorithms.

I realize this is a somewhat grandiose vision... I'm just throwing it out there for people to toy with 🙂

omlins commented 1 year ago

Thanks for your thoughts. We will address cpu optimisations in the LV backend.

omlins commented 4 weeks ago

We have improved CPU performance using Polyester

omlins / ParallelStencil.jl

Efficient cache-aware and cache-oblivious implementations #79