omlins / ParallelStencil.jl

Package for writing high-level code for parallel high-performance stencil computations that can be deployed on both GPUs and CPUs
BSD 3-Clause "New" or "Revised" License

Improving loop kernel speed #77

Closed: smartalecH closed this issue 1 year ago

smartalecH commented 1 year ago

I have two versions of a relatively simple 2D finite-difference code, one written in C++, the other using ParallelStencil. The C++ code is consistently 6x-8x faster than the Julia implementation (on a single thread of my M1 CPU). I'm using "cells/second" as my benchmarking metric. It's slightly different from the throughput metric (T) used in the PS publications, but it is proportional to the same quantity and a little more meaningful in this context.
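For context, a minimal sketch of the "cells/second" metric as described above: total cell updates divided by wall-clock time (the names `nx`, `ny`, `nt`, and `elapsed_seconds` are placeholders, not from either code):

```cpp
#include <cstddef>

// Sketch: "cells/second" = (grid cells * time steps) / wall-clock seconds.
// Proportional to throughput, since each cell update touches a fixed
// number of bytes for a given stencil.
double cells_per_second(std::size_t nx, std::size_t ny, std::size_t nt,
                        double elapsed_seconds) {
    return static_cast<double>(nx * ny * nt) / elapsed_seconds;
}
```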

I have two theories as to why the Julia code is slower, but I'd like to get some feedback.

  1. Branching. My C++ implementation has a pretty tight loop with no branching whatsoever; all of the branching is outside the loop (at the cost of some significant code repetition). In my PS implementation, I'm using a @parallel_indices kernel that does have an if-statement to ensure the loop indices don't step out of bounds (I'm using a staggered grid, and not all arrays have the same size). I did some preliminary work to refactor the edges of each loop so that I could factor out the branching, but I haven't seen any significant speed improvement. Are there any other branches hidden within the macros that I may not be aware of?
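For reference, the branch-hoisting idea described above can be sketched in C++ as follows: split the iteration space into a branch-free interior loop plus separate edge loops, so the hot loop carries no per-iteration bounds check (names like `update`, `nx`, `ny` and the 5-point averaging stencil are illustrative, not taken from either code):

```cpp
#include <vector>
#include <cstddef>

// Sketch: interior of a 2D grid updated with no branching; boundary
// cells handled in separate loops (here simply copied as a placeholder).
void update(std::vector<double>& a, const std::vector<double>& b,
            std::size_t nx, std::size_t ny) {
    // Interior: tight, branch-free, vectorizable.
    for (std::size_t j = 1; j + 1 < ny; ++j) {
        for (std::size_t i = 1; i + 1 < nx; ++i) {
            a[j * nx + i] = 0.25 * (b[j * nx + i - 1] + b[j * nx + i + 1] +
                                    b[(j - 1) * nx + i] + b[(j + 1) * nx + i]);
        }
    }
    // Top and bottom rows.
    for (std::size_t i = 0; i < nx; ++i) {
        a[i] = b[i];
        a[(ny - 1) * nx + i] = b[(ny - 1) * nx + i];
    }
    // Left and right columns.
    for (std::size_t j = 0; j < ny; ++j) {
        a[j * nx] = b[j * nx];
        a[j * nx + nx - 1] = b[j * nx + nx - 1];
    }
}
```

The cost is the code repetition mentioned above: each edge case gets its own loop instead of one guarded loop over the full grid.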

  2. SIMD support. My C++ variant has explicit SIMD optimizations using some compiler macros. Does PS help the compiler out in this regard? I realize this is tricky with the branching mentioned above... but before I arduously refactor all my code, I want to make sure PS is even SIMD-aware.
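As a point of comparison, this is the kind of explicit SIMD hint being referred to, sketched here with OpenMP's portable `#pragma omp simd` (the actual compiler macros in the C++ code may differ):

```cpp
#include <vector>
#include <cstddef>

// Sketch: a simple fused multiply-add update with an explicit SIMD hint.
// With OpenMP enabled (-fopenmp), the pragma tells the compiler the loop
// is safe to vectorize; without OpenMP it is ignored and the code is
// still correct.
void axpy_update(std::vector<double>& y, const std::vector<double>& x,
                 double a) {
    const std::size_t n = y.size();
#pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```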

Thanks for your help!

omlins commented 1 year ago

The PS CPU implementation is currently missing a lot of optimizations for small problems, e.g., SIMD optimizations. As noted in an earlier message to you, a LoopVectorization.jl backend is on the roadmap to tackle this. So for now I would simply not worry about the performance of the CPU implementation for small problems until we have this new backend.

The current PS CPU implementation does quite well for large problems. If you increase the problem size in both of your codes, then at some point the performance should in general become similar.

smartalecH commented 1 year ago

Got it! As always, thanks for your help, @omlins!