I realized that the `FullGridCellList` (at least with the `Vector{Vector}` backend) is algorithmically almost identical to the method of CellListMap.jl. However, CellListMap.jl is slightly faster, both on a single thread and on multiple threads (when modified to use Polyester.jl for threading).
The main difference is that @lmiq stores the coordinates together with the point indices in the cell lists. This avoids scattered accesses into the big coordinate array to get the coordinates of each neighbor.
I implemented a similar data structure and made it configurable, as our goal is to have a playground to try out methods.
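To illustrate the idea, here is a minimal sketch of the two cell layouts. It is not the code in this PR: the struct name `IndexedPoint` and both helper functions are hypothetical, and it assumes 2D points stored as columns of a coordinate matrix.

```julia
# Hypothetical sketch: a cell entry that carries its coordinates along with its index.
struct IndexedPoint{N, T}
    index::Int32
    coords::NTuple{N, T}
end

# Index-only cell: every neighbor needs a lookup in the big coordinate array,
# and those lookups jump around in memory because `cell` is not sorted.
function sum_sq_dist_indices(cell, coords, x)
    s = 0.0
    for j in cell
        dx = coords[1, j] - x[1]
        dy = coords[2, j] - x[2]
        s += dx^2 + dy^2
    end
    return s
end

# Cell with coordinates: the loop only reads the contiguous cell vector.
function sum_sq_dist_points(cell, x)
    s = 0.0
    for p in cell
        dx = p.coords[1] - x[1]
        dy = p.coords[2] - x[2]
        s += dx^2 + dy^2
    end
    return s
end

coords = [0.0 1.0 2.0; 0.0 1.0 2.0]  # 2x3 matrix, one point per column
cell_indices = Int32[3, 1]           # points contained in one cell
cell_points = [IndexedPoint(j, (coords[1, j], coords[2, j])) for j in cell_indices]

x = (0.0, 0.0)
@assert sum_sq_dist_indices(cell_indices, coords, x) == sum_sq_dist_points(cell_points, x)
```

The trade-off in this sketch is that the coordinates are duplicated in the cells, so they use more memory and have to be rewritten whenever the points move.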
We now get very similar performance to CellListMap.jl.
Here is a plot showing the speedup against CellListMap.jl on a single thread (Threadripper 3990X):
On 128 threads, we're still slightly slower:
Here is a plot showing the speedup from using `PointWithCoordinates` on different architectures.
We see the largest speedups (14-15% for a WCSPH interaction on 128 threads!) on the CPU. The Nvidia H100 also benefits from this data structure, while the RTX 3090 only gets 0.5-1% faster. For some reason, the AMD Instinct MI210 doesn't like this data structure at all and runs 2x slower.