There's a small speedup for small problems on a single thread of my laptop (Apple M2 Pro):
julia> plot_benchmarks(40, 2)
Original Code
with 6400 points finished in 2.421 ms
GridNeighborhoodSearch
with 6400 points finished in 7.659 ms
NeighborListsNeighborhoodSearch
with 6400 points finished in 2.095 ms
NeighborListsNHS contiguous
with 6400 points finished in 2.126 ms
Original Code
with 23814 points finished in 10.803 ms
GridNeighborhoodSearch
with 23814 points finished in 35.649 ms
NeighborListsNeighborhoodSearch
with 23814 points finished in 9.305 ms
NeighborListsNHS contiguous
with 23814 points finished in 9.486 ms
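For reference, here is the speedup computed from the timings above. It is the ratio of the original runtime to the `NeighborListsNeighborhoodSearch` runtime:

$$
\begin{aligned}
\text{6400 points:}\quad & \frac{2.421\,\text{ms}}{2.095\,\text{ms}} \approx 1.16 \\
\text{23814 points:}\quad & \frac{10.803\,\text{ms}}{9.305\,\text{ms}} \approx 1.16
\end{aligned}
$$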
For larger problems, there is a 2x speedup with the contiguous variant on 64 threads of a Threadripper 3990X (at 379215 points; there is no speedup at 100000 points):
julia> plot_benchmarks(100, 2)
Original Code
with 100000 points finished in 2.465 ms
GridNeighborhoodSearch
with 100000 points finished in 6.597 ms
NeighborListsNeighborhoodSearch
with 100000 points finished in 2.536 ms
NeighborListsNHS contiguous
with 100000 points finished in 2.560 ms
Original Code
with 379215 points finished in 21.291 ms
GridNeighborhoodSearch
with 379215 points finished in 24.278 ms
NeighborListsNeighborhoodSearch
with 379215 points finished in 12.852 ms
NeighborListsNHS contiguous
with 379215 points finished in 9.820 ms
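For reference, here are the speedups on 64 threads, computed from the timings above as original runtime divided by new runtime:

$$
\begin{aligned}
\text{379215 points, contiguous:}\quad & 21.291 / 9.820 \approx 2.17 \\
\text{379215 points, non-contiguous:}\quad & 21.291 / 12.852 \approx 1.66 \\
\text{100000 points, contiguous:}\quad & 2.465 / 2.560 \approx 0.96
\end{aligned}
$$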
No speedup on a single thread; the neighbor-list variants are even slightly slower (maybe the multithreaded speedup is due to more cache per thread?):
julia> plot_benchmarks(100, 2)
Original Code
with 100000 points finished in 73.701 ms
GridNeighborhoodSearch
with 100000 points finished in 230.978 ms
NeighborListsNeighborhoodSearch
with 100000 points finished in 90.497 ms
NeighborListsNHS contiguous
with 100000 points finished in 81.854 ms
Original Code
with 379215 points finished in 299.921 ms
GridNeighborhoodSearch
with 379215 points finished in 993.680 ms
NeighborListsNeighborhoodSearch
with 379215 points finished in 355.863 ms
NeighborListsNHS contiguous
with 379215 points finished in 328.210 ms
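For reference, here are the single-thread ratios of original runtime to new runtime, computed from the timings above. Values below 1 mean the neighbor-list variants are slower than the original code:

$$
\begin{aligned}
\text{100000 points, contiguous:}\quad & 73.701 / 81.854 \approx 0.90 \\
\text{100000 points, non-contiguous:}\quad & 73.701 / 90.497 \approx 0.81 \\
\text{379215 points, contiguous:}\quad & 299.921 / 328.210 \approx 0.91 \\
\text{379215 points, non-contiguous:}\quad & 299.921 / 355.863 \approx 0.84
\end{aligned}
$$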
Based on https://github.com/trixi-framework/PointNeighbors.jl/pull/10.