m3g / CellListMap.jl

Flexible implementation of cell lists to map the calculations of particle-pair dependent functions, such as forces, energies, neighbor lists, etc.
https://m3g.github.io/CellListMap.jl/
MIT License

[Proof of Concept] `Polyester.@batch` for threading #104

Open efaulhaber opened 2 months ago

efaulhaber commented 2 months ago

Have you considered using Polyester.jl for threading instead of Threads? I always use the code in this PR when I want to do fair benchmarks of my implementations against CellListMap.jl. I get consistently better performance with this PR over main.
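
The change is essentially of this shape (a minimal sketch with made-up names, not the actual diff of this PR): the tight loop over candidate pairs is threaded with Polyester.@batch where Threads.@threads was used before.

using Polyester      # provides @batch
using StaticArrays

# Minimal sketch with made-up names (not the actual CellListMap internals):
# thread the tight pair loop with Polyester.@batch instead of Threads.@threads.
# One result per pair, so there is no data race.
function pair_energies!(energies, pairs, positions)
    @batch for k in eachindex(pairs)   # was: Threads.@threads for k in eachindex(pairs)
        i, j = pairs[k]
        d2 = sum(abs2, positions[i] - positions[j])
        energies[k] = inv(d2)          # toy pair function, for illustration only
    end
    return energies
end

# usage sketch
positions = [rand(SVector{3,Float64}) for _ in 1:100]
pairs = [(i, j) for i in 1:100 for j in (i + 1):100]
energies = zeros(length(pairs))
pair_energies!(energies, pairs, positions)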

Here is a very cheap benchmark that runs one time step of an n-body simulation. This is the code: https://github.com/trixi-framework/PointNeighbors.jl/blob/main/benchmarks/n_body.jl#L24-L33

This is on an AMD Ryzen Threadripper 3990X using 128 threads:

[Benchmark plot: this PR vs. main on the n-body benchmark]

This is a speedup of at least 1.6x across all problem sizes, and of 4.5x for the largest one.

Here is a real-life benchmark, running one particle interaction loop of the Weakly Compressible SPH method:

[Benchmark plot: Weakly Compressible SPH particle interaction loop, this PR vs. main]

This is a speedup of 2x for most problem sizes.

lmiq commented 2 months ago

I've experimented with Polyester a few times and, indeed, if the mapped loop is tight it is faster. However, my understanding is that Polyester does not compose well if the function is called from within other threaded code. I think it will just fill the threads and compete for the resources. That's why I never stuck with it.
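
To illustrate the concern (a hypothetical sketch, not code from CellListMap): if a kernel threaded with @batch is called from a loop that the user already threads with Threads.@threads, both levels end up competing for the same hardware threads instead of composing through Julia's task scheduler.

using Polyester
using Base.Threads: @threads

# Hypothetical illustration of the nesting pattern (not CellListMap code):
# the inner kernel is threaded with Polyester.@batch ...
function inner_kernel!(out, data)
    @batch for i in eachindex(data)
        out[i] = sin(data[i])   # stand-in for tight per-element work
    end
    return out
end

# ... and the user calls it from a loop that is itself threaded with Threads.@threads,
# so the two levels of threading compete for the same cores.
function outer_threaded!(results, chunks)
    @threads for c in eachindex(chunks)
        inner_kernel!(results[c], chunks[c])
    end
    return results
end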

Concerning the benchmark: is that using CellListMap? That was not clear to me.

One thing to note is that the way the inner loop of CellListMap's map_pairwise is implemented, which is very general with respect to the type of function provided by the user, certainly cannot be optimal for most specific cases. I have previously thought of exposing the inner loop as part of the interface, so that the user could implement it at will, possibly taking advantage of vectorization if the function being mapped allows it. But for my applications that was not a priority, so it stayed as it is.
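
Just to illustrate that idea (no such interface exists today; all names below are hypothetical): the user would receive two whole neighboring cells at a time and implement the pair loop themselves, leaving room for @inbounds, @simd, or manual vectorization.

# Hypothetical sketch only; this interface does not exist in CellListMap.
# The user supplies `cell_kernel!`, which sees the positions of two whole
# neighboring cells at once and implements the pair loop itself.
function map_cell_pairs!(cell_kernel!, output, neighboring_cell_pairs)
    for (cellA, cellB) in neighboring_cell_pairs
        output = cell_kernel!(output, cellA, cellB)
    end
    return output
end

# A user-side kernel could then annotate its own loop as it sees fit:
function my_kernel!(out, cellA, cellB)
    @inbounds for i in eachindex(cellA), j in eachindex(cellB)
        out += inv(sum(abs2, cellA[i] - cellB[j]))
    end
    return out
end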

efaulhaber commented 2 months ago

> Concerning the benchmark: is that using CellListMap? That was not clear to me.

Yes, foreach_point_neighbor in the code I linked above just forwards to map_pairwise! in this case.

lmiq commented 2 months ago

Now that I think about it, probably what @batch is doing there is adding @inbounds and @fastmath. I don't think the performance difference is associated with the threading per se.
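
One way to check that hypothesis would be to put @inbounds @fastmath on the serial loop directly and compare, something like this (the kernel below is just a stand-in, not the actual inner loop of map_pairwise):

using BenchmarkTools

# Stand-in kernels to isolate the effect of @inbounds/@fastmath on a tight serial loop.
function serial_sum(x)
    s = 0.0
    for i in eachindex(x)
        s += inv(x[i])
    end
    return s
end

function serial_sum_fast(x)
    s = 0.0
    @inbounds @fastmath for i in eachindex(x)
        s += inv(x[i])
    end
    return s
end

x = rand(10^6) .+ 1.0
@btime serial_sum($x)
@btime serial_sum_fast($x)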

efaulhaber commented 2 months ago

But wouldn't we see a similar speedup on a single thread then?

main:

CellListMap.jl with 10x10x10 = 1000 particles finished in 1.900 ms
CellListMap.jl with 16x16x16 = 4096 particles finished in 9.661 ms
CellListMap.jl with 25x25x25 = 15625 particles finished in 41.616 ms
CellListMap.jl with 40x40x40 = 64000 particles finished in 184.198 ms
CellListMap.jl with 63x63x63 = 250047 particles finished in 757.062 ms
CellListMap.jl with 101x101x101 = 1030301 particles finished in 3.271 s
CellListMap.jl with 160x160x160 = 4096000 particles finished in 13.483 s

Polyester.jl:

CellListMap.jl with 10x10x10 = 1000 particles finished in 1.894 ms
CellListMap.jl with 16x16x16 = 4096 particles finished in 9.653 ms
CellListMap.jl with 25x25x25 = 15625 particles finished in 41.475 ms
CellListMap.jl with 40x40x40 = 64000 particles finished in 183.926 ms
CellListMap.jl with 63x63x63 = 250047 particles finished in 754.712 ms
CellListMap.jl with 101x101x101 = 1030301 particles finished in 3.264 s
CellListMap.jl with 160x160x160 = 4096000 particles finished in 13.458 s

lmiq commented 2 months ago

I don't see the difference in performance you report here, really. For example, I'm running the examples from this page: https://m3g.github.io/CellListMap.jl/stable/LowLevel/#Histogram-of-distances

The forces example, which is probably the most similar to what you are computing in your tests, results in:

Current stable CellListMap 0.9.4:

julia> @btime map_pairwise!(
           $((x,y,i,j,d2,forces) -> calc_forces!(x,y,i,j,d2,mass,forces)),
           $forces,$box,$cl
       );
  34.576 ms (76 allocations: 13.74 MiB)

and with your PR:

julia> @btime map_pairwise!(
           $((x,y,i,j,d2,forces) -> calc_forces!(x,y,i,j,d2,mass,forces)),
           $forces,$box,$cl
       );
  34.173 ms (82 allocations: 13.75 MiB)

In other examples there might be a slight improvement in performance with @batch, but it is very minor. Maybe this is very dependent on the CPU used?

Could you test one of these examples on your machine?
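
For reference, the setup of the forces example is roughly the following (adapted from that documentation page; see the docs for the exact definitions of calc_forces!, mass, box, and cl):

using CellListMap, StaticArrays, BenchmarkTools

# Approximate setup of the forces example from the low-level docs (illustrative).
N = 100_000
cutoff = 10.0
sides = [250.0, 250.0, 250.0]
box = Box(sides, cutoff)
x = [250.0 * rand(SVector{3,Float64}) for _ in 1:N]
cl = CellList(x, box)
mass = rand(N)
forces = [zero(SVector{3,Float64}) for _ in 1:N]

# Gravitational-like pair force, updating the force on both particles of the pair
function calc_forces!(x, y, i, j, d2, mass, forces)
    G = 9.8 * mass[i] * mass[j] / d2
    d = sqrt(d2)
    df = (G / d) * (y - x)
    forces[i] += df
    forces[j] -= df
    return forces
end

map_pairwise!(
    (x, y, i, j, d2, forces) -> calc_forces!(x, y, i, j, d2, mass, forces),
    forces, box, cl
)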

lmiq commented 1 month ago

Off-topic: can you please send me the link to your JuliaCon talk about this when it is available? Thanks.

efaulhaber commented 1 month ago

https://www.youtube.com/live/V7FWl4YumcA?si=XIKhTj8dfmon0rwk&t=4669

But we only quickly mentioned the neighborhood search.

efaulhaber commented 1 month ago

Back to the topic: I was confused as well, since Polyester.@batch is actually slower in our code for large problems. So I repeated the benchmark with Threads.@threads and Threads.@threads :static and found this:

[Benchmark plot: Polyester.@batch vs. Threads.@threads vs. Threads.@threads :static]

So Threads.@threads :static is just as fast as Polyester.@batch, except for very small problems, exactly as I would expect. Interestingly, slapping @threads :static onto the serial loop is significantly faster than the nbatches approach in main. This is a 2x speedup for 1M particles and a 4.7x speedup for 65M particles in the cheap n-body benchmark.

But as you already noticed, the difference becomes smaller with more neighbors. In this example, I have a search radius of 3x the particle spacing, which gives me ~100 neighbors with ~3 particles per cell. When I increase the search radius to 10x the particle spacing, the speedup for 1M particles goes down from 2x to 1.11x.
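
The @threads :static variant is essentially of this shape (a sketch with made-up names, not the actual CellListMap code):

using Base.Threads: @threads

# Sketch with made-up names: the otherwise serial loop over cells, with
# Threads.@threads :static slapped onto it. The :static scheduler pins
# iterations to threads, similar to Polyester.@batch's static scheduling.
function process_cells!(results, cells)
    @threads :static for icell in eachindex(cells)
        results[icell] = sum(abs2, cells[icell])   # stand-in for the per-cell pair loop
    end
    return results
end

cells = [rand(3) for _ in 1:10_000]
results = zeros(length(cells))
process_cells!(results, cells)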

lmiq commented 1 month ago

Well, I'll have to investigate what's going on. The issue is that I don't see these performance differences at all on the hardware I have, with my test functions. For instance, this is what I get with the Polyester version:

julia> using CellListMap

julia> x, box = CellListMap.xatomic(10^4);

julia> sys = ParticleSystem(positions=x, unitcell=box.input_unit_cell.matrix, output=0.0, cutoff=12.0);

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> out += inv(d2)), $sys)
  18.444 ms (87 allocations: 16.14 KiB)
75140.7942005164

With the current version:

julia> using CellListMap

julia> x, box = CellListMap.xatomic(10^4);

julia> sys = ParticleSystem(positions=x, unitcell=box.input_unit_cell.matrix, output=0.0, cutoff=12.0);

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> out += inv(d2)), $sys)
  18.692 ms (87 allocations: 16.14 KiB)
77202.5687995922

efaulhaber commented 1 month ago

This is not using the map_pairwise! method that takes a CellListPair, right? That is the only definition I changed.

lmiq commented 1 month ago

Oh, right... sorry. But in that case I get the opposite result: the current version is faster.

With Polyester:

julia> x, box = CellListMap.xatomic(10^4);

julia> y, _ = CellListMap.xatomic(10^4);

julia> sys = ParticleSystem(xpositions=x, ypositions=y, unitcell=2*box.input_unit_cell.matrix, output=zeros(length(x)), cutoff=12.0);

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); return out end), $sys);
  46.463 ms (43 allocations: 7.25 KiB)

julia> Threads.nthreads()
6

With the current version:

julia> x, box = CellListMap.xatomic(10^4);

julia> y, _ = CellListMap.xatomic(10^4);

julia> sys = ParticleSystem(xpositions=x, ypositions=y, unitcell=2*box.input_unit_cell.matrix, output=zeros(length(x)), cutoff=12.0);

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); return out end), $sys);
  23.328 ms (93 allocations: 16.33 KiB)

So I really will have to dig further into your example to understand what's going on.

efaulhaber commented 1 month ago

Your benchmark on a Threadripper 3990X:

map_pairwise_parallel!:

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); return out end), $sys);
  8.930 ms (286 allocations: 54.41 KiB)

map_pairwise_serial! with @batch:

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); return out end), $sys);
  6.020 ms (53 allocations: 8.59 KiB)

map_pairwise_serial! with Threads.@threads :static:

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); return out end), $sys);
  3.438 ms (698 allocations: 132.58 KiB)

lmiq commented 1 month ago

Well, definitely not what I see here. This is on another machine (a fairly standard one):

With @batch:

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); out end), $sys); 
 63.056 ms (1 allocation: 448 bytes)

With the current stable version:

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); out end), $sys);
  21.881 ms (117 allocations: 19.95 KiB)

With:

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD EPYC 7282 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
Threads: 8 default, 0 interactive, 4 GC (on 16 virtual cores)

(With only @threads :static in the serial loop, the result is probably wrong, unless you did something to accumulate the result per thread and reduce it later.)

efaulhaber commented 1 month ago

> (With only @threads :static in the serial loop, the result is probably wrong, unless you did something to accumulate the result per thread and reduce it later.)

I skipped the reduce in all versions and only returned 0.
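
A proper per-thread accumulation with a final reduction would look something like this generic sketch (made-up names, not the code used in these benchmarks):

using Base.Threads: @threads, nthreads, threadid

# Generic accumulate-per-thread-then-reduce pattern (illustrative only).
# With the :static scheduler, iterations do not migrate between threads,
# so indexing the buffer by threadid() is safe.
function threaded_sum(x)
    partial = zeros(nthreads())
    @threads :static for i in eachindex(x)
        partial[threadid()] += inv(x[i])
    end
    return sum(partial)   # the reduction step that was skipped in the quick benchmark
end

threaded_sum(rand(10^6) .+ 1.0)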