kaipartmann / Peridynamics.jl

A Julia package for parallel peridynamics simulations
https://kaipartmann.github.io/Peridynamics.jl/
MIT License

HPC suitability #14

Open kaipartmann opened 11 months ago

kaipartmann commented 11 months ago

Current approach

The bottleneck of Peridynamics simulations lies in the computation of the force density b_int (see the compute_forcedensity! methods). In the serial case, this computation looks similar to the following code:

for (i,j) in bonds
    results = some_calculation()
    b_int[:,i] += results # contribution for point i
    b_int[:,j] -= results # contribution for point j
end

Since the computation writes into columns i and j of the b_int matrix, it cannot easily be parallelized as is. Therefore, the force density array is extended by one dimension, and each thread operates on its own 2D slice. Finally, all results are efficiently summed into the first slice along the third dimension. Currently, this and other computations are implemented with Threads.@threads, where each thread operates on its own partition of the corresponding vector.

#--- main computation
@threads for tid in 1:nthreads()
    for bond_id in partitioned_bonds[tid] 
        i, j = bonds[bond_id]
        results = some_calculation()
        b_int[:,i,tid] += results # <-- tid = third dimension
        b_int[:,j,tid] -= results # <-- tid = third dimension
    end
end

#--- reduction into first tid
# single_tids::Vector{Tuple{Int,Int}} 
# --> vector containing tuples with (<point id>, <thread id>) 
#     for all points where the computations occur on only one thread (except 1)
@threads for (point_id, tid) in single_tids
    b_int[:,point_id,1] = b_int[:,point_id,tid]
end
# multi_tids::Vector{Tuple{Int,Vector{Int}}}
# --> vector containing tuples with (<point id>, <vector of all thread id's except 1>)
#     for all points where the computations occur on multiple threads
@threads for (point_id, tids) in multi_tids
    for tid in tids
        b_int[:,point_id,1] += b_int[:,point_id,tid]
    end
end

#---
# usage of b_int[:,:,1] ...
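
For reference, here is a minimal sketch of how the two reduction lists could be assembled from partitioned_bonds; the helper reduction_lists and the variable n_points are hypothetical and not the package's actual implementation:

# build single_tids and multi_tids from the bond partition
function reduction_lists(bonds, partitioned_bonds, n_points)
    tids_per_point = [Set{Int}() for _ in 1:n_points]
    for (tid, bond_ids) in enumerate(partitioned_bonds)
        for bond_id in bond_ids
            i, j = bonds[bond_id]
            push!(tids_per_point[i], tid)
            push!(tids_per_point[j], tid)
        end
    end
    single_tids = Tuple{Int,Int}[]
    multi_tids = Tuple{Int,Vector{Int}}[]
    for point_id in 1:n_points
        tids = sort!(collect(tids_per_point[point_id]))
        if length(tids) == 1 && tids[1] != 1
            push!(single_tids, (point_id, tids[1]))            # only one thread, and it is not thread 1
        elseif length(tids) > 1
            push!(multi_tids, (point_id, filter(!=(1), tids))) # all thread ids except 1
        end
    end
    return single_tids, multi_tids
end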

Benchmark

However, the scaling for more than ten threads is poor, even with the improvements to the summation (see https://github.com/kaipartmann/Peridynamics.jl/pull/9#issue-1463421762).

| nthreads | btime [s] | speedup |
|---------:|----------:|--------:|
| 1        | 293.12    | 1.00    |
| 2        | 150.98    | 1.94    |
| 8        | 46.61     | 6.29    |
| 12       | 38.60     | 7.59    |
| 16       | 35.93     | 8.16    |
| 24       | 37.95     | 7.72    |
| 32       | 49.70     | 5.90    |

Roadmap

The main goal is to make the package suitable for larger HPC simulations. The significant RAM requirements of simulations with ContinuumBasedMaterial are also a limiting factor for the multithreading approach. Therefore, an effective solution could be a distributed approach with MPI.jl combined with multithreading (similar to Trixi.jl). Alternatively, other options such as Dagger.jl look very promising.
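
As a rough illustration of that direction (not an implementation plan), a distributed version could partition the bonds across MPI ranks and then reduce the contributions of shared points between ranks. The naive MPI.jl sketch below assumes a hypothetical problem size npoints and simply sums the full b_int array across all ranks; a real implementation would only exchange ghost-point data.

using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)     # rank and nranks could be used to partition the bonds
nranks = MPI.Comm_size(comm)

npoints = 1_000                # hypothetical number of points
b_int = zeros(3, npoints)      # local force density contributions of this rank

# ... compute contributions only for the bonds owned by this rank,
#     writing into b_int (possibly multithreaded within the rank) ...

# naive global reduction: element-wise sum of the contributions of all ranks
MPI.Allreduce!(b_int, +, comm)

MPI.Finalize()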

xwuupb commented 10 months ago

IMHO, it is important to determine the arithmetic intensity of this part of the operations, or at least whether it is compute-bound or memory-bound. This matters because compute-bound computations normally scale with the number of threads (e.g., matrix-matrix multiplication), whereas memory-bound computations usually do not, because of the limited memory bandwidth. A typical example is matrix-vector multiplication.
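
For a back-of-the-envelope check, the roofline model can be used: the arithmetic intensity is the number of flops per byte moved, and the attainable performance is min(peak flops, AI × bandwidth). Below is a minimal sketch with made-up per-bond and hardware numbers (all values are assumptions, not measurements of Peridynamics.jl):

# rough roofline estimate -- replace the assumed values with measured ones
flops_per_bond = 50                  # assumed floating-point operations per bond
bytes_per_bond = 200                 # assumed bytes read + written per bond
ai = flops_per_bond / bytes_per_bond # arithmetic intensity [flop/byte]

peak_flops = 1.0e12                  # assumed peak compute of the node [flop/s]
bandwidth  = 2.0e11                  # assumed memory bandwidth [byte/s]

# if ai * bandwidth < peak_flops, the kernel is limited by memory bandwidth
attainable = min(peak_flops, ai * bandwidth)
println("AI = $ai flop/byte, attainable ≈ $(attainable / 1e9) Gflop/s")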

kaipartmann commented 10 months ago

Thank you @xwuupb for your valuable insights regarding the arithmetic intensity of the time-consuming portions of the package. I apologize for the delay in my response; it took some time to thoroughly investigate and conduct an in-depth analysis and code profiling to address your concerns.

@xwuupb, I would greatly value your feedback on the benchmark results. If you have any further observations, recommendations, or questions regarding the presented findings, please do not hesitate to share them.

Benchmarks for Peridynamics.jl@v0.2.0

TL;DR

Memory-Bound Operations

Profiling of the most computationally intensive functions, notably the compute_forcedensity! functions, suggests that these operations are memory-bound. This limitation arises from the extensive reading and writing activities within the computation process. More details about these functions can be found in the source code, e.g.: https://github.com/kaipartmann/Peridynamics.jl/blob/5632d539d36053bd25680bedb84067e45e7fa3b5/src/bond_based.jl#L189-L212
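
As an additional sanity check for the memory-bound hypothesis, one could compare the effective memory throughput of the kernel against a simple STREAM-triad-like upper bound. The following sketch is hypothetical (the function triad! and the array size n are not part of the package):

using Base.Threads

function triad!(a, b, c, s)
    @threads for i in eachindex(a, b, c)
        @inbounds a[i] = b[i] + s * c[i]
    end
    return a
end

n = 50_000_000                        # assumed array size, well beyond cache
a, b, c = zeros(n), zeros(n), zeros(n)
triad!(a, b, c, 2.0)                  # warm-up and compilation
t = @elapsed triad!(a, b, c, 2.0)
gb_moved = 3 * n * sizeof(Float64) / 1e9   # 2 reads + 1 write per element
println("triad bandwidth ≈ $(round(gb_moved / t; digits=1)) GB/s")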

Heterogeneous Performance

One noteworthy observation is that the performance of the package varies significantly across different features and simulation types. Each feature runs distinct code segments, resulting in divergent performance characteristics. Therefore, I have designed five benchmarks that cover all features of the package (see the results below).

The benchmarks and the results can be found in the v0.2.0-benchmark branch.

Key Observations

Results

bond-based, velocity verlet (bbvv)

wall time ![](https://github.com/kaipartmann/Peridynamics.jl/blob/v0.2.0-benchmark/benchmark/img/png/bbvv_walltime.png?raw=true)
speedup ![](https://github.com/kaipartmann/Peridynamics.jl/blob/v0.2.0-benchmark/benchmark/img/png/bbvv_speedup.png?raw=true)
profiling results ![](https://github.com/kaipartmann/Peridynamics.jl/blob/v0.2.0-benchmark/benchmark/img/png/bbvv_profiling.png?raw=true)

bond-based, dynamic relaxation (bbdr)

wall time ![](https://github.com/kaipartmann/Peridynamics.jl/blob/v0.2.0-benchmark/benchmark/img/png/bbdr_walltime.png?raw=true)
speedup ![](https://github.com/kaipartmann/Peridynamics.jl/blob/v0.2.0-benchmark/benchmark/img/png/bbdr_speedup.png?raw=true)
profiling results ![](https://github.com/kaipartmann/Peridynamics.jl/blob/v0.2.0-benchmark/benchmark/img/png/bbdr_profiling.png?raw=true)

bond-based multi-material, velocity verlet (bbmmvv)

wall time ![](https://github.com/kaipartmann/Peridynamics.jl/blob/v0.2.0-benchmark/benchmark/img/png/bbmmvv_walltime.png?raw=true)
speedup ![](https://github.com/kaipartmann/Peridynamics.jl/blob/v0.2.0-benchmark/benchmark/img/png/bbmmvv_speedup.png?raw=true)
profiling results ![](https://github.com/kaipartmann/Peridynamics.jl/blob/v0.2.0-benchmark/benchmark/img/png/bbmmvv_profiling.png?raw=true)

continuum-kinematics-based, velocity verlet (cpdvv)

wall time ![](https://github.com/kaipartmann/Peridynamics.jl/blob/v0.2.0-benchmark/benchmark/img/png/cpdvv_walltime.png?raw=true)
speedup ![](https://github.com/kaipartmann/Peridynamics.jl/blob/v0.2.0-benchmark/benchmark/img/png/cpdvv_speedup.png?raw=true)
profiling results ![](https://github.com/kaipartmann/Peridynamics.jl/blob/v0.2.0-benchmark/benchmark/img/png/cpdvv_profiling.png?raw=true)

contact analysis (contact)

wall time ![](https://github.com/kaipartmann/Peridynamics.jl/blob/v0.2.0-benchmark/benchmark/img/png/contact_walltime.png?raw=true)
speedup ![](https://github.com/kaipartmann/Peridynamics.jl/blob/v0.2.0-benchmark/benchmark/img/png/contact_speedup.png?raw=true)
profiling results ![](https://github.com/kaipartmann/Peridynamics.jl/blob/v0.2.0-benchmark/benchmark/img/png/contact_profiling.png?raw=true)

xwuupb commented 10 months ago

Great! Although the bottleneck functions are bound by the memory bandwidth, there are still many optimization techniques that can be used to approach the hardware limits. See you next Monday at the PC2 Hackathon.

kaipartmann commented 10 months ago

Very good! I'm looking forward to the PC2 Hackathon! See you on Monday!

PetrKryslUCSD commented 9 months ago

Using the thread id may expose the code to data races, I believe, unless the threads use static scheduling. There was a longish discussion about this on Discourse.

PetrKryslUCSD commented 9 months ago

https://discourse.julialang.org/t/psa-thread-local-state-is-no-longer-recommended-common-misconceptions-about-threadid-and-nthreads/101274/9

PetrKryslUCSD commented 9 months ago

nchunks = 20
Threads.@threads for ichunk in 1:nchunks
    for bond_id in partitioned_bonds[ichunk] 
        i, j = bonds[bond_id]
        results = some_calculation()
        b_int[:,i,ichunk] += results # <-- ichunk = third dimension
        b_int[:,j,ichunk] -= results # <-- ichunk = third dimension
    end
end

Perhaps?

PetrKryslUCSD commented 9 months ago

Sorry, it would appear I misread your code. You are not actually using the thread id.

kaipartmann commented 9 months ago

Thank you, @PetrKryslUCSD, for pointing out that discussion and the problems with threadid(). Unfortunately, Peridynamics.jl currently uses kind of a mixed approach, as the number of chunks is hardcoded as nchunks = nthreads() and called n_threads: https://github.com/kaipartmann/Peridynamics.jl/blob/805f6a6343981cda75a72b889bf4d98703048910/src/bond_based.jl#L138 https://github.com/kaipartmann/Peridynamics.jl/blob/805f6a6343981cda75a72b889bf4d98703048910/src/continuum_based.jl#L131

I recently attended a Hackathon at PC², where the performance problems of the current multithreading approach could be investigated in detail.

I created a shorter and simplified version of the package with just the functionality of the bbvv benchmark: BBVV.jl. During the hackathon, we were able to improve the multithreading performance significantly (see the compute_forcedensity and fullsim benchmark plots).

Within BBVV.jl, @threads :static and thread-local storage of b_int and n_active_family_members are used. Regarding the linked discussion, it may be interesting to compare the performance of a chunk-based approach with @threads :dynamic. I may try that out.
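
For reference, a chunk-based variant along the lines of the linked Discourse PSA might look like the sketch below. The names nchunks, chunked_bonds, b_int_chunks, and n_points are hypothetical; bonds and some_calculation() follow the pseudocode from the first post. Correctness does not depend on which thread executes which chunk, because every chunk writes only to its own slice of the third dimension, so :dynamic scheduling is safe here.

using Base.Threads

nchunks = 4 * nthreads()    # more chunks than threads for better load balancing
chunked_bonds = collect(Iterators.partition(eachindex(bonds), cld(length(bonds), nchunks)))
b_int_chunks = zeros(3, n_points, length(chunked_bonds))

@threads :dynamic for ichunk in eachindex(chunked_bonds)
    for bond_id in chunked_bonds[ichunk]
        i, j = bonds[bond_id]
        results = some_calculation()
        b_int_chunks[:, i, ichunk] += results   # <-- ichunk = third dimension
        b_int_chunks[:, j, ichunk] -= results
    end
end

# reduce over the chunk dimension (the single/multi reduction from above could be reused)
b_int = dropdims(sum(b_int_chunks; dims=3); dims=3)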