kaipartmann commented 11 months ago

Current approach

The bottleneck of Peridynamics simulations lies in the computation of the force density b_int (see the compute_forcedensity! methods). In the serial case, this computation looks similar to the following code:

for (i,j) in bonds
    results = some_calculation()
    b_int[:,i] += results # contribution for point i
    b_int[:,j] -= results # contribution for point j
end

Since the computation writes into columns i and j of the b_int matrix, it cannot be easily parallelized. Therefore, the force density is extended by one dimension, and each thread operates on its own 2D matrix. Finally, all the results are summed up into the first column efficiently. Currently, this and other computations are implemented using Threads.@threads, where each thread operates on its own partition of the corresponding vector.

#--- main computation
@threads for tid in 1:nthreads()
    for bond_id in partitioned_bonds[tid] 
        i, j = bonds[bond_id]
        results = some_calculation()
    b_int[:,i,tid] += results # <-- tid = third dimension
    b_int[:,j,tid] -= results # <-- tid = third dimension
    end
end

#--- reduction into first tid
# single_tids::Vector{Tuple{Int,Int}} 
# --> vector containing tuples with (<point id>, <thread id>) 
#     for all points where the computations occur on only one thread (except 1)
@threads for (point_id, tid) in single_tids
    b_int[:,point_id,1] = b_int[:,point_id,tid]
end
# multi_tids::Vector{Tuple{Int,Vector{Int}}
# --> vector containing tuples with (<point id>, <vector of all thread id's except 1>)
#     for all points where the computations occur on multiple threads
@threads for (point_id, tids) in multi_tids
    for tid in tids
        b_int[:,point_id,1] += b_int[:,point_id,tid]
    end
end

#---
# usage of b_int[:,:,1] ...

Benchmark

However, the scaling for more than ten threads is not good, even with the improvements in the summation (see https://github.com/kaipartmann/Peridynamics.jl/pull/9#issue-1463421762).

nthreads	btime [s]	speedup
1	293.12	1
2	150.98	1.94
8	46.61	6.29
12	38.6	7.59
16	35.93	8.16
24	37.95	7.72
32	49.7	5.9

Roadmap

The main goal is to make the package suitable for larger HPC simulations. The significant RAM requirements for simulations with ContinuumBasedMaterial are also a limiting factor for the multithreading approach. Therefore, an effective solution could be using a distributed approach with MPI.jl in combination with Multithreading (similar to Trixi.jl). Alternatively, other options such as Dagger.jl look very promising.

xwuupb commented 10 months ago

IMHO, it's important to find out the arithmetic intensity for this part of operations. At least, to find it out, whether it is compute bound or memory bound. This is very important, because normally the compute bound computations scale w.r.t. the number of threads, e.g. matrix matrix multiplication. However, the memory bound computations usually do not scale w.r.t. the number of threads, because of the limited memory bandwidth. One typical example is matrix vector multiplication.

kaipartmann commented 10 months ago

Thank you @xwuupb for your valuable insights regarding the arithmetic intensity of the time-consuming portions of the package. I apologize for the delay in my response; it took some time to thoroughly investigate and conduct an in-depth analysis and code profiling to address your concerns.

@xwuupb, I would greatly value your feedback on the benchmark results. If you have any further observations, recommendations, or questions regarding the presented findings, please do not hesitate to share them.

Benchmarks for Peridynamics.jl@v0.2.0

TL;DR

Memory-Bound Operations: \ The bottleneck functions are memory-bound due to extensive read and write operations
Heterogeneous Performance: \ Performance varies significantly across different simulation types and therefore 5 benchmarks are needed to cover all features of the package
Multithreading Limitations: \ Serial IO involving the VTK-export is a large component of computing time, which to my knowledge cannot benefit from multithreading

Memory-Bound Operations

Profiling of the most computationally intensive functions, notably the compute_forcedensity! functions, suggests that these operations are memory-bound. This limitation arises from the extensive reading and writing activities within the computation process. More details about these functions can be found in the source code, e.g.: https://github.com/kaipartmann/Peridynamics.jl/blob/5632d539d36053bd25680bedb84067e45e7fa3b5/src/bond_based.jl#L189-L212

Heterogeneous Performance

One noteworthy observation is that the performance of the package varies significantly across different features and simulation types. Each of these features entails running distinct code segments, resulting in divergent performance characteristics. Therefore, I have designed 5 benchmarks that cover all features of the package:

bond-based, velocity verlet (bbvv.jl)
bond-based, dynamic relaxation (bbdr.jl)
bond-based multi-material, velocity verlet (bbmmvv.jl)
continuum-kinematics-based, velocity verlet (cpdvv.jl)
contact analysis (contact.jl)

The benchmarks and the results can be found in the v0.2.0-benchmark branch.

Key Observations

VTK Export Overhead: \ One significant observation is that VTK export operations account for approximately 50% of the total wall time for simulations such as bbvv, bbdr, and bbmmvv when using 8 to 64 threads. This is notable even with the export of b_int and velocity disabled for benchmarking. Exporting fewer VTK files to reduce overhead is not a viable option due to the fact that crack propagation often occurs within just a few timesteps. Capturing this phenomenon accurately is crucial for creating meaningful result videos. You can review the relevant code section here: https://github.com/kaipartmann/Peridynamics.jl/blob/396e41ab28958533eafe662e051ce42b8a8369f2/src/io.jl#L177-L183
Costly CPD Model: \ In the case of cpdvv, our analysis indicates what we already know - that the CPD (continuum-kinematics-based peridynamics) model is inherently expensive. Consequently, VTK export operations have a minimal impact on its performance.
Initialization Efficiency: \ The initialization phase of the simulation, represented as "init. body" in the performance plots, scales well when compared to the computation of force densities, denoted as "comp. forcedensity" in the plots.
Expensive Contact Force Densities: \ The computation of contact force densities emerges as a particularly expensive aspect of the simulations, demanding careful optimization.
Multithreading Limitations: \ Across all simulations, it is evident that multithreading struggles to scale efficiently. The computationally intensive segments of the code are primarily constrained by memory-bound operations with a lot of reading and writing to large arrays, highlighting the need for alternative approaches such as distributed computing.

![](https://github.com/kaipartmann/Peridynamics.jl/blob/v0.2.0-benchmark/benchmark/img/png/contact_profiling.png?raw=true)

xwuupb commented 10 months ago

Great! Although the bottleneck functions are bound by the memory bandwidth, there are still many optimization techniques can be used to approach the hardware limits. See you on next Monday in PC2 Hackathon.

kaipartmann commented 10 months ago

Very good! I'm looking forward to the PC2 Hackathon! See you on Monday!

PetrKryslUCSD commented 9 months ago

Using thread id may expose the code to data races I believe. Unless the threads use static scheduling. There was a longish discussion on the discourse.

PetrKryslUCSD commented 9 months ago

https://discourse.julialang.org/t/psa-thread-local-state-is-no-longer-recommended-common-misconceptions-about-threadid-and-nthreads/101274/9

PetrKryslUCSD commented 9 months ago

nchunks = 20
Threads.@threads for ichunk in 1:nchunks
    for bond_id in partitioned_bonds[ichunk] 
        i, j = bonds[bond_id]
        results = some_calculation()
    b_int[:,i, ichunk] += results # <-- ichunk = third dimension
    b_int[:,j, ichunk] -= results # <-- ichunk = third dimension
    end
end

Perhaps?

PetrKryslUCSD commented 9 months ago

Sorry, it would appear I misred your code. You are not actually using the thread id.

kaipartmann commented 9 months ago

Thank you, @PetrKryslUCSD, for pointing out that discussion and the problems with threadid(). Unfortionately, Peridynamics.jl currently uses kind of a mixed approach, as the number of chunks is hardcoded as nchunks = nthreads() and called n_threads: https://github.com/kaipartmann/Peridynamics.jl/blob/805f6a6343981cda75a72b889bf4d98703048910/src/bond_based.jl#L138 https://github.com/kaipartmann/Peridynamics.jl/blob/805f6a6343981cda75a72b889bf4d98703048910/src/continuum_based.jl#L131

I recently attended a Hackathon at PC² where the performance problems of the current multithreading approach could be worked out.

False sharing \ Each thread writes on it's own part of the same 3-dimensional array body.b_int: https://github.com/kaipartmann/Peridynamics.jl/blob/5632d539d36053bd25680bedb84067e45e7fa3b5/src/bond_based.jl#L200-L205 The same approach is used for body.n_active_family_members: https://github.com/kaipartmann/Peridynamics.jl/blob/805f6a6343981cda75a72b889bf4d98703048910/src/bond_based.jl#L211-L212 This approach avoids race conditions, but by having all threads write to the same array, the caching protocol forces them to update the entire cache block every time a value in the array is changed, which produces a remarkable overhead. Furthermore, this is very RAM intensive since each thread has a full version of the array with all indices that it doesn't even need.
Data access pattern\ The existing thread assignment strategy for the bond partitions neglects the data locality of the points, inhibiting better data access patterns.
Using carstenbauer/ThreadPinning.jl\ Another performance improvement was achieved by using the ThreadPinning.jl package. Pinning with respect to NUMA domains had the greatest effect on performance.

I created a shorter and simplified version of the package, with just the functionality of the bbvv-benchmark: BBVV.jl During the hackathon, we were able to improve the multithreading performance significantly. compute_forcedensity fullsim

Within BBVV.jl, @threads :static and a thread local storage of b_int and n_active_family_members is used. Regarding the discussion, it may be interesting to see the performance of a chunk-based approach with @threads :dynamic in comparison. I may try that out.

kaipartmann / Peridynamics.jl

HPC suitability #14

Current approach

Benchmark

Roadmap

Benchmarks for Peridynamics.jl@v0.2.0

TL;DR

Memory-Bound Operations

Heterogeneous Performance

Key Observations

Results

bond-based, velocity verlet (`bbvv`)

bond-based, dynamic relaxation (`bbdr`)

bond-based multi-material, velocity verlet (`bbmmvv`)

continuum-kinematics-based, velocity verlet (`cpdvv`)

contact analysis (`contact`)

kaipartmann / Peridynamics.jl

HPC suitability #14

Current approach

Benchmark

Roadmap

Benchmarks for Peridynamics.jl@v0.2.0

TL;DR

Memory-Bound Operations

Heterogeneous Performance

Key Observations

Results

bond-based, velocity verlet (bbvv)

bond-based, dynamic relaxation (bbdr)

bond-based multi-material, velocity verlet (bbmmvv)

continuum-kinematics-based, velocity verlet (cpdvv)

contact analysis (contact)

bond-based, velocity verlet (`bbvv`)

bond-based, dynamic relaxation (`bbdr`)

bond-based multi-material, velocity verlet (`bbmmvv`)

continuum-kinematics-based, velocity verlet (`cpdvv`)

contact analysis (`contact`)