IMHO, it is important to determine the arithmetic intensity of this part of the operations, at least to find out whether it is compute-bound or memory-bound. This matters because compute-bound computations normally scale with the number of threads, e.g. matrix-matrix multiplication, whereas memory-bound computations usually do not scale with the number of threads because of the limited memory bandwidth. One typical example is matrix-vector multiplication.
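To make the distinction concrete, here is a back-of-the-envelope estimate of the arithmetic intensity (FLOPs per byte of memory traffic) for those two examples, assuming Float64 operands and ideal cache reuse; the numbers are rough textbook estimates, not measurements of this package:

```julia
# Rough arithmetic-intensity estimate (FLOPs per byte), assuming Float64
# (8 bytes per value) and that each matrix/vector is moved once from memory.
n = 10_000

# matrix-vector product: ~2n^2 FLOPs, ~(n^2 + 2n) values moved
ai_gemv = (2 * n^2) / (8 * (n^2 + 2 * n))   # ≈ 0.25 FLOP/byte -> memory bound

# matrix-matrix product: ~2n^3 FLOPs, ~3n^2 values moved (ideal blocking)
ai_gemm = (2 * n^3) / (8 * 3 * n^2)         # ≈ n/12 FLOP/byte -> compute bound

println((gemv = ai_gemv, gemm = ai_gemm))
```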
Thank you @xwuupb for your valuable insights regarding the arithmetic intensity of the time-consuming portions of the package. I apologize for the delay in my response; it took some time to thoroughly investigate and conduct an in-depth analysis and code profiling to address your concerns.
@xwuupb, I would greatly value your feedback on the benchmark results. If you have any further observations, recommendations, or questions regarding the presented findings, please do not hesitate to share them.
Profiling of the most computationally intensive functions, notably the `compute_forcedensity!` functions, suggests that these operations are memory-bound. This limitation arises from the extensive reading and writing of data during the computation. More details about these functions can be found in the source code, e.g.:
https://github.com/kaipartmann/Peridynamics.jl/blob/5632d539d36053bd25680bedb84067e45e7fa3b5/src/bond_based.jl#L189-L212
One noteworthy observation is that the performance of the package varies significantly across different features and simulation types. Each of these features entails running distinct code segments, resulting in divergent performance characteristics. Therefore, I have designed 5 benchmarks that cover all features of the package:
- `bbvv.jl`
- `bbdr.jl`
- `bbmmvv.jl`
- `cpdvv.jl`
- `contact.jl`

The benchmarks and the results can be found in the `v0.2.0-benchmark` branch.
The scaling of `bbvv`, `bbdr`, and `bbmmvv` is limited when using 8 to 64 threads. This is notable even with the export of `b_int` and `velocity` disabled for benchmarking. Exporting fewer VTK files to reduce overhead is not a viable option, because crack propagation often occurs within just a few timesteps, and capturing this phenomenon accurately is crucial for creating meaningful result videos. You can review the relevant code section here:
https://github.com/kaipartmann/Peridynamics.jl/blob/396e41ab28958533eafe662e051ce42b8a8369f2/src/io.jl#L177-L183

For `cpdvv`, our analysis indicates what we already know: the CPD (continuum-kinematics-based peridynamics) model is inherently expensive. Consequently, VTK export operations have a minimal impact on its performance.

Benchmark results:
- `bbvv`
- `bbdr`
- `bbmmvv`
- `cpdvv`
- `contact`
Great! Although the bottleneck functions are bound by the memory bandwidth, there are still many optimization techniques that can be used to approach the hardware limits. See you next Monday at the PC2 Hackathon.
Very good! I'm looking forward to the PC2 Hackathon! See you on Monday!
Using the thread id may expose the code to data races, I believe, unless the threads use static scheduling. There was a longish discussion about this on Discourse.
```julia
nchunks = 20
Threads.@threads for ichunk in 1:nchunks
    for bond_id in partitioned_bonds[ichunk]
        i, j = bonds[bond_id]
        results = some_calculation()
        b_int[:, i, ichunk] += results # <-- ichunk = third dimension
        b_int[:, j, ichunk] -= results # <-- ichunk = third dimension
    end
end
```
Perhaps?
Sorry, it would appear I misread your code. You are not actually using the thread id.
Thank you, @PetrKryslUCSD, for pointing out that discussion and the problems with `threadid()`. Unfortunately, Peridynamics.jl currently uses a kind of mixed approach, as the number of chunks is hardcoded as `nchunks = nthreads()` and called `n_threads`:
https://github.com/kaipartmann/Peridynamics.jl/blob/805f6a6343981cda75a72b889bf4d98703048910/src/bond_based.jl#L138
https://github.com/kaipartmann/Peridynamics.jl/blob/805f6a6343981cda75a72b889bf4d98703048910/src/continuum_based.jl#L131
I recently attended a hackathon at PC², where the performance problems of the current multithreading approach could be worked out. Currently, `body.b_int` is extended by one dimension, so that each thread writes into its own slice:
https://github.com/kaipartmann/Peridynamics.jl/blob/5632d539d36053bd25680bedb84067e45e7fa3b5/src/bond_based.jl#L200-L205
The same approach is used for `body.n_active_family_members`:
https://github.com/kaipartmann/Peridynamics.jl/blob/805f6a6343981cda75a72b889bf4d98703048910/src/bond_based.jl#L211-L212
This approach avoids race conditions, but because all threads write into the same underlying array, the cache coherence protocol forces them to update entire cache lines whenever a value in the array changes, which produces a remarkable overhead. Furthermore, it is very RAM-intensive, since each thread holds a full-size slice of the array, including indices it does not even need.

I created a shorter and simplified version of the package, with just the functionality of the `bbvv` benchmark: BBVV.jl
During the hackathon, we were able to improve the multithreading performance significantly.
Within BBVV.jl, `@threads :static` and thread-local storage of `b_int` and `n_active_family_members` are used. Regarding the discussion, it may be interesting to see how a chunk-based approach with `@threads :dynamic` performs in comparison. I may try that out.
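For illustration, here is a minimal sketch of this thread-local-storage pattern with static scheduling; `bonds` and `bond_force` are placeholder names, not the actual BBVV.jl internals:

```julia
using Base.Threads

# Minimal sketch of the thread-local-storage pattern with static scheduling.
# `bonds` and `bond_force` are placeholders, not the actual BBVV.jl API.
function compute_b_int_tls!(b_int::Matrix{Float64}, bonds, n_points::Int)
    # one private 3×n_points buffer per thread
    b_int_tls = [zeros(3, n_points) for _ in 1:nthreads()]
    @threads :static for bond_id in eachindex(bonds)
        tid = threadid() # stable within an iteration because of :static scheduling
        i, j = bonds[bond_id]
        f = bond_force(bond_id, i, j) # placeholder for the bond force density
        @views b_int_tls[tid][:, i] .+= f
        @views b_int_tls[tid][:, j] .-= f
    end
    # reduce the thread-local buffers into the shared result
    fill!(b_int, 0)
    for buf in b_int_tls
        b_int .+= buf
    end
    return b_int
end
```

Because each thread only ever touches its own buffer, no atomics are needed and false sharing between threads is avoided; the cost is the extra memory for the per-thread buffers and the final reduction.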
This is officially fixed with v0.3.
Current approach
The bottleneck of Peridynamics simulations lies in the computation of the force density `b_int` (see the `compute_forcedensity!` methods). In the serial case, this computation looks similar to the following code:
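(A rough sketch; `position`, `bonds`, and `bond_const` are placeholder names, not the package's exact data structures.)

```julia
# Rough serial sketch of the force density computation (placeholder names):
function compute_forcedensity_serial!(b_int, position, bonds, bond_const)
    fill!(b_int, 0)
    for (bond_id, (i, j)) in enumerate(bonds)
        Δx = position[:, j] .- position[:, i]   # current bond vector
        l = sqrt(Δx[1]^2 + Δx[2]^2 + Δx[3]^2)   # current bond length
        temp = bond_const[bond_id] / l          # simplified scalar force factor
        b_int[:, i] .+= temp .* Δx              # writes into column i ...
        b_int[:, j] .-= temp .* Δx              # ... and into column j
    end
    return b_int
end
```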
Since the computation writes into columns `i` and `j` of the `b_int` matrix, it cannot be easily parallelized. Therefore, the force density is extended by one dimension, and each thread operates on its own 2D matrix. Finally, all the results are summed up efficiently into the first of these matrices. Currently, this and other computations are implemented using `Threads.@threads`, where each thread operates on its own partition of the corresponding vector.
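A condensed sketch of this scheme, assuming a precomputed `bond_partition` (one set of bond ids per thread) and the same placeholder names as above:

```julia
using Base.Threads

# Sketch of the current approach: b_int gets a third dimension of size
# nthreads(), each thread accumulates into its own slice, and the slices
# are then reduced into the first one. Names are placeholders.
function compute_forcedensity_threaded!(b_int3d, position, bonds, bond_const,
                                        bond_partition)
    fill!(b_int3d, 0)
    @threads for tid in 1:nthreads()
        for bond_id in bond_partition[tid]
            i, j = bonds[bond_id]
            Δx = position[:, j] .- position[:, i]
            temp = bond_const[bond_id] / sqrt(Δx[1]^2 + Δx[2]^2 + Δx[3]^2)
            @views b_int3d[:, i, tid] .+= temp .* Δx
            @views b_int3d[:, j, tid] .-= temp .* Δx
        end
    end
    for tid in 2:nthreads()                     # reduce into the first slice
        @views b_int3d[:, :, 1] .+= b_int3d[:, :, tid]
    end
    return b_int3d
end
```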
Benchmark
However, the scaling for more than ten threads is not good, even with the improvements in the summation (see https://github.com/kaipartmann/Peridynamics.jl/pull/9#issue-1463421762).
Roadmap
The main goal is to make the package suitable for larger HPC simulations. The significant RAM requirements of simulations with `ContinuumBasedMaterial` are also a limiting factor for the multithreading approach. Therefore, an effective solution could be a distributed approach with MPI.jl in combination with multithreading (similar to Trixi.jl). Alternatively, other options such as Dagger.jl look very promising.
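As a rough illustration only, a hybrid MPI.jl + threads skeleton could look like the following; nothing here is part of Peridynamics.jl yet, and all names and the partitioning strategy are placeholders:

```julia
using MPI
using Base.Threads

# Hypothetical hybrid MPI + threads skeleton (not part of the package).
MPI.Init(threadlevel = :funneled)   # only the main thread makes MPI calls
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

# Each rank would own a chunk of the points and the bonds touching them;
# here we only report the configuration.
println("rank $rank of $nranks running with $(nthreads()) threads")

# Per time step (very schematically):
#   1. compute the local force density with the multithreaded kernel
#   2. exchange contributions for points shared between ranks, e.g. with
#      MPI.Allreduce! on the overlap region or point-to-point communication
#   3. update velocity and displacement locally

MPI.Finalize()
```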