Closed (kronbichler closed 8 months ago)
These are the relevant timings for the multigrid part of the Stokes 3D reference scenario. The machine I ran the experiments on was shared, so the timings may be larger than they would be on a dedicated machine. For this scenario, the setup of the smoother preconditioners takes about 65% less time (39.08s to 13.58s on average), and the total runtime drops by about 27% (91.8s to 67.03s).
Before:
Running with dealii & Trilinos on 96 MPI rank(s)...
| Section                    | no. calls |   min time  rank |   avg time |   max time  rank |
+----------------------------------------+------------------+------------+------------------+
| full_cycle                 |         1 |     91.23s    88 |      91.8s |     92.15s     9 |
| mg_setup_levels            |         1 |     41.37s     1 |     41.39s |     41.42s    67 |
| mg_setup_level_smoothers   |        13 |     39.02s    58 |     39.08s |     39.14s    30 | <---
| mg_reinit_transfer         |         1 |    0.5896s    67 |    0.6178s |    0.6407s     1 |
| mg_solve                   |         1 |     44.07s    60 |     44.25s |     44.29s    36 |
After:
Running with dealii & Trilinos on 96 MPI rank(s)...
| Section                    | no. calls |   min time  rank |   avg time |   max time  rank |
+----------------------------------------+------------------+------------+------------------+
| full_cycle                 |         1 |     66.44s    23 |     67.03s |     67.35s    29 |
| mg_setup_levels            |         1 |     15.99s    26 |     16.01s |     16.04s    47 |
| mg_setup_level_smoothers   |        13 |     13.54s    13 |     13.58s |     13.65s    32 | <---
| mg_reinit_transfer         |         1 |    0.6313s    47 |    0.6539s |    0.6738s    25 |
| mg_solve                   |         1 |     44.37s    58 |     44.54s |     44.59s    64 |
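For anyone who wants to re-derive the percentages from the average times in the two tables, here is a minimal sketch. The helper names reduction_percent and speedup_factor are made up for this illustration and are not part of the code base.

```cpp
#include <cassert>

// Relative reduction in wall time, in percent:
// 100 * (before - after) / before.
double reduction_percent(const double before, const double after)
{
  assert(before > 0.0);
  return 100.0 * (before - after) / before;
}

// Speed-up factor: how many times faster the new code runs
// compared to the old one.
double speedup_factor(const double before, const double after)
{
  assert(after > 0.0);
  return before / after;
}
```

With the average times above, reduction_percent(39.08, 13.58) gives roughly 65% for mg_setup_level_smoothers and reduction_percent(91.8, 67.03) roughly 27% for full_cycle. The corresponding speed-up factors are about 2.9 and 1.4.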
I will try this approach as well for the Poisson assembly.
Excellent! This looks more reasonable (assembly should rarely take more than 10% of run time in a FEM solver). If I understand the numbers correctly, mg_setup_level_smoothers now also contains the part where the additive Schwarz smoother factorizes its matrices, so the pure assembly time improved even more. I am glad this will make life slightly easier when doing the experiments.
Here is an attempt that should make the matrix assembly on the patch considerably faster. Instead of trying to use the matrix-free infrastructure directly (which would need to filter out different cases), I opted to use the entry point to FEEvaluation with the same ingredients as FEValues, i.e., reinit via cell_iterator. This code is wasteful because it computes the full column, rather than only the entries selected for the patch. Nonetheless, the underlying complexities and data structures are such that on my machine, this brings the assembly cost down to the point where inserting the entries into the matrix takes more than 70% of the time in this function.
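To illustrate the "compute the full column, then keep only the patch entries" idea, here is a toy sketch. It does not use the deal.II API: apply_operator is a hypothetical stand-in (a 1D stencil [-1, 2, -1]) for the operator evaluation that FEEvaluation performs in the PR, and assemble_patch_matrix is a made-up name for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the matrix-free operator evaluation:
// applies the 1D stencil [-1, 2, -1] to src.
std::vector<double> apply_operator(const std::vector<double> &src)
{
  const std::size_t   n = src.size();
  std::vector<double> dst(n, 0.0);
  for (std::size_t i = 0; i < n; ++i)
    {
      dst[i] = 2.0 * src[i];
      if (i > 0)
        dst[i] -= src[i - 1];
      if (i + 1 < n)
        dst[i] -= src[i + 1];
    }
  return dst;
}

// Assemble the patch matrix (row-major, patch.size() x patch.size())
// by applying the operator to one unit vector per selected column.
// Each application computes the full column of length n, which is the
// wasteful part; only the rows listed in patch are kept.
std::vector<double>
assemble_patch_matrix(const std::size_t               n,
                      const std::vector<std::size_t> &patch)
{
  const std::size_t   m = patch.size();
  std::vector<double> matrix(m * m, 0.0);
  std::vector<double> unit(n, 0.0);
  for (std::size_t j = 0; j < m; ++j)
    {
      assert(patch[j] < n);
      unit[patch[j]] = 1.0; // unit vector for column patch[j]
      const std::vector<double> column = apply_operator(unit);
      unit[patch[j]] = 0.0; // reset for the next column
      for (std::size_t i = 0; i < m; ++i)
        matrix[i * m + j] = column[patch[i]];
    }
  return matrix;
}
```

For example, assemble_patch_matrix(4, {1, 2}) returns the 2 x 2 block coupling indices 1 and 2. The filtering loop discards everything else in each column, which is the overhead the description refers to.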