Implement faster matrix assembly

kronbichler commented 8 months ago

Here is an attempt that should make the matrix assembly on the patch considerably faster. Instead of trying to use the matrix-free infrastructure directly (which would need to filter out different cases), I chose an entry point to FEEvaluation with the same ingredients as FEValues, i.e., reinit via cell_iterator. This code is wasteful because it computes every column in full, rather than only the entries selected for the patch. Nonetheless, the underlying complexities and data structures are such that, on my machine, this brings the assembly cost down to the point where inserting the entries into the matrix takes more than 70% of the time in this function.
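To illustrate the trade-off described here, the following is a minimal, library-independent sketch of the general idea: build a dense local matrix column by column by applying an operator to unit vectors, and only afterwards keep the entries that belong to the patch. It does not use deal.II; `apply_operator` (a simple 1D stencil), `compute_full_matrix`, and `restrict_to_patch` are hypothetical names chosen for illustration and are not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<std::vector<double>>; // row-major: A[row][col]

// Hypothetical stand-in for a matrix-free operator evaluation:
// applies the 1D stencil [-1, 2, -1] to src (assumption, not deal.II code).
Vector apply_operator(const Vector &src)
{
  const std::size_t n = src.size();
  Vector            dst(n, 0.0);
  for (std::size_t i = 0; i < n; ++i)
    {
      dst[i] = 2.0 * src[i];
      if (i > 0)
        dst[i] -= src[i - 1];
      if (i + 1 < n)
        dst[i] -= src[i + 1];
    }
  return dst;
}

// Build the full dense matrix: column j is the operator applied to the
// j-th unit vector. Every column is computed in full, even if only a
// few of its entries are needed later.
Matrix compute_full_matrix(const std::size_t                            n,
                           const std::function<Vector(const Vector &)> &op)
{
  Matrix A(n, Vector(n, 0.0));
  Vector unit(n, 0.0);
  for (std::size_t j = 0; j < n; ++j)
    {
      unit[j]          = 1.0;
      const Vector col = op(unit);
      assert(col.size() == n);
      for (std::size_t i = 0; i < n; ++i)
        A[i][j] = col[i];
      unit[j] = 0.0; // reset for the next column
    }
  return A;
}

// Keep only the rows and columns whose indices are listed for the patch.
Matrix restrict_to_patch(const Matrix                   &A,
                         const std::vector<std::size_t> &patch_indices)
{
  const std::size_t m = patch_indices.size();
  Matrix            P(m, Vector(m, 0.0));
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t j = 0; j < m; ++j)
      {
        assert(patch_indices[i] < A.size() && patch_indices[j] < A.size());
        P[i][j] = A[patch_indices[i]][patch_indices[j]];
      }
  return P;
}
```

Calling `compute_full_matrix(4, apply_operator)` and then `restrict_to_patch(A, {1, 3})` yields a 2x2 block. The work spent on rows and columns 0 and 2 is discarded, which is the kind of waste the comment refers to.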

marcfehling commented 8 months ago

These are the relevant timings for the multigrid part of the Stokes 3D reference scenario. The machine I ran the experiments on was shared, so the timings may be inflated. For this scenario, the average time for setting up the smoother preconditioners drops by about 65% (39.08s to 13.58s), and the average total runtime drops by about 27% (91.8s to 67.03s).

Before:

Running with dealii & Trilinos on 96 MPI rank(s)...
| Section                    | no. calls |   min time  rank |   avg time |   max time  rank |
+----------------------------------------+------------------+------------+------------------+
| full_cycle                 |         1 |     91.23s    88 |      91.8s |     92.15s     9 |
| mg_setup_levels            |         1 |     41.37s     1 |     41.39s |     41.42s    67 |
|   mg_setup_level_smoothers |        13 |     39.02s    58 |     39.08s |     39.14s    30 |  <---
| mg_reinit_transfer         |         1 |    0.5896s    67 |    0.6178s |    0.6407s     1 |
| mg_solve                   |         1 |     44.07s    60 |     44.25s |     44.29s    36 |

After:

Running with dealii & Trilinos on 96 MPI rank(s)...
| Section                    | no. calls |   min time  rank |   avg time |   max time  rank |
+----------------------------------------+------------------+------------+------------------+
| full_cycle                 |         1 |     66.44s    23 |     67.03s |     67.35s    29 |
| mg_setup_levels            |         1 |     15.99s    26 |     16.01s |     16.04s    47 |
|   mg_setup_level_smoothers |        13 |     13.54s    13 |     13.58s |     13.65s    32 |  <---
| mg_reinit_transfer         |         1 |    0.6313s    47 |    0.6539s |    0.6738s    25 |
| mg_solve                   |         1 |     44.37s    58 |     44.54s |     44.59s    64 |
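
The relative savings can be recomputed from the "avg time" column of the two tables above. The helper names below (`percent_reduction`, `speedup_factor`) are hypothetical and serve only to make the arithmetic explicit.

```cpp
#include <cassert>

// Percentage by which the runtime shrinks when going from `before` to `after`.
double percent_reduction(const double before, const double after)
{
  assert(before > 0.0);
  return 100.0 * (before - after) / before;
}

// Factor by which the section became faster (before / after).
double speedup_factor(const double before, const double after)
{
  assert(after > 0.0);
  return before / after;
}
```

With the average timings listed above, `percent_reduction(39.08, 13.58)` is roughly 65 for mg_setup_level_smoothers (a factor of about 2.9), and `percent_reduction(91.8, 67.03)` is roughly 27 for full_cycle.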

marcfehling commented 8 months ago

I will try this approach as well for the Poisson assembly.

kronbichler commented 8 months ago

Excellent! This looks more reasonable (assembly should rarely take more than 10% of the run time in a FEM solver). If I understand the numbers correctly, mg_setup_level_smoothers also includes the part where the additive Schwarz smoother factorizes its matrices, so the pure assembly time improved even more. I am glad this will make life slightly easier when doing the experiments.

marcfehling / hpbox

Implement faster matrix assembly #12