Further optimize smear

When running simulations on a very large number of MPI nodes, the computation is mostly bottlenecked by communication in the MPI sweep of smear. The sweep currently loops over the effective cells and issues several reduction calls per cell (for mass, momentum, chemical species, and energy). In principle, the total number of reduction calls can be cut to just two if the values are first stored in an array whose length is the number of effective cells.

Plan:

Split the angular_smear_global subroutines into two parts, one for each reduction call (mass + momentum + chemical elements, and energy).
The new subroutines return only the array of locally integrated quantities. The reduction is done outside the subroutines.
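
To illustrate the idea, here is a toy model in pure Python (not HORMONE code, and not using a real MPI library). `FakeComm`, `smear_per_cell` and `smear_batched` are hypothetical names made up for this sketch; `FakeComm.allreduce_sum` just stands in for an MPI all-reduce and counts how often it is called. Chemical species are collapsed to a single number per cell for brevity. It only shows that packing the per-cell values into arrays gives the same totals with two reductions instead of four per cell.

```python
class FakeComm:
    """Hypothetical stand-in for an MPI communicator (assumption, not real MPI)."""

    def __init__(self):
        self.calls = 0  # number of reduction calls issued

    def allreduce_sum(self, per_rank):
        # per_rank holds one equal-length list of floats per simulated rank.
        self.calls += 1
        return [sum(vals) for vals in zip(*per_rank)]


def smear_per_cell(comm, local):
    """Current pattern: one reduction per quantity per effective cell.

    local[rank][cell] = (mass, momentum, species, energy)
    """
    ncells = len(local[0])
    out = []
    for c in range(ncells):
        cell = []
        for q in range(4):
            total = comm.allreduce_sum([[rank[c][q]] for rank in local])[0]
            cell.append(total)
        out.append(tuple(cell))
    return out


def smear_batched(comm, local):
    """Proposed pattern: integrate locally into arrays, then reduce twice."""
    ncells = len(local[0])
    # Reduction 1: mass, momentum and species for all cells in one flat buffer.
    buf1 = [[rank[c][q] for c in range(ncells) for q in range(3)]
            for rank in local]
    tot1 = comm.allreduce_sum(buf1)
    # Reduction 2: energy for all cells.
    buf2 = [[rank[c][3] for c in range(ncells)] for rank in local]
    tot2 = comm.allreduce_sum(buf2)
    return [(tot1[3 * c], tot1[3 * c + 1], tot1[3 * c + 2], tot2[c])
            for c in range(ncells)]


# Two simulated ranks, two effective cells (made-up numbers).
local = [
    [(1.0, 2.0, 3.0, 4.0), (5.0, 6.0, 7.0, 8.0)],
    [(10.0, 20.0, 30.0, 40.0), (50.0, 60.0, 70.0, 80.0)],
]
old, new = FakeComm(), FakeComm()
assert smear_per_cell(old, local) == smear_batched(new, local)
print(old.calls, new.calls)  # prints "8 2"
```

In the per-cell version the call count grows with the number of effective cells, while the batched version always issues two calls; only the message size grows.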

ryosuke-hirai / HORMONE

Further optimize smear #84