idaholab / moose

Multiphysics Object Oriented Simulation Environment
https://www.mooseframework.org
GNU Lesser General Public License v2.1

Need to suppress redundant repartitioning and/or distribution during mesh generation #26195

Open roystgnr opened 9 months ago

roystgnr commented 9 months ago

Reason

Repartitioning is expensive. We do it after every mesh change anyway, because a properly load-balanced mesh pays off in the long run during a simulation. But we shouldn't be doing it after a serially-calculated mesh change that is just going to be followed by other calculated-in-serial mesh changes rather than by simulation steps.

@miaoyinb came up with a mesh generation example which runs in ~11 seconds in serial but in 81 seconds on 4 processors, because 70 seconds is spent partitioning and repartitioning every little section of a patterned mesh generator.

To make things more complicated, however, the mesh should still be partitioned after mesh changes that are going to be followed by properly parallelized mesh changes. Consider a mesh extrusion or mesh refinement, which can add orders of magnitude more elements to a mesh, but which can be done efficiently, in embarrassingly parallel fashion, on a distributed mesh.

Design

My idea for a solution is to propagate that "don't bother yet" information backwards through the tree of MeshGenerator objects. If generator N isn't going to benefit from a properly partitioned/distributed mesh, we might as well tell generator N-1 (and, through it, generator N-2, any branches of the tree, etc.) not to bother yet. But if generator N+1 is doing something like an efficiently parallelized extrusion or refinement, then generator N doesn't get the "don't bother" message, and we do the partitioning when generator N finishes.
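As a rough illustration of that backward propagation, here is a standalone toy model. It is not MOOSE code: `ToyGenerator`, `serial_work`, `dont_bother`, and `propagateDontBother` are all hypothetical names invented for this sketch, and the rule that a generator which has been told not to bother passes the message on to its own inputs is one reading of the idea above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a mesh generator; not a MOOSE class.
struct ToyGenerator
{
  std::string name;
  // True if this generator's own work is serial, so it gains nothing from
  // receiving a partitioned/distributed input mesh.
  bool serial_work = false;
  // Upstream generators whose output meshes this generator consumes.
  std::vector<ToyGenerator *> inputs;
  // Set by the consumer: "don't bother partitioning your output yet".
  bool dont_bother = false;
};

// Walk from a generator back through its inputs, passing the message along.
// Call this on the final generator in the tree, whose own flag stays false
// because its output goes on to the simulation.
void propagateDontBother(ToyGenerator & gen)
{
  // Inputs can skip partitioning if this generator does serial work, or if
  // it has itself been told not to bother (assumed transitive rule).
  const bool tell_inputs = gen.serial_work || gen.dont_bother;
  for (ToyGenerator * input : gen.inputs)
  {
    assert(input != nullptr);
    input->dont_bother = tell_inputs;
    propagateDontBother(*input);
  }
}
```

For a chain `cell -> pattern -> extrude` in which only `extrude` is parallel-friendly, calling `propagateDontBother(extrude)` flags `cell` but leaves `pattern` unflagged, so the single partitioning happens after the patterning step and right before the extrusion.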

This might work since we have that opportunity in the typical MeshGenerator constructor. We could add an optional generation_type parameter to MeshGenerator::getMesh and MeshGenerator::getMeshes. It would default to the generator's current generation_type member variable, which would itself default to (some enum value of) "distributed", but a generator could manually downgrade it. The input mesh(es) that are downgraded would then see that value in their own MeshGenerator::generation_type member, and a generator could disable distribution and/or partitioning accordingly ... or better yet, its superclass could: if we handled that generation_type in buildMeshBaseObject etc. perhaps most mesh generators wouldn't even need to be aware of it...
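A minimal sketch of what that interface shape could look like, again as a standalone toy rather than the real MOOSE classes. `SketchGenerator`, `shouldPartitionOutput`, and the enum values are assumptions made up for illustration, and the real `getMesh` takes different arguments. Since a C++ default argument cannot refer to a member of `this`, the sketch uses `std::optional` (C++17) to express "default to my own generation_type". It also ignores the order in which real generators are constructed.

```cpp
#include <optional>

// Hypothetical enum; the issue only says "some enum value of distributed".
enum class GenerationType
{
  Serial,
  Distributed
};

// Hypothetical stand-in for the MeshGenerator base class.
class SketchGenerator
{
public:
  // Request an input mesh. With no second argument the input inherits this
  // generator's own generation type; passing one is the manual downgrade.
  SketchGenerator & getMesh(SketchGenerator & input,
                            std::optional<GenerationType> requested = std::nullopt)
  {
    input._generation_type = requested.value_or(_generation_type);
    return input;
  }

  // What a superclass could check once a generator finishes, so that
  // individual generators never need to look at the type themselves.
  bool shouldPartitionOutput() const
  {
    return _generation_type == GenerationType::Distributed;
  }

private:
  GenerationType _generation_type = GenerationType::Distributed;
};
```

In this toy, a patterning generator would call `getMesh(cell, GenerationType::Serial)` to downgrade its input, while an extrusion would call `getMesh(pattern)` and leave the default in place.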

I'm still trying to figure out exactly how to do this best; hence the hand-wavy nature of the design above.

Impact

The enhancement wouldn't change any existing APIs (just ABIs), but it would change existing behavior strongly enough that it could well trigger bugs in input mesh generators.

If I'm missing something important about this problem, then the increased complication from this fix's design could interfere with any better fix ideas we come up with later. But I think at this point I've got a (vague) idea whose increased complication (some optional arguments, and most work done in the superclass) is minimal enough that it's worth a try anyway.

roystgnr commented 9 months ago

The anti-scaling mesh generation case, and input file to replicate it:

run_mpi1 run_mpi4 test_mpi.txt