idaholab / moose

Multiphysics Object Oriented Simulation Environment
https://www.mooseframework.org
GNU Lesser General Public License v2.1

ParMETIS called through MOOSE may deliver broken partitioning #25691

Closed · YaqiWang closed this issue 3 months ago

YaqiWang commented 1 year ago

Reason

As shown at https://mooseframework.inl.gov/source/partitioner/PetscExternalPartitioner.html, where a 10-by-10 regular grid is partitioned into 8 subdomains, ParMETIS produces disconnected subdomains. Jan Vermaak and @jthano report that they do not typically see this with their codes. Such bad partitioning can hurt performance and often goes unnoticed by users. ParMETIS is our current default partitioner in parallel runs, so we need to look into this.
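For reference, a minimal input sketch (untested; parameter names taken from the PetscExternalPartitioner documentation page linked above) that should exercise the reported case when run on 8 MPI ranks:

[Mesh]
  [gmg]
    type = GeneratedMeshGenerator
    dim = 2
    nx = 10
    ny = 10
  []
  [Partitioner]
    type = PetscExternalPartitioner
    # 'parmetis' is the package that produces the disconnected subdomains described above
    part_package = parmetis
  []
[]

Running this with mpiexec -n 8 should show the 8-subdomain partitioning in question.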

Design

@roystgnr mentioned that calling ParMETIS directly through libMesh for that 10-by-10 regular grid did not produce this broken partitioning, which indicates that the problem likely lies in how we call ParMETIS in our MOOSE workflow. A separate point: if ParMETIS is a superset of METIS, we might want to replace METIS with it even in serial runs. Tag @lindsayad
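For comparison, the libMesh-driven ParMETIS path can be selected from a MOOSE input roughly like this (a sketch; whether this matches exactly how @roystgnr invoked libMesh is an assumption):

[Mesh]
  [gmg]
    type = GeneratedMeshGenerator
    dim = 2
    nx = 10
    ny = 10
  []
  [Partitioner]
    # ParMETIS through the libMesh interface rather than through PETSc
    type = LibmeshPartitioner
    partitioner = parmetis
  []
[]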

Impact

This should not affect converged solutions, but fixing it can improve performance for calculations that currently suffer from bad partitioning.

roystgnr commented 1 year ago

> ParMETIS is our current default partitioner in parallel runs

Are you calling Parmetis via a PetscExternalPartitioner? If so then I'd not call that a "default"; it's something that has to be manually specified in the input file, right? Parmetis is the default when a DistributedMesh is partitioned, but when that's the case it's called via the libMesh interface, and the results there are perhaps suboptimal (they seem to depend a bit too much on the initial setup, which makes me think we should be defaulting to a space-filling curve rather than element numbering for the initial iterate...), but they're not the utter garbage we're seeing in that 8-subdomain PetscExternalPartitioner test case.
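For context, the DistributedMesh default path mentioned above is reached without naming any partitioner at all, e.g. (a sketch, assuming the standard parallel_type mesh option):

[Mesh]
  [gmg]
    type = GeneratedMeshGenerator
    dim = 2
    nx = 10
    ny = 10
  []
  # libMesh partitions this at prepare time with its default (ParMETIS) partitioner
  parallel_type = distributed
[]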

The trouble with the PetscExternalPartitioner case is ... I honestly don't see what we should or could be doing differently in MOOSE. PETSc provides us with this nice partitioner-independent API, and if we pass "ptscotch" as a string to that API then we get an excellent partitioning whereas if we pass "parmetis" then we get garbage. Maybe there's some setup that we should be doing that Parmetis depends on but that party/chaco/ptscotch all ignore?
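Concretely, the only knob exposed at this level is the package string handed to PETSc; switching it is a one-line change (sketch, parameter name assumed from the PetscExternalPartitioner docs):

[Mesh]
  [Partitioner]
    type = PetscExternalPartitioner
    # 'ptscotch' gives an excellent partitioning through this PETSc API...
    part_package = ptscotch
    # ...while 'parmetis' through the identical call path gives the broken one:
    # part_package = parmetis
  []
[]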

oanaoana commented 5 months ago

@YaqiWang, we fixed this in the framework by adding a switch to ptscotch in cases where the mesh is too small for ParMETIS to perform well. I am having trouble with a Griffin test, griffin/radiation_transport/test/tests/iqs/twigl_multischeme/twigl_step.i, and it seems to actually default to the libMesh ParMETIS partitioner, not the PETSc one. Do you mind if I change the mesh block in the test to avoid any mix-up with the PETSc one?

[Mesh]
  [cmg]
    type = CartesianMeshGenerator
    dim = 2
    dx = '30 40 30'
    dy = '30 40 30'
    # one element per interval, i.e. a 3-by-3-element mesh
    ix = '1 1 1'
    iy = '1 1 1'
    subdomain_id =
     '1 2 1
      2 3 1
      1 1 1'
  []
  [Partitioner]
    # explicitly selects ParMETIS through the libMesh interface, not through PETSc
    type = LibmeshPartitioner
    partitioner = parmetis
  []
[]