idaholab / moose

Multiphysics Object Oriented Simulation Environment
https://www.mooseframework.org
GNU Lesser General Public License v2.1

Hung simulation when trying to load mesh with --distributed-mesh option #28760

Open shikhar413 opened 1 week ago

shikhar413 commented 1 week ago

Bug Description

When loading a mesh with the --distributed-mesh option on Sawtooth for a job with 4 nodes, MOOSE gets stuck on this step for a very long time:

  Setting Up Undisplaced Mesh
    Preparing Mesh.....................................................................

Older versions of MOOSE do not exhibit this behavior. It was determined that upgrading the mpi-mpich version in conda from 4.0 to 4.2 caused this issue to manifest.

More information about the specifics of the issue can be found in this Slack thread: https://moosedevelopers.slack.com/archives/C01054VRUEM/p1727831556160119

Steps to Reproduce

Use an input with a Mesh block containing a FileMeshGenerator, Problem/solve=false, and Executioner/type=Steady, and run it with the --distributed-mesh option.
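
The setup described above can be sketched as a small Python helper that renders a minimal input and assembles a launch command. This is only an illustration of the reported configuration, not the actual Griffin input: the sub-block name `fmg`, the mesh file name `mesh.e`, the executable name `./app-opt`, and the use of `mpiexec` as the launcher are all placeholder assumptions. The rank count follows the 4 nodes with 24 MPI processes per node reported under Diagnostics.

```python
# Sketch of a reproducer for the hang; names marked "hypothetical" are not from the report.

# Minimal MOOSE input: a FileMeshGenerator, no solve, Steady executioner.
INPUT_TEMPLATE = """\
[Mesh]
  [fmg]  # hypothetical sub-block name
    type = FileMeshGenerator
    file = {mesh_file}
  []
[]

[Problem]
  solve = false
[]

[Executioner]
  type = Steady
[]
"""


def render_input(mesh_file: str) -> str:
    """Return the input file text for the given mesh file."""
    return INPUT_TEMPLATE.format(mesh_file=mesh_file)


def launch_command(app: str, input_path: str,
                   nodes: int = 4, ranks_per_node: int = 24) -> list:
    """Build an mpiexec command line that runs the app with --distributed-mesh."""
    total_ranks = nodes * ranks_per_node
    return ["mpiexec", "-n", str(total_ranks),
            app, "-i", input_path, "--distributed-mesh"]


if __name__ == "__main__":
    # "mesh.e", "./app-opt" and "repro.i" are hypothetical placeholders.
    print(render_input("mesh.e"))
    print(" ".join(launch_command("./app-opt", "repro.i")))
```

Under a PBS job the launcher and its arguments may differ from this sketch; only the input blocks and the --distributed-mesh flag come from the report.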

Impact

Unable to run a Griffin simulation that imports a mesh.

[Optional] Diagnostics

PBS job on Sawtooth with 4 nodes and 24 MPI processes per node

lindsayad commented 3 days ago

I believe mpich fundamentally doesn't work for multi-node runs on INL HPC due to missing communication protocols. @loganharbour ?

GiudGiud commented 3 days ago

I think it's only if you use containers?

loganharbour commented 3 days ago

I believe Shikhar no longer has this issue with OpenMPI (containerized).

Quick recap on MPICH: conda or no conda, it does not contain optimized all-to-all routines for use on many ranks. Its implementation involves setting up a port from every rank to every rank. This scales terribly, and will eventually use up all TCP network ports on each node. MVAPICH and OpenMPI have different optimized implementations that treat all-to-all communications in a tree-like manner.
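
To put rough numbers on the scaling argument above, here is a small counting sketch in Python. It is only a back-of-the-envelope model, not a description of MPICH, MVAPICH, or OpenMPI internals: it assumes one link per rank pair for the every-rank-to-every-rank case, uses the depth of a binary tree as a stand-in for "tree-like", and assumes that only rank pairs on different nodes need network ports. The function names are made up for illustration.

```python
# Counting model for all-to-all connection growth (illustrative assumptions only).

def pairwise_links(ranks: int) -> int:
    """Distinct rank pairs when every rank connects to every other rank."""
    return ranks * (ranks - 1) // 2


def off_node_peers_per_node(nodes: int, ranks_per_node: int) -> int:
    """(local rank, remote rank) pairs that cross a given node's boundary."""
    remote_ranks = (nodes - 1) * ranks_per_node
    return ranks_per_node * remote_ranks


def tree_depth(ranks: int) -> int:
    """ceil(log2(ranks)): levels of a binary tree spanning all ranks."""
    return (ranks - 1).bit_length()


if __name__ == "__main__":
    nodes, ranks_per_node = 4, 24  # job size from the report
    ranks = nodes * ranks_per_node
    print("total ranks:", ranks)
    print("pairwise links:", pairwise_links(ranks))
    print("off-node peers per node:", off_node_peers_per_node(nodes, ranks_per_node))
    print("tree depth:", tree_depth(ranks))
```

For the reported 96-rank job this model gives 4560 rank pairs and 1728 cross-node pairs per node, against a tree depth of 7. The pairwise count grows quadratically with the number of ranks, while the tree depth grows logarithmically.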

Regardless of the above, we shouldn't be using conda environments on HPC hosts.

In addition, @giudgiud, please don't point fingers at containers unless we have examples of issues. This isn't one, and I haven't found one yet despite larger runs being done. Happy to look into them if folks find them. Just trying to keep miscommunication to a minimum.