idaholab / moose

Multiphysics Object Oriented Simulation Environment
https://www.mooseframework.org
GNU Lesser General Public License v2.1

MPI on Sawtooth crashing with memory intensive problems #21107

Open harterj opened 2 years ago

harterj commented 2 years ago

Bug Description

Large (memory-intensive) simulations with replicated meshes do not run under MPI on INL Sawtooth. Continual MPI-based crashes and memory errors are encountered on Sawtooth, even though the same problems set up and run locally. Multiple attempts with various MPI distributions have not fixed the issue. The crashes occur when heat conduction and thermal hydraulics are coupled and transfers between the two are required. Reference issue #20099.

Steps to Reproduce

Input files and mesh are attached. The full benchmark has 10 axial zones, each with its own power distribution; this input is simplified to a single axial zone with averaged power densities. However, it still has all of the RELAP channels, approximately 7,000. The problem is a pseudo-transient.

Please compile the "CombinedApp" in MOOSE, as this input needs the heat_conduction and thermal_hydraulics modules. Running in any MPI configuration on HPC with the suggested modules does not work.

Compilation steps (Sawtooth):
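Roughly, the build looks like the sketch below. The environment module names and paths are assumptions, not the exact steps used; substitute whatever MOOSE build environment is currently recommended on INL HPC.

```sh
# Assumed INL HPC environment modules -- adjust to the currently
# recommended MOOSE build environment on Sawtooth.
module load use.moose moose-dev

# Build the in-tree combined modules app, which includes both
# heat_conduction and thermal_hydraulics.
cd /path/to/moose/modules/combined
make -j 12    # produces the combined-opt executable
```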

Using split-mesh is not an option, as the required transfers do not work with distributed meshes. Please see the issue referenced above for more on that subject.
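For context, the mesh-splitting workflow that was ruled out looks roughly like the following (a sketch based on MOOSE's --split-mesh/--use-split options; the input file name and split count are placeholders):

```sh
# Pre-split the replicated mesh into one piece per MPI rank...
./combined-opt -i MHTGR350.i --split-mesh 90

# ...then run on the pre-split (distributed) mesh. This is the path
# that does not work here, because the required transfers fail with
# distributed meshes.
mpiexec -n 90 ./combined-opt -i MHTGR350.i --use-split
```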

With PBS scripts on Sawtooth, I found that I could start the problem using 6 procs per node with a maximum of 15 nodes, 90 procs total. With any more than that, I experienced crashes.
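A sketch of the PBS job that does start (15 nodes, 6 MPI ranks per node, 90 ranks total); the job name, walltime, cores-per-node count, module names, and input file name are assumptions:

```sh
#!/bin/bash
#PBS -N mhtgr350
#PBS -l select=15:ncpus=48:mpiprocs=6   # 6 MPI ranks per node, 90 ranks total
#PBS -l walltime=08:00:00
#PBS -j oe

cd $PBS_O_WORKDIR
module load use.moose moose-dev         # same assumed environment as the build sketch above

# Under-subscribing the nodes (6 of 48 cores) leaves more memory per rank,
# which matters because the replicated mesh is duplicated on every rank.
mpiexec -n 90 ./combined-opt -i MHTGR350.i
```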

I've also tried running this problem with limited processors on debug nodes: 4 nodes with 192 procs crashed, and 1 node with 6 procs also crashed during setup.

Files: MHTGR350.zip (the mesh is too large to attach; please email me at jackson.harter@inl.gov so I can share it with you).

Any suggestions are very welcome.

Impact

This is directly hindering my work with our customers. The suggested workarounds are not effective, and the problem also affects other MOOSE users whose work cannot be shared. The MHTGR350 benchmark is publicly accessible, so troubleshooting it may help fix issues for multiple users.

friedmud commented 2 years ago

We need to see what the errors are in order to help. What may seem to be a memory error might not be. Please copy and paste some of the errors here.

Also: we need a lot more information about the problem. How many elements, how many variables, what shape functions, how many dofs? All of this information is at the top of your solve - feel free to copy/paste the system information here.

harterj commented 2 years ago

Working on getting you the information to show the errors -- I just recompiled on Lemhi and am waiting on Sawtooth jobs to start, but I should have it this afternoon. In the meantime, here are the simulation stats for 20 nodes with 6 procs per node on Sawtooth:

Framework Information:
  MOOSE Version: git commit 6455f91545 on 2022-05-16
  LibMesh Version: 5a7a9bd0e7295628f465c3bb4e42563b8b8c1a9c
  PETSc Version: 3.16.5
  SLEPc Version: 3.16.2
  Current Time: Thu May 19 17:37:21 2022
  Executable Timestamp: Wed May 18 17:29:13 2022

Parallelism:
  Num Processors: 120
  Num Threads: 1

Mesh:
  Parallel Type: replicated
  Mesh Dimension: 3
  Spatial Dimension: 3
  Nodes: Total: 2963184, Local: 25480, Min/Max/Avg: 19328/26024/24693
  Elems: Total: 1911906, Local: 15584, Min/Max/Avg: 15218/18479/15932
  Num Subdomains: 30
  Num Partitions: 120
  Partitioner: metis

Nonlinear System:
  Num DOFs: 2963184
  Num Local DOFs: 25480
  Variables: "temperature"
  Finite Element Types: "LAGRANGE"
  Approximation Orders: "FIRST"

Auxiliary System:
  Num DOFs: 9652434
  Num Local DOFs: 86824
  Variables: "powerDensity" "tfluid" "Hw_channel"
  Finite Element Types: "L2_LAGRANGE" "LAGRANGE" "MONOMIAL"
  Approximation Orders: "FIRST" "FIRST" "CONSTANT"

Execution Information:
  Executioner: Transient
  TimeStepper: IterationAdaptiveDT
  Solver Mode: Preconditioned JFNK
  MOOSE Preconditioner: SMP

This is with ~7,000 RELAP channels as well.

harterj commented 2 years ago

This is with the averaged power-density input I attached. The full 10-axial-zone simulation has ~20M elements.