Closed: manavbhatia closed this issue 4 years ago.
Following is the configuration summary for reference:
----------------------------------- SUMMARY -----------------------------------
Package version............... : timpi-1.2
C++ compiler.................. : /apps/gcc-10.2.0/mpich-3.3.2/mpich-3.3.2/bin/mpic++
Build Methods...................... : opt
CPPFLAGS...(opt)................... : -DNDEBUG
CXXFLAGS...(opt)................... : -O2 -felide-constructors -funroll-loops -fstrict-aliasing -Wdisabled-optimization
Build architecture............ : x86_64-pc-linux-gnu
Git revision number........... : 7fbe63d48e9d3ddee41b16935cdf172d3c604eec
-------------------------------------------------------------------------------
Optional Packages for Testing:
MPI......................... : yes
MPI_IMPL.................... : mpi
timpi_optional_INCLUDES..... : -I/apps/gcc-10.2.0/mpich-3.3.2/mpich-3.3.2/include
timpi_optional_LIBS......... : -L/apps/gcc-10.2.0/mpich-3.3.2/mpich-3.3.2/lib -lmpi
Configure complete, now type 'make' and then 'make install'.
----------------------------------- SUMMARY -----------------------------------
Package version.................... : libmesh-1.6.0-pre
C++ compiler type.................. : gcc-other
C++ compiler....................... : /apps/gcc-10.2.0/mpich-3.3.2/mpich-3.3.2/bin/mpic++
C compiler......................... : /apps/gcc-10.2.0/mpich-3.3.2/mpich-3.3.2/bin/mpicc
Fortran compiler................... : /apps/contrib/brg/codes/spack/lib/spack/env/gcc/gfortran
Build Methods...................... : opt
CPPFLAGS...(opt)................... : -DNDEBUG
CXXFLAGS...(opt)................... : -O2 -felide-constructors -funroll-loops -fstrict-aliasing -Wdisabled-optimization
CFLAGS.....(opt)................... : -O2 -funroll-loops -fstrict-aliasing
Any warnings-to-errors flags....... :
Any extra paranoid warning flags... :
Build architecture................. : x86_64-pc-linux-gnu
Git revision....................... : d05e18443ee51444d5878e911df01c52646d99b9
Library Features:
library warnings................. : yes
library deprecated code support.. : yes
adaptive mesh refinement......... : yes
blocked matrix/vector storage.... : no
complex variables................ : no
example suite.................... : yes
ghosted vectors.................. : yes
high-order shape functions....... : yes
unique-id support................ : no
id size (boundaries)............. : 2 bytes
id size (dofs)................... : 4 bytes
id size (processors)............. : 4 bytes
id size (subdomains)............. : 2 bytes
infinite elements................ : no
Dirichlet constraints............ : yes
node constraints................. : no
parallel mesh.................... : no
performance logging.............. : yes
periodic boundary conditions..... : yes
reference counting............... : yes
shape function 2nd derivatives... : yes
stack trace files................ : no
track node valence............... : yes
variational smoother............. : yes
xdr binary I/O................... : yes
Optional Packages:
boost............................ : yes
capnproto........................ : no
cppunit.......................... : no
curl............................. : no
eigen............................ : yes
exodus........................... : yes
version....................... : v5.22
fparser.......................... : no
glpk............................. : no
gmv.............................. : no
gzstream......................... : no
hdf5............................. : no
laspack.......................... : no
libhilbert....................... : no
metaphysicl...................... : yes
metis............................ : yes
mpi.............................. : yes
nanoflann........................ : no
nemesis.......................... : yes
version....................... : v5.22
netcdf........................... : yes
version....................... : 4
nlopt............................ : no
parmetis......................... : yes
petsc............................ : yes
version....................... : 3.13.4
qhull............................ : no
sfcurves......................... : no
slepc............................ : yes
version....................... : 3.13.4
thread model..................... : none
c++ rtti ........................ : yes
tecio............................ : no
tecplot...(vendor binaries)...... : no
tetgen........................... : no
triangle......................... : no
trilinos......................... : no
vtk.............................. : no
Over 1 hour to build_cube() on 8M elements definitely sounds like "deadlocked MPI" more than "really bad performance" to me. It's difficult to debug these types of issues at scale, so I'd recommend scaling back the mesh size and/or processor count until you find a case or two that works. That would at least allow you to extrapolate whether > 1 hr would be a reasonable time for the desired 200^3 case.
As for diagnosing the hang: if you re-compile in oprof mode, you may be able to attach to one or more of the running processes with gdb and try to figure out where the hang is. If almost all of them appear to be in some MPI collective, then it probably means there's one process which isn't, indicating a deadlock.
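Complementary to attaching gdb, and not something suggested in the thread itself, a low-tech way to narrow things down is to bracket the suspect call with per-rank output. A minimal sketch, assuming the mesh object is available at the call site (the helper name is hypothetical):

#include <iostream>
#include "libmesh/mesh_base.h"

// Hypothetical helper: print per-rank messages around the suspect call.
// Ranks that print "entering" but never "returned" are still inside
// prepare_for_use(); if that is all of them, the hang is in a collective
// inside that call, and gdb can then be attached to any of those processes.
void prepare_with_logging (libMesh::MeshBase & mesh)
{
  std::cerr << "rank " << mesh.comm().rank()
            << ": entering prepare_for_use()" << std::endl;
  mesh.prepare_for_use ();
  std::cerr << "rank " << mesh.comm().rank()
            << ": prepare_for_use() returned" << std::endl;
}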
I experimented with the problem size a bit and was able to get the 180x180x180 mesh to go through. I suspect the 200x200x200 problem may be hitting swap space due to its large memory footprint, resulting in much slower progress. If so, this would not be an MPI issue but a memory bottleneck. Thoughts?
Isn't there a way to partition a mesh a priori and then read it in for the production run? If so, would that be expected to alleviate the memory issue (if that is indeed the case)?
Thoughts?
It's at least a memory bottleneck. The way libMesh distributed mesh generation currently works is that we generate a DistributedMesh in serial and then partition it and remove the remote elements on each processor. This is simply because our generators predate DistributedMesh, and obviously it scales painfully poorly. Except for Nemesis and checkpoint I/O, our generators of distributed meshes are not themselves distributed mesh generators.
In the long term, we need to beg @fdkong to move his new distributed mesh-generator from MOOSE upstream into libMesh.
In the short term, you'd be best off generating a 25x25x25 mesh and then doing 3 uniform refinements, flattening afterward if you need/want to get rid of the coarse elements. The refinement gets done after mesh distribution and should scale much better. If the coarse load balancing is okay then you could also turn off repartitioning during the refinement to get to your fine mesh even faster.
a 64 rank communicator.
I should have asked, are these 64 procs all on the same machine, i.e. using the same pool of memory? Another way to find out if it's indeed a memory issue is to use fewer procs... with ReplicatedMesh that will give you that many fewer copies of the Mesh to allocate...
In the short term, you'd be best off generating a 25x25x25 mesh and then doing 3 uniform refinements,
This, combined with using DistributedMesh, should actually work pretty well? That way the refined elements are only created on their parents' procs, right?
Almost: at the borders between procs, you'll still have ghost elements, and you'll still have the ghost elements' ancestors.
So in the final 200x200x200 active-element mesh, if each of 64 processors owns around 125k of them, they'll also have copies of nearly 20k can't-be-deleted ghosts and ghosts' ancestors. That's still better than having copies of 7875k haven't-yet-been-deleted ghosts (the remaining 8M - 125k elements), though.
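Putting these suggestions together, a rough sketch of the workaround might look like the following. The unit-cube bounds, HEX8 element type, and exact refinement count (25 * 2^3 = 200 elements per side) follow the numbers discussed above; the commented-out skip_partitioning() call corresponds to the optional "turn off repartitioning" suggestion.

#include "libmesh/libmesh.h"
#include "libmesh/distributed_mesh.h"
#include "libmesh/mesh_generation.h"
#include "libmesh/mesh_refinement.h"
#include "libmesh/mesh_modification.h"
#include "libmesh/enum_elem_type.h"

using namespace libMesh;

int main (int argc, char ** argv)
{
  LibMeshInit init (argc, argv);

  // The 25^3 coarse mesh is still generated serially on each rank,
  // but it is small enough that this is cheap.
  DistributedMesh mesh (init.comm());
  MeshTools::Generation::build_cube (mesh, 25, 25, 25,
                                     0., 1., 0., 1., 0., 1.,
                                     HEX8);

  // Optional: skip repartitioning during refinement if the coarse
  // load balance is already acceptable.
  // mesh.skip_partitioning(true);

  // Three uniform refinements give 25 * 2^3 = 200 elements per side,
  // performed after the coarse mesh has been distributed.
  MeshRefinement refinement (mesh);
  refinement.uniformly_refine (3);

  // Flatten to discard the coarse ancestor elements if only the
  // active 200^3 mesh is wanted.
  MeshTools::Modification::flatten (mesh);

  mesh.print_info ();
  return 0;
}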
Thanks @roystgnr for your recommendation about uniform refinement and flattening. I was able to get it to work.
In the long term, we need to beg @fdkong to move his new distributed mesh-generator from MOOSE upstream into libMesh.
Yes, that is fine with me. If we are going to do this, we need to move the partitioner as well.
Hi,
I have a 200x200x200 3D mesh made of Hex8 elements that I am creating using build_cube on a 64 rank communicator. The prepare_for_use call is getting stuck somewhere and does not return even after an hour. I am using libMesh@master from spack, which has the following configuration options. I am not sure where the bottleneck is inside prepare_for_use. Is this expected? If so, is there something that can be done to speed things up?
'--enable-glibcxx-debugging=no' '--disable-strict-lgpl' '--disable-hinnant-unique-ptr' '--enable-gzstreams=no' '--disable-bzip2' '--disable-xz' '--without-gdb-command' '--enable-tecio=no' '--enable-tecplot=no' '--enable-capnproto=no' '--enable-exodusii=yes' '--enable-fparser=no' '--enable-gmv=no' '--enable-laspack=no' '--enable-libHilbert=no' '--enable-metaphysicl=yes' '--enable-nanoflann=no' '--enable-nemesis=yes' '--enable-qhull=no' '--enable-sfc=no' '--enable-tetgen=no' '--enable-triangle=no' '--enable-netcdf=yes' '--disable-vtk' '--enable-metaphysicl' '--enable-perflog' '--disable-blocked-storage' '--enable-metis' '--enable-parmetis' '--enable-petsc=yes' '--enable-slepc=yes' '--with-methods=opt' '--enable-openmp=no' '--enable-pthreads=no' '--enable-tbb=no'
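For reference, a minimal sketch of the setup described above might look like this. The unit-cube bounds and the ReplicatedMesh type are assumptions (the issue text does not say which mesh class is used, and the "parallel mesh: no" configuration above makes the default Mesh a ReplicatedMesh).

#include "libmesh/libmesh.h"
#include "libmesh/replicated_mesh.h"
#include "libmesh/mesh_generation.h"
#include "libmesh/enum_elem_type.h"

using namespace libMesh;

int main (int argc, char ** argv)
{
  LibMeshInit init (argc, argv);   // run with, e.g., mpiexec -np 64 ./reproducer

  ReplicatedMesh mesh (init.comm());

  // 200 x 200 x 200 Hex8 elements = 8M elements, generated serially on
  // every rank.
  MeshTools::Generation::build_cube (mesh, 200, 200, 200,
                                     0., 1., 0., 1., 0., 1.,
                                     HEX8);

  // The call reported above to not return even after an hour.
  mesh.prepare_for_use ();

  mesh.print_info ();
  return 0;
}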