libMesh / libmesh

libMesh github repository
http://libmesh.github.io
GNU Lesser General Public License v2.1

prepare_for_use() taking too long #2706

Closed manavbhatia closed 4 years ago

manavbhatia commented 4 years ago

Hi,

I have a 200x200x200 3D mesh made of Hex8 elements that I am creating using build_cube on a 64 rank communicator.

The prepare_for_use call is getting stuck somewhere and does not return even after an hour.
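The call sequence is roughly the following (a minimal sketch, not the exact production code; the unit-cube extents and the default Mesh class are placeholders):

```c++
#include "libmesh/libmesh.h"
#include "libmesh/mesh.h"
#include "libmesh/mesh_generation.h"
#include "libmesh/enum_elem_type.h"

using namespace libMesh;

int main (int argc, char ** argv)
{
  LibMeshInit init (argc, argv);   // launched on a 64-rank communicator

  Mesh mesh (init.comm());         // default mesh type (placeholder)

  // 200 x 200 x 200 Hex8 elements; the domain extents here are placeholders
  MeshTools::Generation::build_cube (mesh,
                                     200, 200, 200,
                                     0., 1.,
                                     0., 1.,
                                     0., 1.,
                                     HEX8);

  // This is the step that appears to hang
  mesh.prepare_for_use();

  return 0;
}
```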

I am using libMesh@master from spack, built with the following configuration options. I am not sure where the bottleneck is inside prepare_for_use. Is this expected? If so, is there something that can be done to speed things up?

'--enable-glibcxx-debugging=no' '--disable-strict-lgpl' '--disable-hinnant-unique-ptr' '--enable-gzstreams=no' '--disable-bzip2' '--disable-xz' '--without-gdb-command' '--enable-tecio=no' '--enable-tecplot=no' '--enable-capnproto=no' '--enable-exodusii=yes' '--enable-fparser=no' '--enable-gmv=no' '--enable-laspack=no' '--enable-libHilbert=no' '--enable-metaphysicl=yes' '--enable-nanoflann=no' '--enable-nemesis=yes' '--enable-qhull=no' '--enable-sfc=no' '--enable-tetgen=no' '--enable-triangle=no' '--enable-netcdf=yes' '--disable-vtk' '--enable-metaphysicl' '--enable-perflog' '--disable-blocked-storage' '--enable-metis' '--enable-parmetis' '--enable-petsc=yes' '--enable-slepc=yes' '--with-methods=opt' '--enable-openmp=no' '--enable-pthreads=no' '--enable-tbb=no'

manavbhatia commented 4 years ago

Following is the configuration summary for reference:

----------------------------------- SUMMARY -----------------------------------

Package version............... : timpi-1.2

C++ compiler.................. : /apps/gcc-10.2.0/mpich-3.3.2/mpich-3.3.2/bin/mpic++
Build Methods...................... : opt
CPPFLAGS...(opt)................... : -DNDEBUG   
CXXFLAGS...(opt)................... :  -O2 -felide-constructors -funroll-loops -fstrict-aliasing -Wdisabled-optimization   

Build architecture............ : x86_64-pc-linux-gnu
Git revision number........... : 7fbe63d48e9d3ddee41b16935cdf172d3c604eec

-------------------------------------------------------------------------------
Optional Packages for Testing:
  MPI......................... : yes
  MPI_IMPL.................... : mpi
  timpi_optional_INCLUDES..... : -I/apps/gcc-10.2.0/mpich-3.3.2/mpich-3.3.2/include 
  timpi_optional_LIBS......... : -L/apps/gcc-10.2.0/mpich-3.3.2/mpich-3.3.2/lib -lmpi  

Configure complete, now type 'make' and then 'make install'.

----------------------------------- SUMMARY -----------------------------------

Package version.................... : libmesh-1.6.0-pre

C++ compiler type.................. : gcc-other
C++ compiler....................... : /apps/gcc-10.2.0/mpich-3.3.2/mpich-3.3.2/bin/mpic++
C compiler......................... : /apps/gcc-10.2.0/mpich-3.3.2/mpich-3.3.2/bin/mpicc
Fortran compiler................... : /apps/contrib/brg/codes/spack/lib/spack/env/gcc/gfortran
Build Methods...................... : opt

CPPFLAGS...(opt)................... : -DNDEBUG 
CXXFLAGS...(opt)................... :  -O2 -felide-constructors -funroll-loops -fstrict-aliasing -Wdisabled-optimization   
CFLAGS.....(opt)................... : -O2 -funroll-loops -fstrict-aliasing   

Any warnings-to-errors flags....... : 
Any extra paranoid warning flags... : 
Build architecture................. : x86_64-pc-linux-gnu
Git revision....................... : d05e18443ee51444d5878e911df01c52646d99b9

Library Features:
  library warnings................. : yes
  library deprecated code support.. : yes
  adaptive mesh refinement......... : yes
  blocked matrix/vector storage.... : no
  complex variables................ : no
  example suite.................... : yes
  ghosted vectors.................. : yes
  high-order shape functions....... : yes
  unique-id support................ : no
  id size (boundaries)............. : 2 bytes
  id size (dofs)................... : 4 bytes
  id size (processors)............. : 4 bytes
  id size (subdomains)............. : 2 bytes
  infinite elements................ : no
  Dirichlet constraints............ : yes
  node constraints................. : no
  parallel mesh.................... : no
  performance logging.............. : yes
  periodic boundary conditions..... : yes
  reference counting............... : yes
  shape function 2nd derivatives... : yes
  stack trace files................ : no
  track node valence............... : yes
  variational smoother............. : yes
  xdr binary I/O................... : yes

Optional Packages:
  boost............................ : yes
  capnproto........................ : no
  cppunit.......................... : no
  curl............................. : no
  eigen............................ : yes
  exodus........................... : yes
     version....................... : v5.22
  fparser.......................... : no
  glpk............................. : no
  gmv.............................. : no
  gzstream......................... : no
  hdf5............................. : no
  laspack.......................... : no
  libhilbert....................... : no
  metaphysicl...................... : yes
  metis............................ : yes
  mpi.............................. : yes
  nanoflann........................ : no
  nemesis.......................... : yes
     version....................... : v5.22
  netcdf........................... : yes
     version....................... : 4
  nlopt............................ : no
  parmetis......................... : yes
  petsc............................ : yes
     version....................... : 3.13.4
  qhull............................ : no
  sfcurves......................... : no
  slepc............................ : yes
     version....................... : 3.13.4
  thread model..................... : none
  c++ rtti ........................ : yes
  tecio............................ : no
  tecplot...(vendor binaries)...... : no
  tetgen........................... : no
  triangle......................... : no
  trilinos......................... : no
  vtk.............................. : no

jwpeterson commented 4 years ago

Over 1 hour to build_cube() on 8M elements definitely sounds like "deadlocked MPI" more than "really bad performance" to me. It's difficult to debug these types of issues at scale, so I'd recommend scaling back in mesh size and/or processor count until you find a case or two that is working. That would at least allow you to extrapolate whether > 1 hr would be a reasonable time for the desired 200^3 case.

As far as diagnosing the hang, if you re-compile in oprof mode, you may be able to attach to one or more of the running processes using GDB and try to figure out where the hang is. If almost all of them appear to be in some MPI collective, then it probably means there's one process which isn't, indicating a deadlock.

manavbhatia commented 4 years ago

I experimented with the problem size a bit and was able to get the 180x180x180 mesh to go through. I suspect that the 200x200x200 problem may be spilling into swap space because of its large memory footprint, resulting in much slower progress. If so, this would not be an MPI issue but a memory bottleneck. Thoughts?

Isn't there a way to partition a mesh a priori and then read it in for the production run? If so, would that be expected to alleviate the memory issue (if that is indeed the case)?

roystgnr commented 4 years ago

Thoughts?

It's at least a memory bottleneck. The way libMesh distributed mesh generation currently works is that we generate a DistributedMesh in serial and then we partition it and remove the remote elements on each processor. This is simply because our generators predate DistributedMesh, and obviously this scales painfully poorly. Except for Nemesis and checkpoint IO, our distributed-mesh generators are not distributed mesh-generators.
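Regarding the a-priori partitioning question: one pattern along those lines is to write the mesh out in checkpoint format once and have production runs read it back. A rough sketch (assuming the .cpr extension is routed to checkpoint I/O by the generic read()/write() calls; the pre-processing run still pays the serial-generation memory cost once):

```c++
// One-time pre-processing run: generate, partition, and write a checkpoint.
Mesh mesh (init.comm());
MeshTools::Generation::build_cube (mesh, 200, 200, 200,
                                   0., 1., 0., 1., 0., 1., HEX8);
mesh.write ("mesh.cpr");   // checkpoint format (assumed .cpr naming)

// Production runs read the pre-built mesh instead of regenerating it.
Mesh production_mesh (init.comm());
production_mesh.read ("mesh.cpr");
```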

In the long term, we need to beg @fdkong to move his new distributed mesh-generator from MOOSE upstream into libMesh.

In the short term, you'd be best off generating a 25x25x25 mesh and then doing 3 uniform refinements, flattening afterward if you need/want to get rid of the coarse elements. The refinement gets done after mesh distribution and should scale much better. If the coarse load balancing is okay then you could also turn off repartitioning during the refinement to get to your fine mesh even faster.
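A rough sketch of that recipe (assuming a unit-cube domain and a DistributedMesh; adjust extents and options as needed):

```c++
#include "libmesh/libmesh.h"
#include "libmesh/distributed_mesh.h"
#include "libmesh/mesh_generation.h"
#include "libmesh/mesh_refinement.h"
#include "libmesh/mesh_modification.h"
#include "libmesh/enum_elem_type.h"

using namespace libMesh;

int main (int argc, char ** argv)
{
  LibMeshInit init (argc, argv);

  DistributedMesh mesh (init.comm());

  // Cheap coarse generation: 25^3 Hex8 elements (unit-cube extents are placeholders)
  MeshTools::Generation::build_cube (mesh, 25, 25, 25,
                                     0., 1., 0., 1., 0., 1., HEX8);

  // Optional: keep the coarse partitioning during refinement
  // mesh.skip_partitioning(true);

  // Three uniform refinements after distribution: 25 * 2^3 = 200 elements per side
  MeshRefinement refinement (mesh);
  refinement.uniformly_refine (3);

  // Drop the coarse ancestor elements if the refinement tree isn't needed
  MeshTools::Modification::flatten (mesh);

  return 0;
}
```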

jwpeterson commented 4 years ago

a 64 rank communicator.

I should have asked, are these 64 procs all on the same machine, i.e. using the same pool of memory? Another way to find out if it's indeed a memory issue is to use fewer procs... with ReplicatedMesh that will give you that many fewer copies of the Mesh to allocate...

jwpeterson commented 4 years ago

In the short term, you'd be best off generating a 25x25x25 mesh and then doing 3 uniform refinements,

This, combined with using DistributedMesh, should actually work pretty well? That way the refined elements are only created on their parents' procs, right?

roystgnr commented 4 years ago

Almost - at the borders between procs, you'll still have ghost elements, and you'll still have the ghost elements' ancestors.

So in the final 200x200x200 active-element mesh, if each of 64 processors owns around 125k of them, they'll also have copies of nearly 20k can't-be-deleted ghosts and ghosts' ancestors. That's still better than having copies of 7875k haven't-yet-been-deleted ghosts, though.

manavbhatia commented 4 years ago

Thanks @roystgnr for your recommendation about uniform refinement and flattening. I was able to get it to work.

fdkong commented 4 years ago

In the long term, we need to beg @fdkong to move his new distributed mesh-generator from MOOSE upstream into libMesh.

Yes, that is fine with me. If we are going to do this, we need to move the partitioner as well.