
Optimization of Mesh Reading / Initialization / Splitting #1821

Open - friedmud opened this issue 6 years ago

friedmud commented 6 years ago

I'm currently splitting a 2.2GB Exodus mesh with about 55 Million elements using 48 MPI procs. I'm splitting it a few ways: ~9000 procs, ~500 procs, ~100 procs, etc.

Doing so has uncovered some pretty large inefficiencies that I'd appreciate some discussion on.

Here is a profile run (using Google Perf Tools - showing just processor 0) for splitting it for 144 procs:

[Image: profile001 - gperftools call graph for processor 0]

And here are the "top" functions from that profiling

(pprof) top
Showing nodes accounting for 462.35s, 53.41% of 865.65s total
Dropped 282 nodes (cum <= 4.33s)
Showing top 10 nodes out of 118
      flat  flat%   sum%        cum   cum%
   122.81s 14.19% 14.19%    122.81s 14.19%  std::_Rb_tree::equal_range
    98.43s 11.37% 25.56%    122.04s 14.10%  libMesh::GhostPointNeighbors::operator()
    52.40s  6.05% 31.61%    113.26s 13.08%  MooseMesh::cacheInfo
    47.91s  5.53% 37.15%     47.91s  5.53%  __memset_sse2
    28.58s  3.30% 40.45%     28.80s  3.33%  _int_malloc
    27.16s  3.14% 43.58%     27.16s  3.14%  std::_Rb_tree::find
    23.06s  2.66% 46.25%     23.06s  2.66%  libMesh::Predicates::pid::operator()
    22.24s  2.57% 48.82%     22.41s  2.59%  std::_Rb_tree::_M_get_insert_unique_pos
    20.29s  2.34% 51.16%     58.02s  6.70%  variant_filter_iterator::Pred::operator()
    19.47s  2.25% 53.41%     19.87s  2.30%  MPIDI_CH3I_MRAILI_Get_next_vbuf

GhostingFunctor

The first major one is that libMesh::GhostPointNeighbors::operator() does a full loop over the entire mesh on every processor for every split, so it's an n_procs * n_elements kind of operation. When you're talking about ~20k procs and 1 billion elements... just this piece gets prohibitive. In the above timing you can see that it ends up taking 14% of the runtime - but if you are splitting for many more procs it gets prohibitive fast.

This is really a major problem... it is taking many hours using many nodes of a cluster just to do some of these larger splits...

This is happening because query_ghosting_functors() is used in CheckpointIO::split_mesh()... and it ends up iterating over active_pid_elements() for each split.

One way to speed this up would be to iterate over the mesh once and build a cache/map of pid -> elements... then feed those lists to the ghosting functors as the ranges. Unfortunately, the interface for GhostingFunctor::operator() currently takes a ConstElemRange iterator... not a std::vector::iterator (or whatever data structure we're going to store these things in). Any ideas here on what to do about that? Or is there maybe a better way to speed this up?
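To make the cache idea concrete, here's a minimal sketch of the single pass I have in mind (my own illustration, not existing libMesh code; build_pid_element_cache is a hypothetical helper name):

```cpp
#include <map>
#include <vector>

#include "libmesh/elem.h"
#include "libmesh/id_types.h"
#include "libmesh/mesh_base.h"

using namespace libMesh;

// Hypothetical helper: one pass over the mesh, bucketing active element
// pointers by the processor id that owns them.  The per-pid vectors could
// then be handed to the ghosting functors instead of re-filtering the
// whole mesh once per split.
std::map<processor_id_type, std::vector<Elem *>>
build_pid_element_cache (MeshBase & mesh)
{
  std::map<processor_id_type, std::vector<Elem *>> pid_to_elems;

  for (auto it = mesh.active_elements_begin();
       it != mesh.active_elements_end(); ++it)
    {
      Elem * elem = *it;
      pid_to_elems[elem->processor_id()].push_back(elem);
    }

  return pid_to_elems;
}
```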

BoundaryInfo::boundary_ids()

The next one is BoundaryInfo::boundary_ids(). As you can see in the above timing it's taking up 13% of the total runtime... and is mainly called during the "packing" phase before processor 0 ships off the mesh to all the other processors.

Most of the time is in std::multimap::equal_range(). I am actually struggling with the same problem in my ray-tracing code right now too... where looking up boundary IDs using this function is taking an appreciable amount of the overall runtime. I'm open to ideas on how to mitigate this.

Elem::operator==()

One small one that is just a personal pet peeve is that Elem::operator== does so much memory allocation and deallocation. This is a huge killer for threading... and causes unnecessary slowdown. We should really work to eradicate this kind of thing where temporary vectors are created and destroyed so frequently.
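Just to illustrate the kind of fix I mean, here's an allocation-free sketch of comparing elements by node ids (an assumption about one possible approach, not the current Elem::operator== code; the 27-entry buffer assumes nothing bigger than a HEX27):

```cpp
#include <algorithm>
#include <array>

#include "libmesh/elem.h"
#include "libmesh/libmesh_common.h"

using namespace libMesh;

// Sketch: compare two elements by their (sorted) node ids using fixed-size
// stack buffers instead of heap-allocated temporary vectors.
bool elems_equal_no_alloc (const Elem & a, const Elem & b)
{
  if (a.n_nodes() != b.n_nodes())
    return false;

  const unsigned int nn = a.n_nodes();

  std::array<dof_id_type, 27> ids_a, ids_b;
  libmesh_assert_less_equal (nn, ids_a.size());

  for (unsigned int n = 0; n != nn; ++n)
    {
      ids_a[n] = a.node_id(n);
      ids_b[n] = b.node_id(n);
    }

  std::sort(ids_a.begin(), ids_a.begin() + nn);
  std::sort(ids_b.begin(), ids_b.begin() + nn);

  return std::equal(ids_a.begin(), ids_a.begin() + nn, ids_b.begin());
}
```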

Quad::side_ptr()

A similar one that sucks (but there might be no way around it) is that UnstructuredMesh::find_neighbors() calls Elem::side_ptr() a whole crap ton... which ends up causing tons of memory churn again as the side elements get created and destroyed. Also, it causes a pthread_spin_lock to show up... making it even worse for threading.

Memory

Memory usage is also out of control. It's taking upwards of 40GB of RAM to read this mesh in (when it's a 2.2GB Exodus file). I know that some of this is MooseMesh caching... but I seriously don't understand where the memory is going. I'm going to do some heap profiling with Google Perf Tools and get back to you with some profiling info on what's going on here. This has become a pretty major pain point for us.

Anyway - I hope this isn't overly negative... I'm just learning some new info about all of this and I'd love if we could brainstorm some ideas for mitigating some of it.

pbauman commented 6 years ago

@friedmud this is great info, thanks. We are also noticing slow performance with refinement (which we use to build up the mesh hierarchy for geometric multigrid) that could benefit from optimization considerations like the above.

Could you maybe provide some quick pointers to the profiling tools you used? We've struggled to find ones that we can get to work reliably (on our ancient-OS workstations...) and that give us understandable output, e.g. like the beautiful picture in your post.

roystgnr commented 6 years ago

I hope this isn't overly negative...

In the context of profiling for optimization purposes, I think a "negative result" is "couldn't find anything that can be sped up"; "look at all this slow shit we could make faster" is a positive result!

I'm probably not going to have time to help much until I'm back from vacation; just want to make sure that my silence until then is not interpreted as "Roy thinks this information is useless" or "Roy has been emotionally harmed by this information". ;-)

friedmud commented 6 years ago

@roystgnr thanks for chiming in. I'm doing some more profiling today too... so we will have more info.

@pbauman @rwcarlsen pointed us to Google Perf Tools ( https://github.com/gperftools/gperftools ) and helped us get up and running with it. It wasn't too bad... you might need to hand build libunwind... but that's about it.

Then you add a library to the link line and call a function to start profiling and stop profiling. Pretty simple.
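For reference, the hooks are roughly this (a minimal sketch; the output file name and link flag shown are just standard gperftools usage, adjust to taste):

```cpp
// Build with something like:  mpicxx my_app.C -o my_app -lprofiler
// Inspect afterwards with:    pprof --pdf ./my_app split.prof > profile.pdf
#include <gperftools/profiler.h>

int main (int argc, char ** argv)
{
  // Start writing CPU samples to "split.prof" (illustrative file name);
  // in an MPI run you'd typically put the rank in the file name.
  ProfilerStart("split.prof");

  // ... expensive mesh read / split work goes here ...

  ProfilerStop();
  return 0;
}
```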

@rwcarlsen Can you please do a quick writeup of this in MOOSE docs so we can share this info?

pbauman commented 6 years ago

@friedmud Thanks for the info! @bboutkov, we should have a look at this (after first draft of paper is done!).

roystgnr commented 6 years ago

The "iterating over filtered sets actually expensively iterates over the whole mesh and adds another expensive filter test on top of that" problem is one I've always looked out for but never seen; I guess looping over 20K pids one at a time is pretty much the worst case scenario for it, though.

MeshBase::element_iterator is actually a stupidly flexible (in both the positive and negative senses) class. Check out the fake_elem_it/fake_elem_end "range" in sparsity_pattern.C - that's only over one element, but you ought to be able to create a manual ConstElemRange over an arbitrary vector<Elem*> with as many elements as you want!
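Something like this untested sketch, following that fake_elem_it pattern (the exact iterator/predicate types may need tweaking to match what sparsity_pattern.C does):

```cpp
#include <vector>

#include "libmesh/elem.h"
#include "libmesh/elem_range.h"
#include "libmesh/mesh_base.h"
#include "libmesh/multi_predicates.h"

using namespace libMesh;

// Wrap an arbitrary std::vector<Elem *> in a ConstElemRange by constructing
// mesh-style element iterators over the vector; NotNull acts as a
// pass-through predicate here.
ConstElemRange range_over (const std::vector<Elem *> & elems)
{
  typedef std::vector<Elem *>::const_iterator vec_it;

  MeshBase::const_element_iterator
    begin (elems.begin(), elems.end(), Predicates::NotNull<vec_it>()),
    end   (elems.end(),   elems.end(), Predicates::NotNull<vec_it>());

  return ConstElemRange (begin, end);
}
```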

roystgnr commented 6 years ago

Is now the time to try switching from std::multimap to std::unordered_multimap for boundary ids? We really don't have any reason except easier-with-C++98 history to use the former.
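A rough sketch of what the hashed lookup could look like (SideBCMap and boundary_ids_sketch are illustrative names, not BoundaryInfo's actual internals):

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

#include "libmesh/id_types.h"

namespace libMesh { class Elem; }
using namespace libMesh;

// Stand-in for BoundaryInfo's side->id storage, hashed on the element
// pointer: equal_range becomes an expected O(1) bucket lookup instead of
// an O(log n) tree walk.
typedef std::unordered_multimap<const Elem *,
                                std::pair<unsigned short int, boundary_id_type>>
  SideBCMap;

// Gather the boundary ids attached to side `side` of `elem`, mirroring
// what BoundaryInfo::boundary_ids() has to do internally.
std::vector<boundary_id_type>
boundary_ids_sketch (const SideBCMap & bc_map,
                     const Elem * elem,
                     unsigned short int side)
{
  std::vector<boundary_id_type> ids;

  auto range = bc_map.equal_range(elem);
  for (auto it = range.first; it != range.second; ++it)
    if (it->second.first == side)
      ids.push_back(it->second.second);

  return ids;
}
```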

fdkong commented 6 years ago

@roystgnr Have you had a chance to look at this issue recently? It has become somewhat of a bottleneck for our scaling studies now.

roystgnr commented 6 years ago

You guys are using the src/apps/splitter.C in master? Can you send me a mesh that splits for you in an annoying-but-not-too-annoying amount of time on 128 procs (as well as the --n-procs --num-ghost-layer --ascii options you use)? A smaller 24-proc friendly mesh too would be helpful if you have something handy; I usually can grab 128 rapid-turnaround-time cores for benchmarking purposes but not always.

I'll start with the above two discussed optimizations ASAP but I'd feel more confident if I could measure speedups myself before getting a branch cleaned up to push to you guys.

permcody commented 6 years ago

We aren't using the libMesh splitter, but we are likely calling the same underlying methods. Fande, you'll need to supply a test case. I was under the impression that we were doing a lot better with the initialization stage.


roystgnr commented 6 years ago

If it's the same split_mesh() method then I ought to be able to reproduce the problem with the basic splitter.

permcody commented 6 years ago

https://github.com/idaholab/moose/blob/devel/framework/src/actions/SplitMeshAction.C#L79


fdkong commented 6 years ago

@roystgnr There are a lot of examples in MOOSE that demonstrate this issue. For example, in moose/modules/phase_field/examples/grain_growth:

../../phase_field-opt -i 3D_6000_gr.i Mesh/nx=180 Mesh/ny=180 Mesh/nz=180 --split-file 3D_6000_gr --split-mesh 8192

You could start with a coarse mesh such as Mesh/nx=20 Mesh/ny=20 Mesh/nz=20 --split-mesh 10, and then increase the mesh density and the processor count gradually.

If you have a hard time running the example, I could send you the meshes it generates.

The pre-splitting is inefficient because we simply have an O(n^2) algorithm: for each partition, the whole mesh has to be revisited. If we have 10K partitions, the whole mesh will be revisited 10K times.

friedmud commented 5 years ago

Thanks for all of the optimization work - let me run some new numbers.

roystgnr commented 5 years ago

If #1940 works out well, I'm tempted to close this and #1787 afterwards. The only remaining inefficiency in the list is the boundary_ids() issue, and everything I've tried there (#1875, #1787) has either turned out to break or to be even slower.