friedmud opened this issue 6 years ago
@friedmud this is great info, thanks. We are also noticing slow performance with refinement (which we use to build up a mesh hierarchy for geometric multigrid) that could benefit from optimization considerations like the above.
Could you maybe provide some quick pointers to the profiling tools you used? We've struggled to find ones that we can get working reliably (on our ancient-OS workstations...) and that give us understandable output, e.g. like the beautiful picture in your post.
I hope this isn't overly negative...
In the context of profiling for optimization purposes, I think a "negative result" is "couldn't find anything that can be sped up"; "look at all this slow shit we could make faster" is a positive result!
I'm probably not going to have time to help much until I'm back from vacation; just want to make sure that my silence until then is not interpreted as "Roy thinks this information is useless" or "Roy has been emotionally harmed by this information". ;-)
@roystgnr thanks for chiming in. I'm doing some more profiling today too... so we will have more info.
@pbauman @rwcarlsen pointed us to Google Perf Tools (https://github.com/gperftools/gperftools) and helped us get up and running with it. It wasn't too bad... you might need to hand-build libunwind... but that's about it.
Then you add a library to the link line and call a function to start profiling and stop profiling. Pretty simple.
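Something like this is all it takes (rough sketch - the output file name here is made up):

```cpp
#include <gperftools/profiler.h>   // from gperftools; link with -lprofiler

int main(int argc, char ** argv)
{
  ProfilerStart("split_mesh.prof");   // start sampling, write samples to this file

  // ... run the code you want profiled (e.g. the mesh split) ...

  ProfilerStop();                     // flush samples and close the profile
  return 0;
}
```

Then you post-process the output with pprof (sometimes installed as google-pprof), e.g. `pprof --pdf ./your-app split_mesh.prof > profile.pdf`, which is what produces the call-graph pictures.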
@rwcarlsen Can you please do a quick writeup of this in MOOSE docs so we can share this info?
@friedmud Thanks for the info! @bboutkov, we should have a look at this (after first draft of paper is done!).
The "iterating over filtered sets actually expensively iterates over the whole mesh and adds another expensive filter test on top of that" problem is one I've always looked out for but never seen; I guess looping over 20K pids one at a time is pretty much the worst case scenario for it, though.
MeshBase::element_iterator is actually a stupidly flexible (in both the positive and negative senses) class. Check out the fake_elem_it/fake_elem_end "range" in sparsity_pattern.C - that's only over one element, but you ought to be able to create a manual ConstElemRange over an arbitrary vector<Elem*> with as many elements as you want!
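Something like this ought to work (untested sketch; the exact iterator and predicate template arguments here are from memory, so double-check them against the current headers):

```cpp
#include "libmesh/elem.h"
#include "libmesh/elem_range.h"
#include "libmesh/mesh_base.h"
#include "libmesh/multi_predicates.h"

#include <vector>

using namespace libMesh;

// Build a ConstElemRange over a hand-picked set of elements, in the spirit of
// the fake_elem_it/fake_elem_end trick in sparsity_pattern.C.
ConstElemRange range_over(std::vector<Elem *> & elems)
{
  // NotNull accepts every non-null entry; the variant_filter_iterator
  // underlying MeshBase::const_element_iterator can wrap any iterator type
  // whose dereference converts to an Elem pointer.
  Predicates::NotNull<std::vector<Elem *>::iterator> pred;

  MeshBase::const_element_iterator
    range_begin (elems.begin(), elems.end(), pred),
    range_end   (elems.end(),   elems.end(), pred);

  return ConstElemRange(range_begin, range_end);
}
```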
Is now the time to try switching from std::multimap to std::unordered_multimap for boundary ids? We really don't have any reason except easier-with-C++98 history to use the former.
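The lookup pattern wouldn't have to change at all; equal_range has the same shape on both containers (std-only sketch with stand-in types, not the real BoundaryInfo maps):

```cpp
#include <unordered_map>

// Stand-in key and payload types; the real BoundaryInfo containers are more involved.
using Key = int;
using BoundaryId = short;

void demo()
{
  // Hashed instead of ordered: average O(1) equal_range instead of O(log n).
  std::unordered_multimap<Key, BoundaryId> ids;
  ids.emplace(42, 1);
  ids.emplace(42, 3);

  // Same loop shape boundary_ids() uses today with std::multimap.
  auto range = ids.equal_range(42);
  for (auto it = range.first; it != range.second; ++it)
    { /* it->second is one boundary id on this key */ }
}
```

The one behavioral difference to watch is that the hashed container hands the matches back in no particular order, which would matter if anything downstream expects sorted boundary id lists.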
@roystgnr Have you had a chance to look at this issue recently? It has become something of a bottleneck for our scaling studies now.
You guys are using the src/apps/splitter.C in master? Can you send me a mesh that splits for you in an annoying-but-not-too-annoying amount of time on 128 procs (as well as the --n-procs --num-ghost-layer --ascii options you use)? A smaller 24-proc-friendly mesh would be helpful too if you have something handy; I can usually grab 128 rapid-turnaround-time cores for benchmarking purposes, but not always.
I'll start on the two optimizations discussed above ASAP, but I'd feel more confident if I could measure speedups myself before getting a branch cleaned up to push to you guys.
We aren't using the libMesh splitter, but we are likely calling the same underlying methods. Fande, you'll need to supply a test case. I was under the impression that we were doing a lot better with the initialization stage.
If it's the same split_mesh() method then I ought to be able to reproduce the problem with the basic splitter.
https://github.com/idaholab/moose/blob/devel/framework/src/actions/SplitMeshAction.C#L79
@roystgnr There are a lot of examples in MOOSE that demonstrate this issue. For example, in moose/modules/phase_field/examples/grain_growth:
../../phase_field-opt -i 3D_6000_gr.i Mesh/nx=180 Mesh/ny=180 Mesh/nz=180 --split-file 3D_6000_gr --split-mesh 8192
You could start with a coarse mesh, such as Mesh/nx=20 Mesh/ny=20 Mesh/nz=20 --split-mesh 10, and then increase the mesh density and the processor count gradually.
If you have a hard time running the example, I could send you the meshes it generates.
The pre-splitting is slow because we effectively have an O(n_procs * n_elements) algorithm: for each partition, the whole mesh has to be revisited. With 10K partitions, the whole mesh gets revisited 10K times.
Thanks for all of the optimization work - let me run some new numbers.
If #1940 works out well, I'm tempted to close this and #1787 afterwards. The only remaining inefficiency in the list is the boundary_ids() issue, and everything I've tried there (#1875, #1787) has either turned out to break or to be even slower.
I'm currently splitting a 2.2GB Exodus mesh with about 55 Million elements using 48 MPI procs. I'm splitting it a few ways: ~9000 procs, ~500 procs, ~100 procs, etc.
Doing so has uncovered some pretty large inefficiencies that I'd appreciate some discussion on.
Here is a profile run (using Google Perf Tools - showing just processor 0) for splitting it for 144 procs:
And here are the "top" functions from that profiling:
GhostingFunctor
The first major one is that libMesh::GhostPointNeighbors::operator() is a full loop over the full mesh on every processor for every split. So it's an n_procs * n_elements kind of operation, and when you're talking about ~20k procs and 1 billion elements... just this piece gets prohibitive. In the above timing you can see that it ends up taking 14% of the runtime - but if you are splitting for many more procs it gets prohibitive fast. This is really a major problem... it is taking many hours using many nodes of a cluster just to do some of these larger splits...
This is happening because query_ghosting_functors() is used in CheckpointIO::split_mesh()... and it ends up iterating over active_pid_elements() for each split. One way to speed this up would be to iterate over the mesh once and build a cache/map of pid -> elements... then feed those lists to the ghosting functors as the ranges. Unfortunately, the interface for GhostingFunctor::operator() currently takes a ConstElemRange iterator... not a std::vector::iterator (or whatever data structure we're going to store these things in). Any ideas here on what to do about that? Or is there maybe a better way to speed this up?
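For concreteness, the cache I have in mind is just a single O(n_elements) pass like this (rough sketch only - the function name is made up, and handing the per-pid vectors to the ghosting functors is exactly the interface question above):

```cpp
#include "libmesh/elem.h"
#include "libmesh/id_types.h"
#include "libmesh/mesh_base.h"

#include <unordered_map>
#include <vector>

using namespace libMesh;

// One pass over the mesh, grouping active elements by processor id, instead
// of re-filtering the whole mesh with active_pid_elements() once per split.
std::unordered_map<processor_id_type, std::vector<const Elem *>>
build_pid_to_elems(const MeshBase & mesh)
{
  std::unordered_map<processor_id_type, std::vector<const Elem *>> pid_to_elems;

  for (const Elem * elem : mesh.active_element_ptr_range())
    pid_to_elems[elem->processor_id()].push_back(elem);

  // Each pid_to_elems[p] could then be wrapped in a range and fed to the
  // ghosting functors for split p.
  return pid_to_elems;
}
```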
BoundaryInfo::boundary_ids()
The next one is BoundaryInfo::boundary_ids(). As you can see in the above timing it's taking up 13% of the total runtime... and is mainly called during the "packing" phase before processor 0 ships the mesh off to all the other processors.
Most of the time is in std::multimap::equal_range(). I am actually struggling with the same problem in my ray-tracing code right now too... where looking up boundary IDs through this function takes an appreciable fraction of the overall runtime. I'm open to ideas on how to mitigate this.
Elem::operator==()
One small one that is just a personal pet peeve is that Elem::operator== does so much memory allocation and deallocation. This is a huge killer for threading... and causes unnecessary slowdown. We should really work to eradicate this kind of thing, where temporary vectors are created and destroyed so frequently.
Quad::side_ptr()
A similar one that sucks (but there might be no way around it) is that UnstructuredMesh::find_neighbors() calls Elem::side_ptr() a whole crap ton... which ends up causing tons of memory churn again as the SideElems get created and destroyed. Also: it causes a pthread_spin_lock to show up... making it even worse for threading.
Memory
Memory usage is also out of control. It's taking upwards of 40GB of RAM to read this mesh in (when it's a 2.2GB Exodus file). I know that some of this is MooseMesh caching... but I seriously don't understand where the memory is going. I'm going to do some heap profiling with Google Perf Tools and get back to you with some profiling info on what's going on here. This has become a pretty major pain point for us.
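In case the recipe is useful to anyone else, the heap profiler hookup is about as simple as the CPU one (sketch - the prefix and wrapper function here are made up):

```cpp
#include <gperftools/heap-profiler.h>   // from gperftools; link against -ltcmalloc

void profile_mesh_read()
{
  HeapProfilerStart("mesh_read");       // writes mesh_read.0001.heap, ... as usage grows

  // ... read / prepare the mesh here ...

  HeapProfilerDump("after mesh read");  // force a dump at a point of interest
  HeapProfilerStop();
}
```

The .heap files go through pprof the same way the CPU profiles do.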
Anyway - I hope this isn't overly negative... I'm just learning some new info about all of this and I'd love it if we could brainstorm some ideas for mitigating some of it.