Closed: lindsayad closed this issue 2 years ago
@GiudGiud @snschune I generated this call-graph looking at @snschune's SFR inputs
What kind of stencil size do you have here? It looks like `sort_row` is trying to be smarter than `std::inplace_merge` and failing. Maybe for tiny rows it really did run better by avoiding heap allocation, but it's O(N^2) rather than O(N), and I'd expect that to be a disaster for large rows. I'm going to go dig through git history now to try to figure out how I worked on a bubble sort here and didn't scream about it. You might try replacing the implementation with `inplace_merge` and see what happens.
> What kind of stencil size do you have here?

It's pretty large ... the max columns per row is > 53 (I know that because I had to reconfigure the MOOSE AD derivative container size).
> You might try replacing the implementation with `inplace_merge` and see what happens.

I'll try this soon and report back.
And I don't think this is relevant, but each element depends on at least two neighbor layers of elements (maybe even 3), so the element stencil is definitely larger than in most libMesh-based simulations.
I'm seeing an initial `sort_row` implementation from Ben Kirk (as `sort_sparsity_row`, in `dof_map.h`) in May 2005. I bet the row sizes then really were so small that asymptotic problems didn't get triggered.

My own modification was just to clean up some assertions, so at least it might have been laziness rather than stupidity that led me to miss the performance issues right above. But I was specifically working on `handle_vi_vj` for performance reasons, so I don't know how I missed the problem in its subroutines then. IIRC the ill-performing application was using a ton of variables with very sparse coupling between them, so perhaps the individual row sizes in my case just weren't large enough to show up in profiling either.
> I'll try this soon and report back

Thanks. If it works for your case I'm going to throw it at `LIBMESH_BENCHMARK` too; I have a slight fear that switching to the `std::` algorithm with heap allocation might actually make common cases slower, and a great fear that the exact opposite is true and all this time we've been noticeably pessimizing the common cases too...
Going to `inplace_merge` effectively didn't change anything (158 seconds spent in the call to `sort_row` vs. 156 seconds before). Are you sure the algorithms are different? The doxygen that @jwpeterson wrote sure makes it seem like they should be algorithmically similar. Anyway, it seems like maybe we can't do anything in `sort_row` ... which maybe you concluded in the past.
The algorithms sure ought to be different. `inplace_merge` is guaranteed O(N); our algorithm is doing a bubble sort, and even taking advantage of the subsequences being sorted, that's still O(N^2) in the worst case.

I guess we might not be in the worst case? Our `sort_row` is O(1) in the pre-sorted case, which might actually be pretty common in our usage pattern. But it's really weird to see nearly the exact same results from `inplace_merge`. I can see how we might easily be much faster or slower, but I don't see how we'd end up being the same. And in the case where we'd be faster, I'd expect us to be so much faster that something else would be where the most expense is. Maybe the `vector::insert` right before the `sort_row` call, maybe the `equal_range` searches, but something.
If I didn't see the 158 seconds in `std::__merge_adaptive`, I would think that my binary wasn't using the libMesh changes ... but I do see it.
> our algorithm is doing a bubble sort

Hmm, yeah, it clearly says bubble sort even in the comment, doesn't it 😆
So I'm using all CONSTANT MONOMIAL variables ... in that case I'm only ever coming in with 1 dof to add ... so the bubble sort algorithm in that case will be O(N*1), right? That's my hypothesis for the comparable execution times.

So if my reasoning is accurate, and we've got O(N) for both the bubble sort and the `inplace_merge` in this special situation, then that really suggests to me that the algorithmic improvement would have to be somewhere else, if it's even possible. Maybe it's not. This could just be what life is like for these large-element-stencil finite volume methods ... we have way more coupled elements than compact-support finite element methods, and we've got to loop over all of them when building the sparsity pattern.
Would there be any merit in storing the coupled elements as a map sorted by element ID? Then maybe you get away with O(1) sorting most of the time (e.g. potentially don't have to do any swaps)?

In this finite volume case where everything is constant monomial, if the coupled elements were sorted by ID, then, depending on whether you are doing variable grouping, you could do either an outer coupled-element loop with an inner variable loop or vice versa, and then the additions to the sparsity pattern should be ordered.

Going to the sorted map type would probably only make sense for coupling functors 😦 At least in those cases the number of coupled elements is small and insertion shouldn't be all that impactful.
> I'm only ever coming in with 1 dof to add ... so the bubble sort algorithm in that case will be O(N*1), right?

Hmm... that's right. So either way, with N dofs and M randomly-ordered dofs per row, we ought to be seeing O(N*M^2) sparsity pattern build times?
> the algorithmic improvement would have to be somewhere else if it's even possible.

And that seems to be right.

> Maybe it's not.

But that's quitter talk!
> Would there be any merit in storing the coupled elements as a map sorted by element ID?

So then the bubble sort in your case is O(1), and we should see O(N*M) for the row sorting? Storing the map is more expensive than the current unordered map, but all the map storage works out to O(N*log(M)), so we still beat O(N*M^2).

It'll be API-breaking, but it's a really easy API change to make, so it's worth trying first, to see what your case gets out of it.

Less intrusively, we might keep the dof rows in maps, and only sort them into their final vectors at the end; I think that also gives us O(N*log(M))?
`std::map` has a bunch of constant-factor overhead due to its memory access patterns ... we could try using `push_heap` etc. for either case. If we used a comparator to make a min-heap instead of a max-heap, then we could even take advantage of the "inserts are mostly in increasing order" property... I don't think that gets us any asymptotic improvement, though; even if the actual heap insert is usually O(1), heap reallocation is O(log(N)) amortized, and we wouldn't know in advance what to `reserve()` to get around that (though we could probably come up with some pretty good heuristics...).
OK, with `inplace_merge` and using a `std::map` sorted by dof object ID, we went down to 127 seconds spent on the call to `sort_row` from `handle_vi_vj`, which is about a 30-second improvement. Overall time (essentially simulation setup time) went from 281 seconds to 245 seconds.
141 seconds in `handle_vi_vj` -> `sort_row` with the bubble sort ... going to check and see whether I'm actually getting the "already sorted" behavior I expected.
Oops, yeah, I had `identify_variable_groups = false`. Moving to `true`, I now only have 56 seconds in `handle_vi_vj` -> `sort_row` with the ordered map for `coupled_elements` and the bubble sort... which still seems like an egregiously long time, doesn't it? I believe we should just be incrementing the middle iterator once in `sort_row` now, and that should be the only non-comparison operation done in there.
It's disappointing that there isn't really a great option for profiling inlined functions ... you can tell the compiler not to inline the function, but then you've fundamentally altered the performance of the program you're trying to profile. Anyway, call graph for the last setup attached.
> It's disappointing that there isn't really a great option for profiling inlined functions

Have you tried `perf annotate`? IIRC (although I probably do not RC; I'm thinking back to when "oprof" stood for "oprofile"...) there can still be some "leakage" of sample attributions from one line of code to a neighboring line, but you should still get much finer detail than you can get from just the call graph.
> I believe we should just be incrementing the middle iterator once in sort_row now and that should be the only non-comparison operation done in there

A couple of logical operations, a couple of dereferences ... and comparisons can be much more expensive than non-comparisons, screwing up CPU pipelining ... but yeah, 56 seconds seems way too long here. I'd be tempted to throw an abort into that inner loop and make sure you're never actually entering it in your use case.
Damn, yeah, still getting in there 😦

```
      .       .  109:void sort_row (const SparsityPattern::Row::iterator begin,
      .       .  110:               SparsityPattern::Row::iterator middle,
      .       .  111:               const SparsityPattern::Row::iterator end)
      .       .  112:{
  140ms   140ms  113:  if ((begin == middle) || (middle == end)) return;
      .       .  114:
      .       .  115:  libmesh_assert_greater (std::distance (begin, middle), 0);
      .       .  116:  libmesh_assert_greater (std::distance (middle, end), 0);
      .       .  117:  libmesh_assert (std::unique (begin, middle) == middle);
      .       .  118:  libmesh_assert (std::unique (middle, end) == end);
      .       .  119:
  350ms   350ms  120:  while (middle != end)
      .       .  121:    {
      .       .  122:      SparsityPattern::Row::iterator
      .       .  123:        b = middle,
      .       .  124:        a = b-1;
      .       .  125:
      .       .  126:      // Bubble-sort the middle value downward
 22.99s  22.99s  127:      while (!(*a < *b)) // *a & *b are less-than comparable, so use <
      .       .  128:        {
      .  26.38s  129:          std::swap (*a, *b);
      .       .  130:
      .       .  131:#if defined(__GNUC__) && (__GNUC__ < 4) && !defined(__INTEL_COMPILER)
      .       .  132:          /* Prohibit optimization at this point since gcc 3.3.5 seems
      .       .  133:             to have a bug. */
      .       .  134:          SparsityPattern::_dummy_function();
      .       .  135:#endif
      .       .  136:
  2.19s   2.19s  137:          if (a == begin) break;
      .       .  138:
      .       .  139:          b=a;
      .   4.17s  140:          --a;
      .       .  141:        }
      .       .  142:
      .       .  143:      ++middle;
      .       .  144:    }
```
Look at the bright side; this means there's still a chance to optimize!

Assuming we can figure out why it's still getting in there. Even with all the combinations of ways that we can renumber or not renumber elements, we should still end up getting elem DoF indices that are numbered monotonically with increasing element number within each processor's local elements...
So my question is: is an element with a lower ID guaranteed to have a `processor_id` less than or equal to that of the higher-ID element? I'm going to guess the answer is no ... in which case, given our rank-major ordering, I would then understand how we could get into the inner loop and do swapping.
Ah ... yeah, the answer is no. With a `DistributedMesh`, by default we renumber to make that true, but even there it's an option that can be disabled, not a guarantee.

Perhaps you could add a different comparator to the element map? Instead of sorting by element id, sort in dictionary order by (processor id, element id)?

Although I'm wondering if we shouldn't just try keeping the row dofs themselves in a `std::map`. We might be able to get a perfect pre-sorting for incoming elem dof indices, but we'll never do it for vertex/edge/face indices.
Alright, with the PID ordering first, before the element ID ordering, we're all the way down to 9 seconds (actually, was it 18? the graph makes it look more like 18) for `handle_vi_vj` -> `sort_row`. 156 seconds down to 9 seconds ... I'd say that's pretty good! So you think that `SparsityPattern::Row` should be a `set`? (You said `map`, but I don't know what the other member of the pair would be.)
> 156 seconds down to 9 seconds ... I'd say that's pretty good! So you think that SparsityPattern::Row should be a set?

Yeah, "map" was a brain fart. A sorted binary tree, anyway. But I don't know if we could actually change the `Row` typedef without breaking APIs that expect an array, so we might just want temporary sets for each row before we copy the final row into the existing `Row` vector.

That might actually make your current case slower (although I'd still bet on it coming in under 15 seconds, fingers crossed), but it would probably make non-DG cases' performance much more robust.
I'll run the timings, but intuitively I would expect that retaining the old `GhostingFunctor::map_type` and just changing `Row` to a `std::set` will not do much for the general case, at least if we were to consider changing the current bubble sort to `inplace_merge`. Because `std::set::insert` is O(N*log(size() + N)) (N being the number of elements we want to insert), while `inplace_merge` is O(M - 1) when memory is sufficient and O(M log M) when memory is insufficient (where M is the distance from first to last) ... but now, having written it out, I can see that if N is sufficiently small and M is sufficiently large, then the `set` really may be better...
> coming in under 15 seconds

Wow, look at you @roystgnr, you're an oracle ...

```
   80ms  14.65s  177:        row->insert(element_dofs_j.begin(), element_dofs_j.end());
      .       .  178:
      .       .  179:      } // End dofs-of-var-i loop
      .       .  180:    } // End if-dofs-of-var-j
```
Call graph below. So do we think we prefer this change as opposed to changing `GhostingFunctor::map_type`?
Options are:

- `inplace_merge`
For a contact problem (more of a traditional finite element problem), here are the timings for `handle_vi_vj`.

And then summarizing the FV timings:
What are the timings on contact, again? We've got 3 independent options here, right? Starting from "Original" (the empty set of changes {}), we could add A ≡ "Row as set", B ≡ "map_type as proc/dof ordered map", and/or C ≡ "bubble sort -> inplace_merge": 8 different combinations. For FV it looks like {} and {C} are unacceptable, {A} is great, and {B} is better. For contact it looks like {C} is much better than {A}, but I'm not sure how those compare to {}.

And we haven't tried any multiple-change combinations yet, but I'm wondering what {B,C} might look like in both cases. They might just interfere with each other, I'd guess, but I'd love to see the timings to confirm or disprove that.
I will update the above timings box as they come in. I am indeed most curious about the combination of B and C and hence it's what I'm running this instant...
I should say that I don't think A and C are independent; at least, I don't see where the value would be in doing both ... if we are doing an `inplace_merge`, then why use a `set` instead of a `vector`?
My conclusion from looking at these timings would be to go to the ordered map type and change the bubble sort to `inplace_merge`. What do you think?
Oh, yeah, with a `set` the `inplace_merge` doesn't make sense. We might fiddle with different ways of inserting, but we wouldn't really be changing the asymptotics, just the constant.
> My conclusion from looking at these timings would be to go to the ordered map type and change the bubble sort to inplace_merge. What do you think?

That should definitely be plan A. I'd put up a PR and see if anything breaks. It might be a bit of a hassle if there are any ghosting functors downstream that aren't properly flexible about the map type, but the typedef's already there, so it shouldn't be too hard to fix anything that isn't already using it.
This problem has ~29k local dofs (352,080 global dofs on 16 processes). There are 7 constant monomial variables (actually a maximum of 6 on a given subdomain, I believe). I am seeing more time spent in `Build::handle_vi_vj` than I've ever seen before. Attached is the call-graph. Of the 177 seconds spent in the method, 155.78 seconds are spent on the call to `SparsityPattern::sort_row` at the end of the method. I don't think that method shows up in the call-graph explicitly, due to inlining. Are there potentials for algorithmic improvement here? It looks like @roystgnr is the primary author of `handle_vi_vj`, while `sort_row` may have been initially created by @dknez with later mods by @roystgnr and @jwpeterson.