Closed: lindsayad closed this issue 2 years ago
@GiudGiud @snschune I generated this call-graph looking at @snschune's SFR inputs
What kind of stencil size do you have here? It looks like `sort_row` is trying to be smarter than `std::inplace_merge` and failing. Maybe for tiny rows it really did run better by avoiding heap allocation, but it's O(N^2) rather than O(N), and I'd expect that to be a disaster for large rows. I'm going to go dig through git history now to try to figure out how I worked on a bubble sort here and didn't scream about it. You might try replacing the implementation with `inplace_merge` and see what happens.
> What kind of stencil size do you have here?

It's pretty large ... the max columns per row is > 53 (I know that because I had to reconfigure the MOOSE AD derivative container size).
> You might try replacing the implementation with `inplace_merge` and see what happens.

I'll try this soon and report back.
And I don't think this is relevant, but each element depends on at least two neighbor layers of elements (maybe even 3), so the element stencil is definitely larger than in most libMesh-based simulations.
I'm seeing an initial `sort_row` implementation from Ben Kirk (as `sort_sparsity_row`, in `dof_map.h`) in May 2005. I bet the row sizes then really were so small that asymptotic problems didn't get triggered.

My own modification was just to clean up some assertions, so at least it might have been laziness rather than stupidity that led me to miss the performance issues right above. But I was specifically working on `handle_vi_vj` for performance reasons, so I don't know how I missed the problem in its subroutines then. IIRC the ill-performing application was using a ton of variables with very sparse coupling between them, so perhaps the individual row sizes in my case just weren't large enough to show up in profiling either.
> I'll try this soon and report back

Thanks. If it works for your case I'm going to throw it at `LIBMESH_BENCHMARK` too; I have a slight fear that switching to the `std::` algorithm with heap allocation might actually make common cases slower, and a great fear that the exact opposite is true and all this time we've been noticeably pessimizing the common cases too...
Going to `inplace_merge` effectively didn't change anything (158 seconds spent in the call to `sort_row` vs. 156 seconds before). Are you sure the algorithms are different? The doxygen that @jwpeterson wrote sure makes it seem like they should be algorithmically similar. Anyway, it seems like maybe we can't do anything in `sort_row` ... which maybe you concluded in the past.
The algorithms sure ought to be different. `inplace_merge` is guaranteed O(N); our algorithm is doing a bubble sort, and even taking advantage of the subsequences being sorted, that's still O(N^2) in the worst case.

I guess we might not be in the worst case? Our `sort_row` is O(1) in the pre-sorted case, which might actually be pretty common in our usage pattern. But it's really weird to see nearly the exact same results from `inplace_merge`. I can see how we might easily be much faster or slower, but I don't see how we'd end up being the same. And in the case where we'd be faster, I'd expect us to be so much faster that something else would be where the most expense is. Maybe the `vector::insert` right before the `sort_row` call, maybe the `equal_range` searches, but something.
If I didn't see the 158 seconds in `std::__merge_adaptive`, I would think that my binary wasn't using the libMesh changes ... but I do see it.
> our algorithm is doing a bubble sort

Hmm, yeah, it clearly says bubble sort even in the comment, doesn't it 😆
So I'm using all CONSTANT MONOMIAL variables ... in that case I'm only ever coming in with 1 dof to add ... so the bubble sort algorithm in that case will be O(N*1), right? That's my hypothesis for the comparable execution times.

So if my reasoning is accurate, and we've got O(N) for both the bubble sort and the `inplace_merge` in this special situation, then that really suggests to me that the algorithmic improvement would have to be somewhere else, if it's even possible. Maybe it's not. This could just be what life is like for these large-element-stencil finite volume methods ... we have way more coupled elements than compact-support finite element methods, and we've got to loop over all of them when building the sparsity pattern.
Would there be any merit in storing the coupled elements as a map sorted by element ID? Then maybe you get away with O(1) sorting most of the time (e.g. potentially don't have to do any swaps)?

In this finite volume case where everything is constant monomial, if the coupled elements were sorted by ID, then, depending on whether you are doing variable grouping, you could do either an outer coupled-element loop with an inner variable loop or vice versa, and then the additions to the sparsity pattern should be ordered.

Going to the sorted map type would probably only make sense for coupling functors 😦 At least in those cases the number of coupled elements is small and insertion shouldn't be all that impactful.
> I'm only ever coming in with 1 dof to add ... so the bubble sort algorithm in that case will be O(N*1), right?

Hmm... that's right. So either way, with N dofs and M randomly-ordered dofs per row, we ought to be seeing O(N*M^2) sparsity pattern build times?
> the algorithmic improvement would have to be somewhere else if it's even possible.

And that seems to be right.

> Maybe it's not.

But that's quitter talk!
> Would there be any merit in storing the coupled elements as a map sorted by element ID?

So then the bubble sort in your case is O(1), and we should see O(N*M) for the row sorting? Storing the map is more expensive than the current unordered map, but all the map storage works out to O(N*log(M)), so we still beat O(N*M^2).

It'll be API-breaking, but it's a really easy API change to make, so it's worth trying first, to see what your case gets out of it.

Less intrusively, we might keep the dof rows in maps, and only sort them into their final vectors at the end; I think that also gives us O(N*log(M))?
`std::map` has a bunch of constant-factor overhead due to its memory access patterns ... we could try using `push_heap` etc. for either case. If we used a comparator to make a min-heap instead of a max-heap, then we could even take advantage of the "inserts are mostly in increasing order" property... I don't think that gets us any asymptotic improvement, though; even if the actual heap insert is usually O(1), heap reallocation is O(log(N)) amortized, and we wouldn't know in advance what to `reserve()` to get around that (though we could probably come up with some pretty good heuristics...).
OK, with `inplace_merge` and using a `std::map` sorted by dof object ID, we went down to 127 seconds spent on the call to `sort_row` from `handle_vi_vj`, which is about a 30-second improvement. Overall time (essentially simulation setup time) went from 281 seconds to 245 seconds.
141 seconds in `handle_vi_vj` -> `sort_row` with the bubble sort ... going to check and see whether I'm actually getting the "already sorted" behavior I expected.
Oops, yeah, I had `identify_variable_groups = false`. Moving to `true`, I now only have 56 seconds in `handle_vi_vj` -> `sort_row` with the ordered map for `coupled_elements` and the bubble sort... which still seems like an egregiously long time, doesn't it? I believe we should just be incrementing the middle iterator once in `sort_row` now, and that should be the only non-comparison operation done in there.
It's disappointing that there isn't really a great option for profiling inlined functions ... you can tell the compiler not to inline the function, but then you've fundamentally altered the performance of the program you're trying to profile. Anyway, call graph for the last setup attached.
> It's disappointing that there isn't really a great option for profiling inlined functions

Have you tried `perf annotate`? IIRC (although I probably do not RC; I'm thinking back to when "oprof" stood for "oprofile"...) there can still be some "leakage" of sample attributions from one line of code to a neighboring line, but you should still get much finer detail than you can get from just the call graph.
> I believe we should just be incrementing the middle iterator once in sort_row now and that should be the only non-comparison operation done in there

A couple of logical operations, a couple of dereferences ... and comparisons can be much more expensive than non-comparisons, screwing up CPU pipelining ... but yeah, 56 seconds seems way too long here. I'd be tempted to throw an abort into that inner loop and make sure you're never actually entering it in your use case.
Damn, yeah, still getting in there 😦

```
      .       .  109:void sort_row (const SparsityPattern::Row::iterator begin,
      .       .  110:               SparsityPattern::Row::iterator middle,
      .       .  111:               const SparsityPattern::Row::iterator end)
      .       .  112:{
  140ms   140ms  113:  if ((begin == middle) || (middle == end)) return;
      .       .  114:
      .       .  115:  libmesh_assert_greater (std::distance (begin, middle), 0);
      .       .  116:  libmesh_assert_greater (std::distance (middle, end), 0);
      .       .  117:  libmesh_assert (std::unique (begin, middle) == middle);
      .       .  118:  libmesh_assert (std::unique (middle, end) == end);
      .       .  119:
  350ms   350ms  120:  while (middle != end)
      .       .  121:    {
      .       .  122:      SparsityPattern::Row::iterator
      .       .  123:        b = middle,
      .       .  124:        a = b-1;
      .       .  125:
      .       .  126:      // Bubble-sort the middle value downward
 22.99s  22.99s  127:      while (!(*a < *b)) // *a & *b are less-than comparable, so use <
      .       .  128:        {
      .  26.38s  129:          std::swap (*a, *b);
      .       .  130:
      .       .  131:#if defined(__GNUC__) && (__GNUC__ < 4) && !defined(__INTEL_COMPILER)
      .       .  132:          /* Prohibit optimization at this point since gcc 3.3.5 seems
      .       .  133:             to have a bug. */
      .       .  134:          SparsityPattern::_dummy_function();
      .       .  135:#endif
      .       .  136:
  2.19s   2.19s  137:          if (a == begin) break;
      .       .  138:
      .       .  139:          b=a;
      .   4.17s  140:          --a;
      .       .  141:        }
      .       .  142:
      .       .  143:      ++middle;
      .       .  144:    }
```
Look at the bright side; this means there's still a chance to optimize!

Assuming we can figure out why it's still getting in there. Even with all the combinations of ways that we can renumber or not renumber elements, we should still end up getting elem DoF indices that are numbered monotonically with increasing element number within each processor's local elements...
So my question is: is an element with a lower ID guaranteed to have a `processor_id` less than or equal to that of the higher-ID element? I'm going to guess the answer is no ... in which case, given our rank-major ordering, I would then understand how we could get into the inner loop and do swapping.
Ah ... yeah, the answer is no. With a `DistributedMesh`, by default we renumber to make that true, but even there it's an option that can be disabled, not a guarantee.

Perhaps you could add a different comparator to the element map? Instead of sorting by element id, sort in dictionary order by (processor id, element id)?

Although I'm wondering if we shouldn't just try keeping the row dofs themselves in a `std::map`. We might be able to get a perfect pre-sorting for incoming elem dof indices, but we'll never do it for vertex/edge/face indices.
Alright, with the PID ordering first, before the element ID ordering, we're all the way down to 9 seconds (actually, was it 18? the graph makes it look more like 18) for `handle_vi_vj` -> `sort_row`. 156 seconds down to 9 seconds ... I'd say that's pretty good! So you think that `SparsityPattern::Row` should be a `set`? (You said `map`, but I don't know what the other member of the pair would be.)
> 156 seconds down to 9 seconds ... I'd say that's pretty good! So you think that SparsityPattern::Row should be a set?

Yeah, "map" was a brain fart. A sorted binary tree, anyway. But I don't know if we could actually change the `Row` typedef without breaking APIs that expect an array, so we might just want temporary sets for each row before we copy the final row into the existing `Row` vector.

That might actually make your current case slower (although I'd still bet on it coming in under 15 seconds, fingers crossed), but it would probably make non-DG cases' performance much more robust.
I'll run the timings, but intuitively I would expect that retaining the old `GhostingFunctor::map_type` and just changing `Row` to a `std::set` will not do much for the general case, at least if we were to consider changing the current bubble sort to `inplace_merge`. Because `std::set::insert` is O(N*log(size() + N)) (N being the number of elements we want to insert), while `inplace_merge` is O(M - 1) when memory is sufficient and O(M log M) when memory is insufficient (where M is the distance from first to last) ... but now, having written it out, I can see that if N is sufficiently small and M is sufficiently large, then the `set` really may be better...
> coming in under 15 seconds

Wow, look at you @roystgnr, you're an oracle ...

```
   80ms  14.65s  177:        row->insert(element_dofs_j.begin(), element_dofs_j.end());
      .       .  178:
      .       .  179:      } // End dofs-of-var-i loop
      .       .  180:    } // End if-dofs-of-var-j
```
Call graph below. So do we think we prefer this change as opposed to changing `GhostingFunctor::map_type`?
Options are:

- `inplace_merge`
For a contact problem (more of a traditional finite element problem), here are the timings for `handle_vi_vj`.

And then summarizing the FV timings:
What are the timings on contact, again? We've got 3 independent options here, right? Starting from "Original" (the empty set of changes {}), we could add A ≡ "Row as set", B ≡ "map_type as proc/dof ordered map", and/or C ≡ "bubble sort -> inplace_merge": 8 different combinations. For FV it looks like {} and {C} are unacceptable, {A} is great, and {B} is better. For contact it looks like {C} is much better than {A}, but I'm not sure how those compare to {}.

And we haven't tried any multiple-change combinations yet, but I'm wondering what {B,C} might look like in both cases. They might just interfere with each other, I'd guess, but I'd love to see the timings to confirm or disprove that.
I will update the above timings box as they come in. I am indeed most curious about the combination of B and C and hence it's what I'm running this instant...
I should say that I don't think A and C are independent; at least, I don't see where the value would be in doing both ... if we are doing an `inplace_merge`, then why use a `set` instead of a `vector`?
My conclusion from looking at these timings would be to go to the ordered map type and change the bubble sort to `inplace_merge`. What do you think?
Oh, yeah, with a `set` the `inplace_merge` doesn't make sense. We might fiddle with different ways of inserting, but we wouldn't really be changing the asymptotics, just the constant.
> My conclusion from looking at these timings would be to go to the ordered map type and change the bubble sort to inplace_merge. What do you think?

That should definitely be plan A. I'd put up a PR and see if anything breaks. It might be a bit of a hassle if there are any ghosting functors downstream that aren't properly flexible about the map type, but the typedef's already there, so it shouldn't be too hard to fix anything that isn't already using it.
This problem has ~29k local dofs (352,080 global dofs on 16 processes). There are 7 constant monomial variables (actually a maximum of 6 on a given subdomain, I believe). I am seeing more time spent in `Build::handle_vi_vj` than I've ever seen before. Attached is the call-graph. Of the 177 seconds spent in the method, 155.78 seconds are spent on the call to `SparsityPattern::sort_row` at the end of the method. I don't think that method shows up in the call-graph explicitly, due to inlining. Are there potentials for algorithmic improvement here? It looks like @roystgnr is the primary author of `handle_vi_vj`, while `sort_row` may have been initially created by @dknez with later mods by @roystgnr and @jwpeterson.