roystgnr / MetaPhysicL

Metaprogramming and operator-overloaded classes for numerical simulations

Improve DynamicSparseNumberFoo performance #1

Open roystgnr opened 9 years ago

roystgnr commented 9 years ago

The good news: we now have a working class that gives O(N_nonzero) instead of O(N) performance. This could be a big deal for https://github.com/libantioch/antioch sensitivity applications where our eventual N/N_nonzero will be ~1000.

The bad news: @friedmud reports that the current implementation has a constant-factor performance overhead of roughly 1000x (based on profiling https://github.com/idaholab/moose/pull/5661; according to the profiler the cost is mostly new/delete). He fixed the obvious flaws (passing by value to the sparsity operations) to little effect.

So let's brainstorm ideas. I'll sort what I've got so far by increasing level of difficulty IMHO:

PBR: Pass-by-reference in the sparsity operations (even if it's not a big win it's still better than nothing)

RI: Use reverse iteration to do in-place sparsity operations rather than creating temporary merged_foo vectors (a rough sketch follows this list).

RV: Add C++11 rvalue operations so we can steal doomed input arguments' allocations where available.

CA: Give a custom allocator to the underlying std::vector.

CC: Replace std::vector with a custom container that keeps the first O(N_nonzero) elements on the stack and only hits the heap for larger cases.

ET: Use expression templates to postpone evaluations until we can do more of them at once without creating so many intermediate temporaries.
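
To make the RI idea concrete, here's a rough sketch of an in-place, reverse-iteration sparsity union. The function name is hypothetical and it operates on free-standing index/data vectors as stand-ins for a DSNA's members; it only merges the sparsity pattern, and the other operand's values would be combined in a separate (or fused) pass:

```cpp
#include <cstddef>
#include <vector>

// Merge the sparsity of "other_indices" into (indices, data) in place.
// Assumes both index lists are sorted and duplicate-free, as DSNA requires.
template <typename T, typename I>
void sparsity_union_inplace(std::vector<I> & indices,
                            std::vector<T> & data,
                            const std::vector<I> & other_indices)
{
  // Pass 1: count how many of the other operand's indices are new to us.
  std::size_t extra = 0;
  for (std::size_t i = 0, j = 0; j < other_indices.size();)
  {
    if (i == indices.size() || other_indices[j] < indices[i]) { ++extra; ++j; }
    else if (indices[i] < other_indices[j]) ++i;
    else { ++i; ++j; }
  }

  const std::size_t old_size = indices.size();
  indices.resize(old_size + extra);
  data.resize(old_size + extra);   // new slots are value-initialized (zero)

  // Pass 2: merge from the back, so each existing entry is moved to its
  // final slot before that slot's old contents are needed again.
  std::size_t out = indices.size();
  std::size_t i = old_size, j = other_indices.size();
  while (j > 0)
  {
    --out;
    if (i > 0 && indices[i-1] >= other_indices[j-1])
    {
      const bool match = (indices[i-1] == other_indices[j-1]);
      indices[out] = indices[i-1];
      data[out] = data[i-1];
      --i;
      if (match)
        --j;
    }
    else
    {
      indices[out] = other_indices[j-1];
      data[out] = T(0);   // value to be filled in by the caller's arithmetic
      --j;
    }
  }
}
```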

friedmud commented 9 years ago

1c31b25 cuts residual computation time in half by simply trying to NOT compute derivatives during residual computation... but I'm sure it's not quite architected right (I'm still not great at template metaprogramming).

friedmud commented 9 years ago

If you completely turn off derivative calculation (by replacing the two if() statements in 1c31b25 with if(false)), the residual computation is only about 2x slower (7 s vs. 3.5 s) when using NumberArray.

It does also help when using DynamicSparseNumberArray... but unfortunately it just makes the time 49 seconds instead of 186!

friedmud commented 9 years ago

BTW: When using NumberArray the amount of storage set aside for derivatives has a HUGE impact on the runtime.

- 10: 5 seconds
- 100: 6 seconds
- 1000: 32 seconds
- 10000: too long for me to wait

friedmud commented 9 years ago

That timing was with derivative computation off. Here it is with it on:

- 10: 5.5 seconds
- 100: 26 seconds
- 1000: 286 seconds

friedmud commented 9 years ago

Just trying to get to know the data structures here... I implemented something called FriedmudArray that's backed by a std::map... that was an incredible failure (as you might imagine!). It came in at around 300 seconds.

(You can see this abomination here: c2d7dc890378e12ea1c840e6926f8419c5c763c7 )

Once again though: the problem is allocation / deallocation of the map. It's definitely just that we're making too many temporaries... but I don't know exactly what to do about it.

Like @roystgnr outlined... it's probably going to take either ET or RV. Neither of which is doable by me.

roystgnr commented 9 years ago

PBR and RI are implemented as of 1254b554 - the combination of which appears to do exactly jack for performance. Going to take a crack at RV next, but only after some perf runs on meta to see what I'm misunderstanding.

friedmud commented 9 years ago

What do you think about my attempts at removing derivative computation ( 1c31b2586ec7486546eb4d11b7c2a1ac980ffe33 )? I have a few more local commits that do even more... but to actually make it work well I think we need to change the architecture a bit.

I was thinking that we may need to store _derivatives as a pointer... and just not even initialize it at all unless the DualNumber is supposed to be doing derivative calculations. This would mean that we need to guard _derivatives a bit more (never return a raw reference to it)... adding a bit more interface to DualNumber to allow you to get/set derivative entries (which would do nothing if that DualNumber is not supposed to be doing anything with derivatives).

This would keep us from ever creating (or destroying) the derivatives storage in the case where we're computing the residual....
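
A hypothetical sketch of that pointer-held-derivatives idea (not the actual DualNumber interface): the derivative container is only allocated the first time a derivative entry is actually set, and all access goes through guarded get/set methods. The class name is made up here, and D is assumed to be a derivative container supporting operator[].

```cpp
#include <memory>

// Hypothetical "LazyDualNumber": derivative storage lives behind a pointer
// and is only allocated the first time a derivative entry is actually set.
template <typename T, typename D>
class LazyDualNumber
{
public:
  LazyDualNumber(const T & value = T()) : _value(value) {}

  const T & value() const { return _value; }
  T & value() { return _value; }

  // Guarded read: an object that never computed derivatives reports zero.
  T get_derivative(unsigned int i) const
  { return _derivatives ? T((*_derivatives)[i]) : T(0); }

  // Guarded write: allocates the container only on first use.
  void set_derivative(unsigned int i, const T & d)
  {
    if (!_derivatives)
      _derivatives.reset(new D());
    (*_derivatives)[i] = d;
  }

private:
  T _value;
  std::unique_ptr<D> _derivatives;  // null => residual-only, no allocation
};
```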

roystgnr commented 9 years ago

Honestly? IMHO the way to avoid doing derivative computations with DualNumber is "don't pass in a DualNumber". If someone's code is sufficiently data type independent that you set T=DualNumber in the first place, it's usually generic enough that you can do one instantiation with T=DualNumber and another with T=double.

Otherwise we're tossing "if (_derivatives)" tests into every operation, which can't be good for performance (although I admit it'd probably be slightly better than the implicit "if (!derivatives.empty())" equivalent you'd get from dragging empty DynamicSparseNumberArray objects around). I'd much rather create a new "PotentiallyDualNumber" class than make such frequent branch instructions unavoidable.

roystgnr commented 9 years ago

We're still spending 52% of runtime in new/delete. More informatively, we're seeing 6 new/delete calls per computeQpResidual() call in Meta.

The dot product x*x+y*y+z*z would naively force construction of temporaries for xx, yy, zz, xxpyy, and xxpyypzz, then maybe a copy off the stack if the RVO can't be applied (although I'm not sure why it can't be, when it's just being used to initialize a temporary before temporary.value() is called...). This I guess accounts for all 6 calls.

With my move operations, on the other hand, the xxpyy temporary should swipe xx's memory, then be swiped in turn by xxpyypzz. If I understand C++11 correctly the return should be done via move constructor even if RVO doesn't apply...

But what I'm getting at here is that:

  1. I clearly don't understand C++11 correctly or I'd be using only half as many copy constructions as valgrind is seeing. Fortunately this should be easy to hunt down.
  2. Even if move constructors were working they'd only be fixing half the problem here (thanks to the surfeit of non-rvalue inputs in this basic operation), and we really need to be using a custom allocator to fix the other half; that could make the difference between another 33% speedup and another 100%.
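
For illustration, a minimal sketch (not the actual MetaPhysicL code) of the intended move path: an rvalue overload of operator+ steals the doomed temporary's storage, so x*x + y*y can reuse the x*x allocation instead of allocating fresh vectors for the sum.

```cpp
#include <utility>
#include <vector>

// Stand-in for a DSNA-like type: two heap-backed vectors per object.
struct SparseTemp
{
  std::vector<unsigned int> indices;
  std::vector<double> data;

  SparseTemp() = default;
  SparseTemp(const SparseTemp &) = default;   // copies both allocations
  SparseTemp(SparseTemp &&) = default;        // steals both allocations
  SparseTemp & operator=(SparseTemp &&) = default;

  SparseTemp & operator+=(const SparseTemp & other)
  {
    // ... in-place sparsity union and add, omitted for brevity ...
    (void)other;
    return *this;
  }
};

// Rvalue overload: the doomed temporary's storage is moved into the result,
// so e.g. (x*x) + y*y reuses the x*x allocation instead of making a new one.
inline SparseTemp operator+(SparseTemp && a, const SparseTemp & b)
{
  SparseTemp result(std::move(a));
  result += b;
  return result;   // returned via move (or elided), never deep-copied
}
```
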
friedmud commented 9 years ago

Yes, yes.... all true computer scientists would say that the thing to do is template this up the wazoo to avoid the derivative computation....

I've gone over this in my head and I can't find a way to do that without completely infecting MOOSE and destroying the simple C++ we present to our users. This is the main reason why MOOSE hasn't had an AD capability from the beginning. Every time I looked at it, and what it did to the code, I ran screaming for the hills.

The biggest issue with templating the world is that it would necessitate two copies of everything: two copies of every object (one instance with T=double, one with T=DualNumber). That includes every material property... including stateful material properties (which are stored at every quadrature point). Objects that hold a lot of data (like Neutronics cross-sections) would have that data multiplied by two as well.

Yuck... just yuck. I would be ok with two copies of objects if it didn't mean two copies of data.

I mean, think about the Mesh. Like I mentioned to you, I would like to actually store the mesh coordinates as DualNumbers so that we could compute derivatives with respect to displacements properly... but if you go this templating route you would have to have two copies of your mesh! If you could just turn the derivatives "off" when you don't need them, that goes away.

Some of it is due to the way MOOSE is designed with inherited interfaces that serve up data references. It's not as simple as just templating computeQpResidual()... because all of the data feeding your residual calculation would also need to be templated... so the whole class would have to be templated and every class all the way down the inheritance tree would have to be templated as well. Yuck.

Interestingly this is actually one of those cases where a non-compiled code (like Python) can actually be more optimal. You could actually change what DualNumber does on the fly without needing to have an if() statement (you could just replace the routine with one that doesn't do anything with derivatives... at runtime). In a sense, this is similar to what the PETSc guys do... they have "dynamic objects" where they can replace function pointers on the fly to do things differently. It's one of the reasons why they don't like C++ and stick with C... because C++ is more rigid in this regard.

Doing this at runtime with C++ in MOOSE... I don't see any other option other than using an if(). But maybe I haven't thought about it long enough.

How difficult would a PotentiallyDualNumber be to implement?

I understand that it wouldn't be "optimal" in the computer science sense... but if we can get the runtime hit down to 2x or less over the "normal" MOOSE code then it will be viable.

On other topics... would it help if we use R-value reference returns for computeQpResidual()? That would only avoid one extra copy per call - but that is still a lot. We could definitely do that (and require C++11 for this new capability).

friedmud commented 9 years ago

Just to throw some more blasphemy in here... I'm actually thinking that a global (!) variable like MetaPhysicL::compute_derivatives would be useful. I could set it to true during Jacobian computation and false for residual computation.

Trying to set it per-variable is actually REALLY tricky.

I know that sounds insane... but I've been over this quite a bit in my head and it would actually be cleaner that way.

roystgnr commented 9 years ago

Updates:

I do understand C++11 moves correctly after all. I'm just seeing 6 allocations instead of 3 because I was forgetting that each of the three remaining DSNA copies does a separate allocation for _data and _indices. Trudging through in the debugger, the moves are all working fine.

There's no simple way to turn that 6 allocations into 3 via STL without using a vector-of-pairs, which would kill any chance at SIMD operations. I can't swear that'd be a bad idea, though, not when our memory allocation is currently twice as expensive as our floating point operations.

Using pointers and a custom static allocator instead of STL might be a better idea anyways. The extra complexity of having to replace "= default" methods with custom code doesn't look any worse than the extra complexity of the STL allocator interface.

The C++ way to change function pointers on the fly is with inheritance. A vtable itself isn't so bad, but the fact that you're using function pointers at all (whether in C or C++) prevents inlining, which for small function bodies makes baby inner loops cry. Great idea for PETSc vectors where each function body is O(N_dofs), not so great here where we were hoping for O(1).

Non-compiled codes can theoretically be better at this stuff with a smart enough JIT backend. Last I checked Python didn't qualify but if you sacrificed a goat you had a chance with Java.

We should test DualNumber<Real,empty-DSNA> and see how it performs before creating a PotentiallyDualNumber.

PotentiallyDualNumber would be little more than a cut-and-paste-and-tweak job; tedious but not hard.

We wouldn't want a MetaPhysicL::compute_derivatives global, but a static bool PotentiallyDualNumber::compute_derivatives is almost certainly a better idea than a non-static for your use case.
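
A hedged sketch of what such a static flag might look like, using the hypothetical PotentiallyDualNumber name from this thread and assuming the derivative type D supports scalar multiplication and addition (as NumberArray / DSNA do). Note the flag is per template instantiation, which is the caveat raised a couple of comments below.

```cpp
// Hypothetical PotentiallyDualNumber with a class-level switch.
template <typename T, typename D>
class PotentiallyDualNumber
{
public:
  static bool compute_derivatives;   // one flag per template instantiation

  PotentiallyDualNumber(const T & value = T()) : _value(value) {}

  PotentiallyDualNumber operator*(const PotentiallyDualNumber & other) const
  {
    PotentiallyDualNumber result(_value * other._value);
    if (compute_derivatives)         // skipped entirely during residuals
      result._derivatives = _derivatives * other._value +
                            other._derivatives * _value;
    return result;
  }

private:
  T _value;
  D _derivatives;
};

template <typename T, typename D>
bool PotentiallyDualNumber<T, D>::compute_derivatives = true;
```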

roystgnr commented 9 years ago

Testing DualNumber<Real,empty-DSNA> was easier done than said. I currently get runs in 7 seconds with the simple kernel, 11.5 seconds with an empty-derivative DualNumber, and ~55 seconds with a full DualNumber. PerfLog says Meta spends 50% of its time in residual evaluations (not counting subroutines like the FE reinit) in the simple case and 66% in the empty-derivative case, so we're looking at 3.5 seconds with Real and 7.66 with empty-DualNumber. Not quite 2x overhead but it's within spitting distance.

friedmud commented 9 years ago

> The C++ way to change function pointers on the fly is with inheritance.

Inheritance doesn't work on the fly. Essentially, what the PETSc guys have is the ability to modify the vtable during the execution of the program. So they can change the behavior of already built objects. You are definitely correct though that their indirection through a function pointer removes the possibility of inlining and thus the possibility of vectorization.

At any rate - that was really more of a tangent / musing ;-)

What is empty-DSNA? Does it not have any vectors at all... or just no resizing of them?

Static bool in the class would be fine... I was thinking though that we may be using potentially different template instantiations of DualNumber in different places... which would mean multiple static variables (one for each combination of templating) - but I'm not sure on that yet. In the beginning we can ignore that ;-)

Your 3.5 second vs 7.66 seconds is fine with me. 2x is a number I pulled out of my bum... anything on that order is easy to explain: "You can use the old stuff and it will be faster... or you can have automatically computed Jacobians and it will be somewhat slower... your choice" ;-)

friedmud commented 9 years ago

Oh - if you do your own pointer and memory pool... don't forget about threading.

roystgnr commented 9 years ago

empty-DSNA is what you get if you use DynamicSparseNumberArray but you comment out the call to insert() so the derivatives vectors remain empty. No data, no indices, no allocation calls.

3.5 seconds vs 7.66 seconds may be about as good as it's going to get in the empty case, I'm afraid. I tried some "perf annotate", looked through callgrind again, etc., but I don't see any obvious hot spots or mechanism for optimization that wouldn't be a pessimization in the non-empty case.

I do think the non-empty case still has room to improve, but I'm not sure how much. Meta is taking derivatives w.r.t. 8 dofs per element, right? And it looks like the cost there is a ~14-fold slowdown over a plain-Real residual-only calculation. I'd hope we can cut that in half with a smart allocator but that may be the limit of it. Manually optimizing an analytic Jacobian calculation involves things like avoiding redundant calculation in symmetric terms, which we can't do automatically at such a low level.

friedmud commented 9 years ago

Ok - so empty-DSNA is basically what we could get if we implement PotentiallyDualNumber? Or could we do even better in that case because we don't even need to allocate the DSNA at all for all the temporaries?

roystgnr commented 9 years ago

The only remaining allocation in empty-DSNA is on the stack as part of the DualNumber, and so is nearly free (there's a chance that the extra bytes will increase cache misses).

There are currently two tests for indices.begin()==indices.end() in each empty-DSNA operation, though, which might be twice as much overhead as a single if(compute_derivatives) test...

roystgnr commented 9 years ago

Switching to Howard Hinnant's stack allocator and futzing with the size option a bit gives me no improvement, maybe a couple percent slowdown, no matter how large I make the "arena".

I wonder if we could pull indices vectors from a static pool of vectors at a higher level... use copy-on-write to share them as much as possible, use a hash table of them to avoid creating duplicates... then the common case where two operands have the same indices would be O(1) instead of O(N_nonzero) in the sparsity work. Except we might lose much more than we gain when we have multiple threads contending for the pool...
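
A sketch of that shared-indices musing, assuming a simple, non-thread-safe process-wide cache and hypothetical function names: operands holding the same interned index set can skip the merge entirely, and the deduplicating cache stands in for the "hash table of them" mentioned above (a std::map is used just to keep the sketch short).

```cpp
#include <algorithm>
#include <map>
#include <memory>
#include <vector>

using IndexSet = std::vector<unsigned int>;
using IndexSetPtr = std::shared_ptr<const IndexSet>;

// Deduplicating cache: identical index sets map to one shared, immutable
// instance.  Static and unguarded, so NOT thread-safe as written.
inline IndexSetPtr intern_index_set(const IndexSet & indices)
{
  static std::map<IndexSet, IndexSetPtr> cache;
  auto it = cache.find(indices);
  if (it == cache.end())
    it = cache.emplace(indices,
                       std::make_shared<const IndexSet>(indices)).first;
  return it->second;
}

// Sparsity union collapses to a pointer comparison whenever both operands
// already share the same interned index set.
inline IndexSetPtr sparsity_union(const IndexSetPtr & a, const IndexSetPtr & b)
{
  if (a == b)
    return a;                        // common case: identical sparsity
  IndexSet merged(a->size() + b->size());
  merged.resize(std::set_union(a->begin(), a->end(),
                               b->begin(), b->end(),
                               merged.begin()) - merged.begin());
  return intern_index_set(merged);
}
```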

roystgnr commented 9 years ago

Okay, putting this down for a while. I think I've exhausted all the low-hanging fruit.

friedmud commented 9 years ago

@roystgnr thanks for everything you've done! I've done some more monkeying around and I think we're in good shape to go ahead and use this and we can do more optimization later.

I wanted to see if using pointers would speed some of this up (because we wouldn't need to allocate / deallocate at all if we weren't computing the derivatives).

I liked what you were doing with just NOT inserting derivatives... that's actually great (and it's really easy in MOOSE to just NOT insert derivatives during residual calculation... so that's really a viable thing)... so I wanted to allocate _data and _indices (or possibly _derivs) only if there was an insertion for derivatives.

I have three branches:

https://github.com/friedmud/MetaPhysicL/tree/data_pointer : Make _data and _indices pointers inside DSNA.

https://github.com/friedmud/MetaPhysicL/tree/deriv_pointer : Make _deriv in DualNumber a pointer. Only allocate them if they're needed.

https://github.com/friedmud/MetaPhysicL/tree/both_pointers : Both

Here are the current timings (with DSNA, but no insertion):

- No AD: 3.5 s
- Current MetaPhysicL head: 8.4 s
- data_pointer: 6.2 s
- deriv_pointer: 5.7 s
- both_pointers: 5.64 s

Here's an interesting thing... the destructor takes up a lot of time. If I remove the destructor from DualNumber in deriv_pointer then it runs in 4.4 s... even though no derivatives are ever being allocated. Just the act of calling that many destructors eats up quite a bit of time!

Ok - so I was able to cut down the time quite a bit by doing this... BUT my solutions aren't perfect. They still segfault if you DO actually try to compute a Jacobian (Possibly an issue with using default move operators? I don't know).

I'm sure that @roystgnr can do better than me... I just wanted to know if it would make any difference... AND IT DOES.

So... one thing I'm thinking is that maybe changing DSNA to use a plain old C++ array pointer instead of std::vector might be the way to go. It seems to me that just allocating and deallocating the std::vectors themselves (not their data... because I was never inserting any data into any std::vector) is taking too much time...
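
For what it's worth, a tiny illustrative sketch of that raw-pointer direction (not one of the branches above, and the class name is made up): an object that has never had anything inserted costs only null-pointer initialization and a null-pointer check in the destructor, with no std::vector construction at all.

```cpp
#include <cstddef>

template <typename T, typename I>
class RawSparseStorage
{
public:
  RawSparseStorage() = default;

  ~RawSparseStorage()
  {
    delete [] _data;      // deleting a null pointer is a no-op
    delete [] _indices;
  }

  // Copy/move operations omitted; a real class would need them.
  RawSparseStorage(const RawSparseStorage &) = delete;
  RawSparseStorage & operator=(const RawSparseStorage &) = delete;

  // Nothing is allocated until derivatives are actually needed.
  void allocate(std::size_t n)
  {
    _data = new T[n];
    _indices = new I[n];
    _size = n;
  }

private:
  T * _data = nullptr;
  I * _indices = nullptr;
  std::size_t _size = 0;
};
```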

dschwen commented 5 years ago

Came here expecting to see @friedmud proposing to apply his object pool idea to this. Not a good idea?

roystgnr commented 5 years ago

"object pool" would be a subset (and probably the most tempting subset) of "custom allocator", wouldn't it? Unless I'm misunderstanding, I do think that's worth trying, but I was pretty disappointed when my stack allocator experiment didn't show me any improvement.

lindsayad commented 5 years ago

So I just tried changing our ADReal typedef in MOOSE:

```diff
-typedef DualNumber<Real, NumberArray<AD_MAX_DOFS_PER_ELEM, Real>> ADReal;
+typedef DualNumber<Real, DynamicSparseNumberVector<Real, unsigned int>> ADReal;
```

The results for a canonical lid-driven flow problem in our test suite are:

With the doom and gloom at the beginning of this ticket I was expecting much worse results...

lindsayad commented 5 years ago

@roystgnr What's your thought on having something like DynamicSparseNumberBase but with the underlying data members as std::array? I'll call it HybridClass here. HybridClass would have a size_t template parameter that establishes the length of the std::array data member. It would also have a _size member that changes with calls to insert. sparsity_union would more or less look the same. HybridClass::resize would refer to the internal member _size.

Naively, HybridClass should be something like SparseNumberArray/Vector. However, @friedmud and I were looking at that code today, and as far as we could tell the sparsity pattern needs to be known at compile time?
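
A rough sketch of the proposed HybridClass, with hypothetical names: fixed-capacity std::array storage plus a run-time _size, so the operations stay sparse but never touch the heap (and exceeding the preallocated capacity is fatal).

```cpp
#include <array>
#include <cassert>
#include <cstddef>

template <typename T, typename I, std::size_t N>
class HybridSparseArray
{
public:
  std::size_t size() const { return _size; }

  // "resize" only moves the logical end; the std::array capacity is fixed,
  // so exceeding the preallocated maximum is simply an error.
  void resize(std::size_t new_size)
  {
    assert(new_size <= N);
    _size = new_size;
  }

  // A real implementation would insert in sorted index order so that
  // sparsity_union can stay a linear merge.
  void insert(I index, const T & value)
  {
    assert(_size < N);
    _indices[_size] = index;
    _data[_size] = value;
    ++_size;
  }

private:
  std::array<I, N> _indices;
  std::array<T, N> _data;
  std::size_t _size = 0;
};
```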

roystgnr commented 5 years ago

So the motivation for HybridClass is that we'd avoid the dynamic memory allocation, but we'd still keep the operations sparse? That sounds like an optimization worth trying, yeah. The idea is that we'd treat ArraySparseNumberArray like DynamicSparseNumberArray, but if the dynamic _size ever exceeds the preallocated max_size then we're screwed?

You could probably also see if it's at all worthwhile to try a version of DSNA that uses a max_size template parameter but to just reserve(max_size) on an underlying vector when it's first allocated. That would avoid the heap juggling and copying on resize, and it could even be faster in cases where a copy constructor is used inadvertently since it wouldn't copy the reserved-but-unused storage. It wouldn't avoid the "_size" vs max_size tests, but by that token it would also be safer if you're ever thinking about trying to get away with max_size larger than the length of a dense row.

I guess that's not an either/or option, though; you could try both and then switch between them at compile time just like any other faster-vs-safer tradeoff.
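
For comparison, a minimal sketch of the reserve-up-front variant described two paragraphs up, with a hypothetical max_size template parameter and class name: still std::vector-backed, but each container reserves its maximum capacity once so later growth never reallocates.

```cpp
#include <cstddef>
#include <vector>

template <typename T, typename I, std::size_t max_size>
class ReservedSparseArray
{
public:
  ReservedSparseArray()
  {
    // One-time reservation: later insertions never reallocate or copy,
    // and copying the object only copies the used entries, not the
    // reserved-but-unused tail.
    _indices.reserve(max_size);
    _data.reserve(max_size);
  }

  void insert(I index, const T & value)
  {
    _indices.push_back(index);
    _data.push_back(value);
  }

private:
  std::vector<I> _indices;
  std::vector<T> _data;
};
```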

> something like SparseNumberArray/Vector. However, @friedmud and I were looking at that code today, and as far as we could tell the sparsity pattern needs to be known at compile time?

Yeah, that's true. It's awesome to have all the juggling done once per operation at compile time, instead of once for every single invocation of every operation, but it only works if the pattern doesn't depend on e.g. input files, which probably rules it out for MOOSE use. It would be possible to create a hybrid (compile-time sparsity where possible for things like momentum equation terms that falls back on run-time sparsity for things like chemical reaction terms) but probably not worth the amount of work that would take.

lindsayad commented 5 years ago

> So the motivation for HybridClass is that we'd avoid the dynamic memory allocation, but we'd still keep the operations sparse? That sounds like an optimization worth trying, yeah. The idea is that we'd treat ArraySparseNumberArray like DynamicSparseNumberArray, but if the dynamic _size ever exceeds the preallocated max_size then we're screwed?

Exactly

> I guess that's not an either/or option, though; you could try both and then switch between them at compile time just like any other faster-vs-safer tradeoff.

👍 Yea, I think we'll give the ArraySparseNumberArray a shot. Your other suggestion is also a good one.

> probably rules it out for MOOSE use

Unfortunately yea 😢

lindsayad commented 5 years ago

> You could probably also see if it's at all worthwhile to try a version of DSNA that uses a max_size template parameter but to just reserve(max_size) on an underlying vector when it's first allocated. That would avoid the heap juggling and copying on resize, and it could even be faster in cases where a copy constructor is used inadvertently since it wouldn't copy the reserved-but-unused storage. It wouldn't avoid the "_size" vs max_size tests, but by that token it would also be safer if you're ever thinking about trying to get away with max_size larger than the length of a dense row.

Sigh, I tried this patch

reserve.txt

where I reserved the maximum capacity we might have for local derivatives (3 variables on QUAD4 elements = 12) up front, and the performance worsened for my test Navier-Stokes problem, from 2.8 minutes to 3.75 minutes. Biggest jump came in libsystem malloc calls stemming from operator new, where I went from 1.07 minutes to 1.55 minutes.

lindsayad commented 5 years ago

I did successfully convert sparsity_union from being the heaviest function to not appearing at all in the timing, so I guess I know I conceptually accomplished my goal 😆

dschwen commented 5 years ago

> Biggest jump came in libsystem malloc calls stemming from operator new, where I went from 1.07 minutes to 1.55 minutes.

Pool based allocation should be ideal for this application.

lindsayad commented 5 years ago

Here's a summary of my Jacobian timings so far with the ad_lid_driven_stabilized test case in MOOSE, with the following derivative types:

Results

The selective copy SDSNA results really seem quite promising to me because the timing is independent of N, which would be huge for things like 3D problems; e.g. we could change our N in MOOSE from 50 to 100, which would immediately enable some simulations that couldn't be done before. Moreover, SDSNA would be fantastic for applications with large numbers of variables but still relatively small sparsity.

dschwen commented 5 years ago

> fat selective copy SDSNA: 13.38 seconds

Impressive!

lindsayad commented 2 years ago

It's funny that MOOSE AD 1.0 and 2.0 are described conceptually in back-to-back comments in https://github.com/roystgnr/MetaPhysicL/issues/1#issuecomment-139273980 and https://github.com/roystgnr/MetaPhysicL/issues/1#issuecomment-139274957 respectively

lindsayad commented 2 years ago

> Came here expecting to see @friedmud proposing to apply his object pool idea to this. Not a good idea?

Doesn't an object pool really only make sense with fixed-size types? From what I understand, @friedmud's SharedPool is a LIFO design. Generally speaking, dual number derivative data can be of variable size. You could pop off a derivative container that is size X when the next operation actually requires a container that is size Y. This is in the context of considering using a std::vector concept instead of std::array, the latter of which is what MOOSE uses at the present time.

friedmud commented 2 years ago

I think we already talked about it - but I'm just dropping a note here so we don't collectively forget: the SharedPool should work fine for this. What will happen is that the size of the vectors will all tend toward the largest number of nonzeros. In the beginning there will be quite a few resizes until it normalizes. But it will work.
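
A minimal sketch of that LIFO-pool behavior (not MOOSE's actual SharedPool, just an illustration): released vectors keep their capacity, so after a warm-up period acquired vectors rarely need to reallocate even when the requested sizes vary.

```cpp
#include <cstddef>
#include <vector>

template <typename T>
class VectorPool
{
public:
  // Hand out a vector of length n, reusing the most recently released one.
  std::vector<T> acquire(std::size_t n)
  {
    if (_pool.empty())
      return std::vector<T>(n);
    std::vector<T> v = std::move(_pool.back());
    _pool.pop_back();
    v.resize(n);   // reallocates only if n exceeds the retained capacity
    return v;
  }

  // Returned vectors keep their capacity, so sizes "tend toward the largest
  // number of nonzeros" after an initial warm-up.
  void release(std::vector<T> && v)
  {
    _pool.push_back(std::move(v));
  }

private:
  std::vector<std::vector<T>> _pool;   // LIFO: last released, first reused
};
```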

lindsayad commented 2 years ago

Yea I'm excited to try it out