libMesh / libmesh

libMesh github repository
http://libmesh.github.io
GNU Lesser General Public License v2.1

FE Optimization #890

Open friedmud opened 8 years ago

friedmud commented 8 years ago

So... I have a VERY time sensitive application. It is not actually doing FE... but it does need to evaluate FE variables at arbitrary points (actually, currently just one point) within each element. Here is the basic loop it's doing:

  1. Determine the point within the element
  2. Inverse map that point into the element's reference space
  3. FE::reinit() with that reference point
  4. Compute the value of variables using the shape functions

(This is using MOOSE... so it's using the FEProblem::reinitAtPhysical() code path. There is some overhead in MOOSE as well which I'm already addressing here: https://github.com/idaholab/moose/pull/6676 )

Also... my application needs to obtain surface normal vectors. It doesn't need any other information on the surface other than a normal vector. This is where the problems start.

The FE objects are supposed to be "smart" and only compute what they're asked for. The reality falls far short of that. Especially in FEMap... it does LOTS with second derivatives (including a LOT of resizing and memory allocation and zeroing) even if they're not being asked for.

In the case where all I need is a normal vector... I also have to call FE::get_phi() or the FE object will happily compute everything (I understand why... but there should maybe be an override). Currently I am calling get_phi() even for my face FE just to keep it from computing more stuff (I don't actually need phi!). In that same vein... I don't need JxW at all because I'm not doing any integration...

Also: in general, there is too much allocation of temporary objects happening in FE and FEMap. This shows up directly in timing (even of normal MOOSE applications) and hinders threaded scalability (memory allocations and deallocations are mutexed).

This is absolutely critical to me right now. My application is running about 5x slower than I need it to... and the majority of the time is going into the FE objects. This is all for some huge runs I need to do for a conference that is a month away (where I have an hour long plenary talk). I've barely slept in the last two weeks as I've been working on this... and I don't plan on sleeping much for the next month. ANY help you guys can give me on optimizing this I would GREATLY appreciate!

Ok. So, in my normal way I've made a little test application to help figure out where the time is going and allow us to make optimizations and see their effect. That application can be found here: https://github.com/friedmud/simple_libmesh_app/tree/fe_optimization (note that it's the fe_optimization branch in that repo).

Right now it's just a simple little application that runs FE::reinit() on every element. You can run the application like this:

./simple-opt --n-elem 10 --n-sweeps 10 --fe-reinit-qrule --fe-reinit-custom

And you will see this:

Reinit with qrule...
Total Time: 0.000365026
ns/element: 365.026
Levelized ns/element: 365.026

Reinit with custom point...
Total Time: 0.000769435
ns/element: 769.435
Levelized ns/element: 769.435

("Levelized ns/element" is the ns/element multiplied by the current number of processors in use (mpi * threads); it makes it easy to look at parallel speedup.)

What this is currently showing is that reiniting with a custom point takes nearly twice as long as reiniting with a qrule. (Look here if that's not clear: https://github.com/friedmud/simple_libmesh_app/blob/fe_optimization/include/ThreadedComputation.h#L55 ). I really need that custom-point evaluation time to come WAY down.

I have run both of these cases in Instruments. Here's what I get for the qrule case:

qrule_timing

Here's what I get for the custom case:

custom_timing

As you can see... one of the big differences is that the custom one spends a LOT of time in FEMap::init_reference_to_physical_map() which is right here: https://github.com/libMesh/libmesh/blob/master/src/fe/fe_map.C#L63 . It appears that a lot of time is spent in that routine dealing with second derivatives... even though none are being requested here (among other things).

Also: look at how much std::vector::assign() is taking up in both cases. Even worse... look at how much time is going into mallocing and freeing objects! In the "custom" point case I see about 20% of the overall runtime going into allocation/deallocation! That's just tons of temporaries getting created and destroyed. We've got to be able to fix a lot of that!

Ok - so hopefully you guys have an idea of what I'm up to here and you see what I'm after. Any help I can get on this I would very MUCH appreciate! I've already lost quite a bit of time to this already.

Feel free to fork my application and add to it / modify it. It needs some more studies (like one on inverse mapping a point!). I accept pull requests :-)

friedmud commented 8 years ago

Oh - to give you an idea of the time I'm looking at here. My "pure" application without any FE stuff in it takes ~80 ns/element. That will double to about 150-200 ns/element once I put some more physics in it.

After adding in the inverse map and MOOSE FEProblem::reinitAtPhysical() I'm currently seeing ~3000 ns/intersection! A 10x increase from my "pure" application. Some of that is ok... I know it's going to take some time to evaluate shape functions and variables... but I can't stomach 3000 ns/element... my runs will take forever.

friedmud commented 8 years ago

I thought you guys might want to see what I'm actually seeing in my application. So here's a profiling run (I also "rolled up" a lot of the memory allocation/deallocation time so we can get a good idea of where time is being spent):

mocadile_profile

That FEMap::init_reference_to_physical_map() is definitely the big killer...

roystgnr commented 8 years ago

@pbauman, what was the final status of that vectorizable code you had a student working on? I know we were originally interested in swapping stride directions under the hood, but I wonder if we also win anything by being able to resize n_dofs*n_qp arrays with a single reallocation rather than n_dofs reallocations.

roystgnr commented 8 years ago

The FE objects are supposed to be "smart" and only compute what they're asked for.

Originally they were supposed to always compute the kitchen sink. Then I got worried enough to change that when adding second derivative support, since computing 13 terms at a time when all preexisting applications only needed 4 would have been grossly unacceptable for efficiency. We didn't compute mapping second derivatives at that point, though, because I was too lazy to bother with non-affine elements... and when John fixed that, I was so happy to have correctness that I didn't even think again about efficiency.

Making the mapping 2nd derivative calculation conditional on calculate_d2phi would be straightforward enough.

What I'd really like to do, though, is also make mapping 1st derivative calculations conditional on some new "calculate_J_map" (which would then be turned on by get_dphi() and anything else which needs it)... but I wonder if we can do that without breaking user code. Code which prerequests nothing will still work (and still be inefficient), code which (pointlessly, right now) includes get_JxW() in its prerequests will still work, but code which tries to do an L2 projection, prerequests get_phi(), but doesn't prerequest get_JxW(), will segfault as the additionally-optimized FE object doesn't bother doing the JxW allocation or calculation.

friedmud commented 8 years ago

OMG Roy... thanks so much for helping on this! I am super stuck right now if this doesn't get faster (like, it's waking me up at night (when I do eventually get a nap)).

I don't think very much user code would rely on NOT pre-requesting get_JxW(). I know that we always pre-request all information from the FE object in the Examples. The implication has always been that if you don't pre-request it, it won't be computed. I think most people are already operating under that assumption for everything from FE.

If we could just make everything dependent on whether or not it's requested that would be awesome.

What do you think about some of the memory stuff? A really big chunk of the time is going into memory allocation/deallocation in the FE routines. There are tons of temporaries used in there that are eating up massive amounts of time. One fix might be to make FEMap a real class... so that it can have "work vectors" and "work tensors" allocated within it... FE would just have an instance of FEMap it would just hold on to.

I know that's a bigger design change though. Maybe there is some way to fix the issue without doing that large of a redesign...

roystgnr commented 8 years ago

I'm juggling other stuff so can't focus solely on this, but I'm very happy to help. I noticed the same problems when reoptimizing the projection code, but didn't have time to fix them alone, and didn't trust myself to do the optimization based on application runs where (once I fixed the other inefficiencies) the troublesome parts weren't the majority of run time.

Your test case seems to fix that, thanks! Is it the fe-reinit-qrule or fe-reinit-custom case that's more critical to your real app?

roystgnr commented 8 years ago

Sorry, missed that bit; looking at fe-reinit-custom.

roystgnr commented 8 years ago

As a workaround on the application side: is there any chance that you can create a custom qrule? It seems paradoxical that the qrule case runs faster, but there's actually a good reason for that: the FE checks to see if the element type and qrule haven't changed since the previous element, and when that holds true the FE avoids some recomputations. With a plain vector of points, on the other hand, there's no way to know whether recomputation is necessary and so FE recomputes everything.

friedmud commented 8 years ago

We actually do use a custom qrule - but I believe it ends up calling through basically the same path (because it sets shapes_need_reinit or whatever). That's because the points really are changing every time through in my application.

I'll get a custom qrule version into the test code later today.

I'm also going to add an inverse_map() timing loop in there too.

Thanks again for your help!

pbauman commented 8 years ago

@pbauman, what was the final status of that vectorizable code you had a student working on? I know we were originally interested in swapping stride directions under the hood, but I wonder if we also win anything by being able to resize n_dofs*n_qp arrays with a single reallocation rather than n_dofs reallocations.

IIRC, Tim left his branch available. He hadn't tried to drop in the new class, but I think the API was consistent so a typedef might be an option to get up and running quickly. He was cognizant of the memory allocations, so you might be correct that this could help. @tradowsk, please chime in.

friedmud commented 8 years ago

I've updated the test application. I added a new --inverse-map option that simply inverse maps one point per element.

I currently see:

./simple-opt --n-elem 1500 --n-sweeps 40 --inverse-map

Inverse map...
Total Time: 17.9967
ns/element: 199.963
Levelized ns/element: 199.963

And here is what Instruments tells us:

inverse_map_timing

As you can see from Instruments... a good chunk of the time is going into the shape function computations.

boyceg commented 8 years ago

FWIW, I have seen similar things and have been contemplating aggressively caching these types of quantities (this is for a total Lagrangian calculation without AMR).


friedmud commented 8 years ago

I traced down where the shape function computation is in inverse_map() and it comes from fe_map.C:1735

template <unsigned int Dim, FEFamily T>
Point FE<Dim,T>::map (const Elem * elem,
                      const Point & reference_point)
{
  libmesh_assert(elem);

  Point p;

  const ElemType type     = elem->type();
  const Order order       = elem->default_order();
  const unsigned int n_sf = FE<Dim,LAGRANGE>::n_shape_functions(type, order);

  // Lagrange basis functions are used for mapping
  for (unsigned int i=0; i<n_sf; i++)
    p.add_scaled (elem->point(i),
                  FE<Dim,LAGRANGE>::shape(type,
                                          order,
                                          i,
                                          reference_point)
                  );

  return p;
}

While I'm not sure there's anything that can be done to optimize the shape function evaluation itself, one BIG thing that I see is the temporary Point! We create a new one every time map() is called (and it's called inside a Newton loop in inverse_map())... and then copy it out...

How about passing in a Point by reference?

pbauman commented 8 years ago

Does the Return Value Optimization not save you here?

boyceg commented 8 years ago

Not according to Instruments. :-)


pbauman commented 8 years ago

Not according to Instruments. :-)

I don't see the Point copy constructor in any of the Instruments timings here. I just see FE::inverse_map at the top...

boyceg commented 8 years ago


I don't see the Point copy constructor in any of the Instruments timings here. I just see FE::inverse_map at the top...

I don't think I have sent in any timings...


pbauman commented 8 years ago

I don't think I have sent in any timings...

Sorry, I meant Derek's timings (which is what I assumed you were referencing).

boyceg commented 8 years ago


Sorry, I meant Derek's timings (which is what I assumed you were referencing).

I can send some if anyone wants to take a look. I was really just chiming in as someone else who would be really happy to see these kinds of optimizations! The caching code that I am thinking about putting together is really kind of gross.


friedmud commented 8 years ago

In theory Return Value Optimization should help - but it won't fix all of them. Just look at all of the temporaries that are getting created and destroyed in inverse_map(). Most of these are getting created and destroyed for every iteration of the Newton loop...

https://github.com/libMesh/libmesh/blob/master/src/fe/fe_map.C#L1344
https://github.com/libMesh/libmesh/blob/master/src/fe/fe_map.C#L1363
https://github.com/libMesh/libmesh/blob/master/src/fe/fe_map.C#L1366 (there might actually be 2 here)
https://github.com/libMesh/libmesh/blob/master/src/fe/fe_map.C#L1394
https://github.com/libMesh/libmesh/blob/master/src/fe/fe_map.C#L1441
https://github.com/libMesh/libmesh/blob/master/src/fe/fe_map.C#L1442
https://github.com/libMesh/libmesh/blob/master/src/fe/fe_map.C#L1503
https://github.com/libMesh/libmesh/blob/master/src/fe/fe_map.C#L1504
https://github.com/libMesh/libmesh/blob/master/src/fe/fe_map.C#L1505

This next one is seriously problematic. It's allocating and destroying a whole RealTensorValue for every iteration of the newton loop in inverse_map(): https://github.com/libMesh/libmesh/blob/master/src/fe/fe_map.C#L1519

@boyceg I would definitely be interested in seeing your timings!

friedmud commented 8 years ago

Part of the problem here is that inverse_map() is a static function. That means that there can't be any class member variables to use as "work vectors". The only options are either to pass in work vectors (which is tedious at best)... or just deal with tons of allocation. Now, many of these can at least be moved outside the Newton loop so they are allocated once... but that's still allocation I would like to avoid.

I'm up for ideas on how to fix that. Should we just NOT have these functions be static... and require that you have a valid FE object hanging around to call them on? In user code that would definitely work (because user code almost always has FE objects hanging around anyway)... but it might have negative implications for all of the places this is used in the library...

pbauman commented 8 years ago

Nice, glad you're pinning it down (curse you C++ compiler for not optimizing this!). Re: static, this is static because there's an interface to the inverse_map in FEInterface. I don't have any cycles ATM to make suggestions, but certainly we should try to reduce the number of temporary Points.

friedmud commented 8 years ago

I just added a --dim option to the testing application. Use it like this:

./simple-opt --n-elem 20 --n-sweeps 40 --inverse-map --dim 3

Inverse map...
Total Time: 0.160041
ns/element: 526.451
Levelized ns/element: 526.451

roystgnr commented 8 years ago

Wait, where is Node::Node() getting called from inverse_map? We should be constructing Points there, never Nodes.

tradowsk commented 8 years ago

IIRC, Tim left his branch available. He hadn't tried to drop in the new class, but I think the API was consistent so a typedef might be an option to get up and running quickly. He was cognizant of the memory allocations, so you might be correct that this could help. @tradowsk, please chime in.

My shape function class is still up on my libMesh fork: fe_shape_function_1d.h The class maintains the same shape function API that is already implemented, and uses the correct stride for storing the shape functions. However, timing runs on my (simple) Poisson example showed insignificant effects on overall runtime, and autovectorization by the compiler (several GCC versions) was difficult-to-impossible. Perhaps combined with other refactoring this would make more of a difference. And just a note, my shape function class might not be fully implemented, in that there may be other API functions that still need to be done. I only focused on what I needed to get my test case working.

friedmud commented 8 years ago

@roystgnr you are absolutely right about Node(). I can't see what's happening there at all. I would say it might have something to do with my nodal coordinate averaging... but I can't see how...

I'm in class right now - but I'll dig some later...

friedmud commented 8 years ago

@roystgnr was right... it was just a bad timing (I think I got some of the mesh creation time in there).

Here is the real timing for 3D. Sorry about that!

I'm going to delete my earlier post as there is no need for misinformation on here...

3d_inverse_map_timing
lindsayad commented 5 years ago

So based on the discussion in ibamr/ibamr#436, I'm wondering whether we can add a fairly specific optimization where we cache an initial phi evaluation and then avoid re-calculating phi when the quadrature rule isn't changing? @drwells

drwells commented 5 years ago

I must admit that I don't know a ton about libMesh outside of the little bit we use in IBAMR so some of this might not make sense.

The problem is this: shape function values are not always translation invariant (e.g., Raviart-Thomas): libMesh would need some way to compute, depending on the finite element and (possibly) the mapping, whether or not this property holds. IIRC there is some code in libMesh which attempts to avoid recomputing all values if the next element is a translation of the current one; ideally, the "can we avoid recomputing shape functions" code would be integrated together with this. It would require more than just booleans.

roystgnr commented 5 years ago

IIRC there is some code in libMesh which attempts to avoid recomputing all values if the next element is a translation of the current one: ideally, the 'can we avoid recomputing shape functions' code would be integrated together with this. It would require more than just booleans.

Take a look at what's going on in FE::reinit. There may be an inadequacy or a bug, but the "can we avoid recomputing shape functions" code is mostly there.

If we're handed a vector of points rather than using our attached quadrature rule, we recompute.

If we're handed a quadrature rule, but the rule changes because this element type or p level is different from the previous one, we recompute.

If the nodes are just a translation of the previously evaluated element, and if the FE type's shape functions depend on the particular element, then we recompute.

Hmm... but it looks like you're talking about a case where there really is an inadequacy!? If you aren't on a translation of the previous element but the FE type's shape functions are independent of element, then we recompute regardless. Trying to remember what I was thinking ten years ago, my guess would be that I figured that most people in this case are asking for basis gradients or higher rather than just values, and we have to recompute gradients anyway so we might as well spend a little more time recomputing everything rather than make that complicated heuristic even more complicated.

But if that's not true for you, or even if you're computing gradients too but think you can see some savings by reusing dphidxi and only recomputing dphi, we might have some room to improve here. Our test coverage is literally a thousand times better than it was when I wrote the first pass of that optimization, so I'm no longer scared of making it a little more complicated still in the name of speed.

lindsayad commented 5 years ago

@drwells indicated that they often use 100+ quadrature points in their calculations, so I expect the speed-up could be quite appreciable. I guess only profiling would tell the truth of it 😄

drwells commented 5 years ago

@roystgnr

my guess would be that I figured that most people in this case are asking for basis gradients or higher rather than just values, and we have to recompute gradients anyway so we might as well spend a little more time recomputing everything rather than make that complicated heuristic even more complicated.

Yes: as @lindsayad mentioned, we spend a ton of time in the part of IBAMR that sets up the load vector for an L2 projection (i.e., no derivatives): we use O(100) quadrature points per element and recomputing all the shape function values becomes a bottleneck there. The cost is avoidable precisely because we do not need any derivative values.

jwpeterson commented 5 years ago

@drwells what quadrature rule are you using? Are you integrating a smooth but highly oscillatory function? A very rough function? Just curious as this definitely isn't a typical use case...

boyceg commented 5 years ago

These cases with large numbers of quadrature points show up when we map functions between meshes with different element sizes. The number of quadrature points is related to the relative sizes of the two meshes. The integrands look like products of FE basis functions with a smooth (but generally non-polynomial) kernel. (The current numerical scheme is described here. I can provide some more details if you are interested. There are probably better ways to do it.) Roughly speaking, if we don't have enough quadrature points when we do fluid-solid coupling, we get leaks, so we don't want to use too few. :-)

@drwells, perhaps we should put in a unit test that the heuristics that are being used to dynamically determine the quadrature rules are reasonable. They are right for approximately isotropic quad/hex elements, but probably should be confirmed for other cases.

roystgnr commented 5 years ago

we spend a ton of time in the part of IBAMR that sets up the load vector for an L2 projection (i.e., no derivatives): we use O(100) quadrature points per element and recomputing all the shape function values becomes a bottleneck here. This cost is only avoidable since we do not need any derivative values.

Huh. I wonder if we can easily avoid this? Assuming you're properly prerequesting arrays from your FE object (and if not then you need to start!!!), we already keep track of whether we're supposed to be calculating derivatives or not, so we could probably check that same information earlier. if (!calculate_dphiref) then we don't need to care whether cached_nodes_still_fit or not, so long as the quadrature rule hasn't changed. That's not a one-liner change unfortunately but it would at least only refactor one function.

roystgnr commented 5 years ago

heuristics that are being used to dynamically determine the quadrature rules

How dynamically are we talking about? If your rule is usually changing from element to element then even in the best case we're not going to be able to get much optimization out of rarely caching old function values. (and in the worst case, if you aren't properly overriding shapes_need_reinit() then we're not going to detect the mistake, we're just going to silently cache now-wrong values)

boyceg commented 5 years ago

It depends on the strain states of the elements, but things can jump around a lot. Because we only need function values and quadrature weights here for LAGRANGE elements, currently we just cache the results from setting up each quadrature rule.