Closed trevilo closed 3 years ago
@koomie: I seem to remember you ran some extra tests before merging your performance optimizations in #32. Is that correct? If so, can you remind me what you did so I can reproduce it here. Thx.
@koomie so this takes the code to be the same as Nektar++, right? (performance-wise that is)
> @koomie: I seem to remember you ran some extra tests before merging your performance optimizations in #32. Is that correct? If so, can you remind me what you did so I can reproduce it here. Thx.
The extra bit I did was just to run for more iterations and make sure the solution differ check passed. I believe I did 1000 iterations for the comparison (most likely using the `input.4iters.cyl` config).
> The extra bit I did was just to run for more iterations and make sure the solution differ check passed. I believe I did 1000 iterations for the comparison (most likely using the `input.4iters.cyl` config).
Thanks. I'm starting that test now.
> The extra bit I did was just to run for more iterations and make sure the solution differ check passed. I believe I did 1000 iterations for the comparison (most likely using the `input.4iters.cyl` config).
Ok, no diffs here (to within our default tolerances of course) after 1000 p=3 iterations.
Well done. 👍
This PR includes performance improvements targeting the CPU version at moderate order. Specifically, we observe approximately a 25-30% reduction in the time per time step for the coarse cylinder case in serial at p=3. The test case input is pasted below. It is the same as the p=1 coarse cylinder regression test, except that the order has been raised to 3 and the CFL has been reduced to 0.12.
I am using the `koomietx/mfem:4.2.tps` container with the following configuration. With this setup, running the above case with `main` produces the first set of timings below; with the mods on this branch, the same configuration produces the second.
So, the time per iteration drops from 4.90 sec to 3.36 sec, an improvement of approximately 31%.
Most of this gain is realized by refactoring `Gradients::computeGradients_cpu()` in 0af305f. This commit introduces a `DenseMatrix` for each element, denoted `Ke[el]`, that contains the operator corresponding to the integral of the basis function times its gradient over the element. The array of these matrices is stored with the `Gradients` class and evaluated at construction. This allows us to avoid re-evaluating the basis functions and their gradients at every call to `Gradients::computeGradients_cpu`, which gives a substantial performance benefit, at the cost of storing these element matrices of course.

Conceptually similar modifications might also be possible for `GradFaceIntegrator::AssembleFaceVector`. Currently this capability is implemented using MFEM's `NonlinearFormIntegrator` class, but it is actually a linear operator. So, we could form the linear operator at construction time, store it, and then apply it at every substep. This would avoid the need to repeatedly evaluate the basis functions, which might reduce the cost. However, this refactor is a bit more complicated, so I decided not to pursue it here.

Other modest gains are realized here by using MFEM's built-in linear algebra functionality where possible, by some minor loop re-ordering, and by moving memory allocation out of inner loops.
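To make the element-matrix idea concrete, here is a minimal standalone sketch of the precompute-then-apply pattern. Plain `std::vector` stands in for MFEM's `DenseMatrix`, `ElemGradOp` is an illustrative stand-in for the stored `Ke[el]` data, and the real basis-function machinery is omitted; this is not the actual project code.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for one precomputed element operator Ke[el]:
// an ndof x ndof matrix (row-major) built from integrals of
// basis_i times the gradient of basis_j over the element.
struct ElemGradOp {
  std::size_t ndof;          // degrees of freedom per element
  std::vector<double> Ke;    // ndof * ndof entries, row-major
};

class Gradients {
public:
  // Evaluate and store Ke for every element once, at construction.
  explicit Gradients(std::vector<ElemGradOp> Ke) : Ke_(std::move(Ke)) {}

  // Each call is now just a small mat-vec per element;
  // no basis functions or gradients are re-evaluated.
  void computeGradients_cpu(std::size_t el, const std::vector<double>& u,
                            std::vector<double>& grad) const {
    const ElemGradOp& op = Ke_[el];
    grad.assign(op.ndof, 0.0);
    for (std::size_t i = 0; i < op.ndof; ++i)
      for (std::size_t j = 0; j < op.ndof; ++j)
        grad[i] += op.Ke[i * op.ndof + j] * u[j];
  }

private:
  std::vector<ElemGradOp> Ke_;  // stored with the class, built once
};
```

The trade-off is exactly as described above: one mat-vec per element per call, in exchange for holding all element matrices in memory.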
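The last point, moving memory allocation out of inner loops, is the kind of change involved. A hypothetical before/after sketch (all names made up, not actual project code; both versions compute per-element sums of squares):

```cpp
#include <cstddef>
#include <vector>

// Before: a scratch buffer is allocated on every loop iteration.
void sumsBefore(const std::vector<std::vector<double>>& elems,
                std::vector<double>& out) {
  out.clear();
  for (const auto& e : elems) {
    std::vector<double> scratch(e.size());  // allocated every element
    for (std::size_t i = 0; i < e.size(); ++i) scratch[i] = e[i] * e[i];
    double s = 0.0;
    for (double v : scratch) s += v;
    out.push_back(s);
  }
}

// After: the buffer is hoisted out of the loop and reused; assign()
// only reallocates if the existing capacity is too small.
void sumsAfter(const std::vector<std::vector<double>>& elems,
               std::vector<double>& out) {
  out.clear();
  std::vector<double> scratch;              // reused across elements
  for (const auto& e : elems) {
    scratch.assign(e.size(), 0.0);
    for (std::size_t i = 0; i < e.size(); ++i) scratch[i] = e[i] * e[i];
    double s = 0.0;
    for (double v : scratch) s += v;
    out.push_back(s);
  }
}
```

Both functions produce identical results; the second simply avoids repeated allocator traffic in the hot loop.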