kinnala / scikit-fem

Simple finite element assemblers
https://scikit-fem.readthedocs.io
BSD 3-Clause "New" or "Revised" License

speed benchmark #670

Closed: halbux closed this issue 3 years ago

halbux commented 3 years ago

Hi,

This is not an issue, just a question. I am developing my own FEM library in C++, but before that I worked in Matlab and I remember being able to get decent speeds with fully vectorized code. May I ask for a timing, for my personal knowledge, of what optimized Python code can provide:

How long does it take to assemble a grad*grad form (i.e. the weak formulation of the Laplace equation) on 1 million quads at order 1 interpolation (and order 2 and 3 as well, if possible) on a structured quad mesh of a square (no boundary conditions, just the assembly time)? The integration order should be exact for order 2 (so that's 4 Gauss points, if I am not mistaken). The time should be measured from the start of assembly to having the matrix A (of Ax=b) fully ready to be used. A 4-core Core i7 laptop would be best for the timing.
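For concreteness, the quantity being assembled is the standard Laplace stiffness bilinear form

$$a(u, v) = \int_\Omega \nabla u \cdot \nabla v \, \mathrm{d}x,$$

i.e. the matrix A with entries $A_{ij} = a(\phi_j, \phi_i)$ for the chosen basis functions $\phi_i$.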

Thanks for your help!

kinnala commented 3 years ago

There is something in the README using triangles. For quads I have nothing premade but can try it out when I get back from the holidays.

For quads you need a more complicated reference mapping if you want to be general w.r.t. the shape of the quad. That will certainly cost some cycles, which could be saved by using a less general mapping that just translates and scales the reference quad. We also don't have very good support for parallel assembly at the moment, so the numbers in the README are for single-threaded assembly.
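To spell out the distinction: the affine mapping

$$F(X) = BX + b$$

has a constant Jacobian per element and is exact only for parallelograms (or, in the translate-and-scale special case, rectangles), whereas a general quad needs the bilinear isoparametric map

$$F(X) = \sum_{i=1}^{4} x_i\, \hat\phi_i(X),$$

whose Jacobian varies over the element and therefore has to be evaluated at every quadrature point.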

halbux commented 3 years ago

OK, thank you. Yes, assembly for general quads is fine.

gdmcbain commented 3 years ago

Just to give you a very quick preliminary idea, I've put together something rudimentary: ex_670_speed_benchmark.py.

On my System 76 laptop running Pop!_OS 21.04 and just timing with

time python docs/examples/ex_670_speed_benchmark.py
(hence including meshing and building the basis), I get:

| mapping | t/s |
|---|---|
| affine | 1.8 |
| isoparametric | 5.3 |
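The script itself is not reproduced in this thread, but a minimal sketch of a benchmark along these lines might look like the following (the mesh size, the default integration order and the use of skfem.models.poisson.laplace are my assumptions, not necessarily what ex_670_speed_benchmark.py actually does):

```python
# Sketch only: a grad*grad assembly timing on ~1e6 quads with both mappings.
from timeit import default_timer as timer

import numpy as np
import skfem as fe
from skfem.models.poisson import laplace

mesh = fe.MeshQuad.init_tensor(*[np.linspace(0, 1, 1001)] * 2)  # ~1e6 quads
elem = fe.ElementQuad1()

for name, mapping in [('affine', fe.MappingAffine(mesh)),
                      ('isoparametric', fe.MappingIsoparametric(mesh, elem))]:
    start = timer()
    basis = fe.Basis(mesh, elem, mapping)
    A = laplace.assemble(basis)  # sparse stiffness matrix
    # note: the affine entries turn out to be wrong for quads, see below
    print(name, timer() - start)
```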
halbux commented 3 years ago

Thanks!

What do you mean by affine on a quad (I can understand affine for a triangle)? For the isoparametric case: are you talking about the usual bilinear (4-dof) quad shape functions?

kinnala commented 3 years ago

I've actually never checked whether MappingAffine works for a quad mesh like it is used above. This is not the default, and the behaviour may be undefined since it's not tested at all.

halbux commented 3 years ago

That seems more than 30% faster than the best I could ever obtain with my former vectorized Matlab code some time ago!

It is, however, a factor slower than the 0.95 s I get for 1 million (classical 4-dof "isoparametric") quads on my laptop with C++ in sparselizard.

kinnala commented 3 years ago

This part is rarely the bottleneck though, based on my observations. If you have a nonlinear problem with a really complex form, then assembly can be an issue, because the Jacobian is different at each step and reassembly is required.
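To illustrate the point with a hypothetical example (not code from this thread): in a Newton iteration the Jacobian form depends on the current iterate, so the matrix has to be reassembled at every step.

```python
# Hypothetical sketch: the reaction term u**3 added to Laplace is made up
# purely for illustration; only the reassembly pattern matters here.
import skfem as fe
from skfem.helpers import dot, grad

mesh = fe.MeshQuad().refined(5)
basis = fe.Basis(mesh, fe.ElementQuad1())

@fe.BilinearForm
def jacobian(u, v, w):
    # linearization of -div(grad u) + u**3 around the previous iterate w['prev']
    return dot(grad(u), grad(v)) + 3.0 * w['prev'] * w['prev'] * u * v

x = basis.zeros()
for it in range(5):
    # the Jacobian changes with x, so it must be reassembled at every step
    J = jacobian.assemble(basis, prev=basis.interpolate(x))
    # ... assemble the residual and solve J dx = -F(x) here ...
```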

kinnala commented 3 years ago

> It is, however, a factor slower than the 0.95 s I get for 1 million (classical 4-dof "isoparametric") quads on my laptop with C++ in sparselizard.

Is this single-core assembly?

halbux commented 3 years ago

That's absolutely true for low-order FEM, but it actually becomes a dominant cost for higher-order shape functions (>= order 3), where the assembly time can easily exceed the matrix solve time if not optimized (for static, steady-state and harmonic simulations at least). This is even more visible in electromagnetic problems, where even optimized assembly can still take a relatively large share of the total time (e.g. in magnetoquasistatics).

halbux commented 3 years ago

> It is, however, a factor slower than the 0.95 s I get for 1 million (classical 4-dof "isoparametric") quads on my laptop with C++ in sparselizard.

> Is this single-core assembly?

No, it's using the 4 cores of a Core i7 CPU.

kinnala commented 3 years ago

> That's absolutely true for low-order FEM, but it actually becomes a dominant cost for higher-order shape functions (>= order 3), where the assembly time can easily exceed the matrix solve time if not optimized (for static, steady-state and harmonic simulations at least). This is even more visible in electromagnetic problems, where even optimized assembly can still take a relatively large share of the total time.

This is a good point. In scikit-fem we do not have a well-optimized assembly routine for the case where you have lots of quadrature points per element. Everything is optimized around the case where you have lots of elements and relatively low-order integration. I have some ideas on how to make it better that I hope to test in the future.

kinnala commented 3 years ago

> Is this single-core assembly?

> No, it's using the 4 cores of a Core i7 CPU.

I believe the above timing (5.3 s) used only one core. We are not automatically doing any sort of multithreading, but I could look into parallelizing this example at some point.

halbux commented 3 years ago

Sounds good. For any future higher-order benchmark you might want to do, here are the timings for order 2 and 3 quads (i.e. 9- and 16-dof quads), with the Gauss quadrature order increased accordingly:

| element | time |
|---|---|
| 4-dof quad | 0.95 s |
| 9-dof quad | 4.2 s |
| 16-dof quad | 12.8 s |

These were measured on a quad-core Core i7 from 2019 with enough RAM. They include everything from meshing (structured mesh, so negligible time) to having the matrix A fully ready to use. The code is not optimized specifically for the grad*grad/structured-mesh case and it does not take advantage of any matrix symmetry.
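For a rough scikit-fem counterpart of the 9-dof (biquadratic) case, one could swap in ElementQuad2 and raise the integration order; the sketch below is illustrative only (the mesh size and intorder=4 are my choices, and no 16-dof bicubic quad element is attempted here):

```python
# Sketch of a biquadratic (9-dof quad) timing run.
from timeit import default_timer as timer

import numpy as np
import skfem as fe
from skfem.models.poisson import laplace

start = timer()
mesh = fe.MeshQuad.init_tensor(*[np.linspace(0, 1, 1001)] * 2)
elem = fe.ElementQuad2()
basis = fe.Basis(mesh, elem, intorder=4)
A = laplace.assemble(basis)
print(timer() - start)
```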

kinnala commented 3 years ago

I did 4 threads and added some caching:


```python
import numpy as np
import timeit
import skfem as fe
from numba import jit

@jit(nogil=True, nopython=True)
def nlaplace(out, du, dv):
    # accumulate grad(u) . grad(v) over the leading (component) axis
    for i in range(du.shape[1]):
        for j in range(du.shape[2]):
            for k in range(du.shape[0]):
                out[i, j] += du[k, i, j] * dv[k, i, j]

start = timeit.default_timer()
mesh = fe.MeshQuad1.init_tensor(*[np.linspace(0, 1, 1001)]*2)
element = fe.ElementQuad1()
mapping = fe.MappingIsoparametric(mesh, element)
basis = fe.Basis(mesh, element, mapping, intorder=2)

@fe.BilinearForm(nthreads=4)
def laplace(u, v, w):
    out = np.zeros_like(u.grad[0])  # zeros, not empty: nlaplace accumulates with +=
    nlaplace(out, u.grad, v.grad)
    return out

lap = laplace.assemble(basis)
stop = timeit.default_timer()
print(stop - start)
```

and the following diffs:

```diff
@@ -164,6 +165,7 @@ class MappingIsoparametric(Mapping):
 
         return detDF
 
+    @lru_cache(maxsize=128)
     def invDF(self, X, tind=None):
         J = [[self.J(i, j, X, tind=tind) for j in range(self.dim)]
              for i in range(self.dim)]
@@ -10,7 +10,7 @@ class ElementH1(Element):
     def gbasis(self, mapping, X, i, tind=None):
         """Identity transformation."""
         phi, dphi = self.lbasis(X, i)
-        invDF = mapping.invDF(X, tind)
+        invDF = mapping.invDF(HashableNdArray(X), HashableNdArray(tind) if tind is not None else None)
         if len(X.shape) == 2:
             return (DiscreteField(
                 value=np.broadcast_to(phi, (invDF.shape[2], invDF.shape[3])),
```

It improved the timing from about 5.5 seconds to 3.6 seconds. Still some work to do though.
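The diffs above rely on a HashableNdArray helper (and a functools.lru_cache import) that are not shown in the thread; purely as an illustration of the idea, since lru_cache needs hashable and comparable keys, such a wrapper might look roughly like this (the real implementation may differ):

```python
# Illustrative only: the actual HashableNdArray used above is not shown here.
import numpy as np


class HashableNdArray(np.ndarray):
    """ndarray view that can be used as an lru_cache key."""

    def __new__(cls, arr):
        return np.asarray(arr).view(cls)

    def __hash__(self):
        # content-based hash; cheap enough for cache keys
        return hash(self.tobytes())

    def __eq__(self, other):
        # NB: scalar equality for cache lookups, shadowing elementwise ==
        return isinstance(other, np.ndarray) and np.array_equal(self, other)

    def __ne__(self, other):
        return not self.__eq__(other)
```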

Thanks for the timings!

kinnala commented 3 years ago

@halbux by "matrix A ready to be used" you mean in some sort of CSC/CSR data structure, right?

halbux commented 3 years ago

Exactly: in CSR format (it turns out the conversion to CSR takes some time after all). It's not a standard BLAS function, so I had to write it myself. MKL, though, has a function for that, which I bet is highly optimized.
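On the Python side this finalization step is usually delegated to scipy.sparse (I believe this is essentially what scikit-fem does as well): build COO triplets element by element and convert, with duplicate entries summed. A toy sketch of the pattern:

```python
# Toy sketch of the COO -> CSR finalization step; duplicate (row, col)
# entries are summed during the conversion, which is exactly what
# element-by-element assembly needs.
import numpy as np
from scipy.sparse import coo_matrix

rows = np.array([0, 0, 1, 1, 0])
cols = np.array([0, 1, 0, 1, 0])
vals = np.array([1.0, -1.0, -1.0, 1.0, 0.5])  # last entry duplicates (0, 0)

A = coo_matrix((vals, (rows, cols)), shape=(2, 2)).tocsr()
print(A.toarray())  # [[ 1.5 -1. ] [-1.   1. ]]
```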

kinnala commented 3 years ago

Yes!

I found out about sparselizard already some time ago, but I'm surprised that you work in Finland as well. If you ever happen to visit Aalto University you can find me in the main building, M-wing.

I used to visit Tampere for project meetings and seminars some years ago, but right now there is nothing going on. Maybe we should arrange a Finnish Finite Element Fair or something. ;-)

halbux commented 3 years ago

I will, gladly, but so far I have never been there.

:)

gdmcbain commented 3 years ago

> I've actually never checked whether MappingAffine works for a quad mesh like it is used above.

Oops, no, it doesn't!

```python
import numpy as np  # needed for np.printoptions below

import skfem as fe
from skfem.models.poisson import laplace

m = fe.MeshQuad()
elem = fe.ElementQuad1()

with np.printoptions(suppress=True, precision=3):
    for mapping in [fe.MappingAffine(m), fe.MappingIsoparametric(m, elem)]:
        print(laplace.assemble(fe.Basis(m, elem, mapping)).toarray())
```

gives

```
[[ 0.5 -0.5  0.  -0. ]
 [-0.5  1.5 -0.  -1. ]
 [ 0.  -0.   0.5 -0.5]
 [-0.  -1.  -0.5  1.5]]

[[ 0.667 -0.167 -0.333 -0.167]
 [-0.167  0.667 -0.167 -0.333]
 [-0.333 -0.167  0.667 -0.167]
 [-0.167 -0.333 -0.167  0.667]]
```
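(For reference, the isoparametric result matches the exact Q1 stiffness matrix of the Laplacian on the unit square with counterclockwise node numbering,

$$K = \frac{1}{6}\begin{pmatrix} 4 & -1 & -2 & -1 \\ -1 & 4 & -1 & -2 \\ -2 & -1 & 4 & -1 \\ -1 & -2 & -1 & 4 \end{pmatrix},$$

so it is the affine numbers that are wrong.)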

I had assumed that it would be O.K. on quadrilateral elements that are affinely similar to the reference square, i.e. parallelograms, and in particular rectangular constructions like

https://github.com/kinnala/scikit-fem/blob/824821871e1ac07ac6194888d1ceedbb55653aec/docs/examples/ex19.py#L53-L56

(And I am hoping that I have not assumed and used this in any work to date…!)