Comparison of performance among different linearised FE bases

wei3li commented 3 months ago

Background

There are currently five implementations of linearised bases in Gridap: the linear bases on the refined mesh, the macro-element bases lately implemented by @JordiManyer, the hqk-iso-q1 and qk-iso-q1 bases that @amartinhuertas implemented recently and the q1-iso-qk bases @wei3li implemented. This issue presents the experiment results comparing the performance of these bases. As a reference, the standard high-order test basis is also included in the tests.

Experiment settings

Code branches and commits: branch adaptivity with the last commit fedd16d and branch refined-discrete-models-linearized-fe-space with the last commit 3a1894b.
Integration measure: GenericMeasure on the refined triangulation or CompositeMeasure.
Cases studied: matrix and vector assembly, linear system solver, and gradient computation.
Time duration per benchmark: 50 seconds.
PDE: vanilla Poisson equation with pure Dirichlet boundary.
Domain: $[0,1]^2$
FE spaces: trial space order $k_U$, linearised or standard test space.
Mesh resolution: $100\times100$ ($k_U=2$) or $50\times50$ ($k_U=4$).

Results

$k_U=2$, $100\times100$ elements

`GenericMeasure`

Matrix and vector assembly

test basis	evaluation count	memory (MiB)	time median (ms)	time mean (ms)	time min (ms)	time max (ms)
standard	81	118.249	601.674	617.122	554.323	1508.448
refined	101	87.152	500.411	495.346	450.780	527.661
macro	74	228.233	673.310	683.508	605.975	1165.440
hqk-iso-q1	70	254.843	709.548	721.288	621.773	1155.562
qk-iso-q1	83	135.885	603.984	607.386	559.791	1095.971
q1-iso-qk	84	118.126	592.466	595.443	548.548	1066.729

Linear system solver

test basis	evaluation count	memory (MiB)	time median (ms)	time mean (ms)	time min (ms)	time max (ms)
standard	246	97.105	203.003	203.365	140.202	252.466
refined	143	159.734	356.640	350.272	240.478	839.505
macro	209	97.105	245.238	239.315	167.269	261.727
hqk-iso-q1	207	97.105	245.711	242.456	168.101	270.206
qk-iso-q1	209	97.105	243.747	239.700	168.415	262.383
q1-iso-qk	212	97.105	244.835	236.591	165.827	326.457

Gradient computation

test basis	evaluation count	memory (MiB)	time median (ms)	time mean (ms)	time min (ms)	time max (ms)
standard	1008	12.684	49.936	49.573	43.264	60.129
refined	4182	1.651	11.703	11.933	10.499	23.960
macro	112	471.928	464.568	449.553	340.901	493.479
hqk-iso-q1	88	570.564	599.284	570.029	417.078	629.738
qk-iso-q1	436	74.397	119.699	114.891	81.411	133.660
q1-iso-qk	985	12.211	50.504	50.764	43.212	60.455

`CompositeMeasure`

Matrix and vector assembly

test basis	evaluation count	memory (MiB)	time median (ms)	time mean (ms)	time min (ms)	time max (ms)
standard	100	76.268	501.499	501.428	471.969	551.721
macro	85	186.253	595.881	592.679	539.421	642.461
hqk-iso-q1	81	212.863	625.005	620.765	551.696	646.356
qk-iso-q1	93	93.905	543.263	541.403	497.973	565.248
q1-iso-qk	99	76.146	506.167	507.026	466.603	545.932

Linear system solver

test basis	evaluation count	memory (MiB)	time median (ms)	time mean (ms)	time min (ms)	time max (ms)
standard	208	97.105	243.940	240.931	166.541	260.470
macro	211	97.105	240.517	237.437	166.311	254.555
hqk-iso-q1	208	97.105	242.114	240.551	170.647	255.955
qk-iso-q1	209	97.105	240.733	239.253	167.338	258.390
q1-iso-qk	205	97.105	245.872	244.404	165.923	258.692

Gradient computation

test basis	evaluation count	memory (MiB)	time median (ms)	time mean (ms)	time min (ms)	time max (ms)
standard	896	15.500	55.823	55.787	46.525	66.228
macro	103	474.745	495.679	488.819	355.624	521.199
hqk-iso-q1	80	573.380	632.745	627.103	421.270	656.142
qk-iso-q1	421	77.214	121.660	118.903	90.275	140.181
q1-iso-qk	931	15.028	54.173	53.728	45.628	65.903

$k_U=4$, $50\times50$ elements

`GenericMeasure`

Matrix and vector assembly

test basis	evaluation count	memory (MiB)	time median (ms)	time mean (ms)	time min (ms)	time max (ms)
standard	23	511.088	2142.683	2182.792	2040.758	3218.346
refined	46	193.999	1096.248	1107.764	1029.602	1510.023
macro	11	2847.378	4569.994	4629.011	4396.836	5484.642
hqk-iso-q1	16	1479.299	3170.421	3184.379	2998.272	3872.120
qk-iso-q1	22	636.913	2287.823	2317.330	2182.037	3275.423
q1-iso-qk	28	510.276	1739.053	1816.646	1698.290	2814.544

Linear system solver

test basis	evaluation count	memory (MiB)	time median (ms)	time mean (ms)	time min (ms)	time max (ms)
standard	149	163.848	353.452	337.094	214.808	370.407
refined	110	380.753	454.400	456.432	338.113	1029.666
macro	144	163.848	356.063	348.772	219.289	391.376
hqk-iso-q1	146	163.848	358.913	342.579	221.435	383.034
qk-iso-q1	163	163.848	340.255	307.160	215.376	469.293
q1-iso-qk	163	163.848	339.354	306.980	216.203	367.015

Gradient computation

test basis	evaluation count	memory (MiB)	time median (ms)	time mean (ms)	time min (ms)	time max (ms)
standard	63	21.168	793.975	794.513	782.569	822.039
refined	2841	1.687	17.398	17.575	15.891	41.881
macro	6	8291.730	8871.615	8883.648	8835.629	8955.283
hqk-iso-q1	12	3373.315	4242.826	4271.691	4155.736	4401.237
qk-iso-q1	43	390.172	1182.750	1178.850	1061.146	1256.847
q1-iso-qk	64	18.584	790.719	791.845	780.108	818.857

`CompositeMeasure`

Matrix and vector assembly

test basis	evaluation count	memory (MiB)	time median (ms)	time mean (ms)	time min (ms)	time max (ms)
standard	33	160.284	1491.290	1519.152	1413.872	2552.225
macro	16	2496.574	3310.152	3338.588	3173.741	3827.743
hqk-iso-q1	21	1128.494	2352.375	2403.960	2208.001	3595.075
qk-iso-q1	33	286.109	1524.338	1561.950	1460.615	2726.133
q1-iso-qk	36	159.472	1406.520	1429.197	1386.492	2074.379

Linear system solver

test basis	evaluation count	memory (MiB)	time median (ms)	time mean (ms)	time min (ms)	time max (ms)
standard	140	163.848	360.311	357.244	223.392	378.598
macro	156	163.848	347.917	320.725	218.833	370.570
hqk-iso-q1	141	163.848	362.729	356.829	224.550	374.047
qk-iso-q1	143	163.848	356.111	351.339	221.776	369.621
q1-iso-qk	163	163.848	343.255	307.531	217.795	372.579

Gradient computation

test basis	evaluation count	memory (MiB)	time median (ms)	time mean (ms)	time min (ms)	time max (ms)
standard	64	27.512	782.116	781.978	769.104	800.036
macro	8	8298.074	6377.383	6365.425	6270.046	6489.458
hqk-iso-q1	15	3379.659	3357.935	3342.675	3167.876	3443.564
qk-iso-q1	53	396.516	946.360	952.732	843.435	1029.120
q1-iso-qk	76	24.928	644.555	658.625	624.246	726.009

Notes & Comments

For the matrix and vector assembly tests, when the GenericMeasure is used, the refined-based linear bases have the most outstanding performance, followed by qk-iso-q1 and q1-iso-qk bases, and then hqk-iso-q1 and macro bases. When the CompositeMeasure is used, the ranking is the same except that the refined-based bases cannot be used.
For the linear system solver tests, all the linear bases have similar performance, but the refined-based bases have a slightly worse performance than the others.
For the gradient computation tests, the refined-based bases have the best performance, beating other bases by a large margin. The hqk-iso-q1 and macro bases are the slowest and consume significantly more memory than the others.

Appendix

Script for the experiments

```julia using Gridap, Gridap.Adaptivity, BenchmarkTools import FillArrays: Fill import Printf: @printf import Random include("LinearMonomialBases.jl") order, ncell = parse(Int, ARGS[1]), parse(Int, ARGS[2]) u((x, y)) = sin(3.2x * (x - y))cos(x + 4.3y) + sin(4.6 * (x + 2y))cos(2.6(y - 2x)) f(x) = -Δ(u)(x) bc_tags = Dict(:dirichlet_tags => "boundary") model = CartesianDiscreteModel((0, 1, 0, 1), (ncell, ncell)) ref_model = Adaptivity.refine(model, order) reffe1 = ReferenceFE(lagrangian, Float64, order) U = TrialFESpace(FESpace(model, reffe1; bc_tags...), u) dΩ⁺ = Measure(Triangulation(model), 15) function Gridap.FESpaces._compute_cell_ids( uh, strian::BodyFittedTriangulation, ttrian::AdaptedTriangulation) bgmodel = get_background_model(strian) refmodel = get_adapted_model(ttrian) @assert bgmodel === get_parent(refmodel) refmodel.glue.n2o_faces_map[end] end function compute_l2_h1_norms(u, dΩ) l2normsqr = ∑(∫(u * u)dΩ) ∇u = ∇(u) sqrt(l2normsqr), sqrt(l2normsqr + ∑(∫(∇u ⋅ ∇u)dΩ)) end function extract_benchmark(bmark) memo = bmark.memory / (1024^2) tmid, tmean = median(bmark.times) / 1e6, mean(bmark.times) / 1e6 tmin, tmax = extrema(bmark.times) ./ 1e6 length(bmark.times), memo, tmid, tmean, tmin, tmax end function run_experiment(test_basis, target, dΩ) if contains(test_basis, "refined") isa(dΩ, Gridap.CellData.CompositeMeasure) && return reffe2 = ReferenceFE(lagrangian, Float64, 1) V = FESpace(ref_model, reffe2; bc_tags...) else if contains(test_basis, "q1-iso-qk") reffe2 = Q1IsoQkRefFE(Float64, order, num_cell_dims(model)) elseif contains(test_basis, "hqk-iso-q1") reffe2 = Gridap.HQkIsoQ1(Float64, num_cell_dims(model), order) elseif contains(test_basis, "qk-iso-q1") reffe2 = Gridap.QkIsoQ1(Float64, num_cell_dims(model), order) elseif contains(test_basis, "macro") rrule = Adaptivity.RefinementRule(QUAD, (order, order)) poly = get_polytope(rrule) reffe = LagrangianRefFE(Float64, poly, 1) sub_reffes = Fill(reffe, Adaptivity.num_subcells(rrule)) reffe2 = Adaptivity.MacroReferenceFE(rrule, sub_reffes) else reffe2 = reffe1 end V = FESpace(model, reffe2; bc_tags...) end a(u, v) = ∫(∇(v) ⋅ ∇(u))dΩ l(v) = ∫(f * v)dΩ op = AffineFEOperator(a, l, U, V) eh = solve(op) - u fem_l2err, fem_h1err = compute_l2_h1_norms(eh, dΩ⁺) ipl_l2err, ipl_h1err = compute_l2_h1_norms(interpolate(u, U) - u, dΩ⁺) function err_msg(err_type, fem_err, ipl_err) "FEM $(err_type) error $(fem_err) is significantly larger than " * "FE interpolation $(err_type) error $(ipl_err)!" end @assert fem_l2err < ipl_l2err * 2 err_msg("L2", fem_l2err, ipl_l2err) @assert fem_h1err < ipl_h1err * 2 err_msg("H1", fem_h1err, ipl_h1err) rh = interpolate(x -> 1 + x[1] + x[2], V) function j(r) ∇r = ∇(r) ∫(r * r + ∇r ⋅ ∇r)dΩ end for _ in 1:3 solve(AffineFEOperator(a, l, U, V)) assemble_vector(Gridap.gradient(j, rh), V) end if contains(target, "assembly") bmark = @benchmark AffineFEOperator($a, $l, $U, $V) elseif contains(target, "solve") bmark = @benchmark solve($op) else bmark = @benchmark assemble_vector(Gridap.gradient($j, $rh), $V) end @printf "| %s | %d | %.3f | %.3f | %.3f | %.3f | %.3f |\n" test_basis extract_benchmark(bmark)... end BenchmarkTools.DEFAULT_PARAMETERS.seconds = 50 test_bases = ["standard", "refined", "macro", "hqk-iso-q1", "qk-iso-q1", "q1-iso-qk"] targets = ["assembly", "solve", "gradient"] ΩH, Ωh = Triangulation(model), Triangulation(ref_model) dΩs = [Measure(Ωh, order), Measure(ΩH, Ωh, order)] println("order $order $(ncell)x$(ncell) elements") for dΩ in dΩs println("\n\ntype of measure: $(typeof(dΩ).name.name)") for target in targets println("\n========== $target benchmark comparison starts ==========") print("| test basis | evaluation count | memory (MiB) |") println(" time median (ms) | time mean (ms) | time min (ms) | time max (ms) |") println("|:---:|:---:|:---:|:---:|:---:|:---:|:---:|") for test_basis in test_bases run_experiment(test_basis, target, dΩ) end end end ```

JordiManyer commented 3 months ago

Hi @wei3li , thank you very much for this!

The only comment I have is that you are not using the CompositeQuadrature (I think) which is the one that should make the MacroFEBasis efficient. Note CompositeMeasure and CompositeQuadrature are quite different (we need to sort out the names hahaha)...

Regardless, let me go through a first round of optimisations to get try to improve some of these results. There are a couple things in the current Macro elements that I already had flagged as not ideal. In particular, I had not optimised the autodiff. I think I can bring quite a lot of these times down.

I'll let you know when I think we can run the benchmarks again.

wei3li commented 2 months ago

Update on Sep 10, 2024: macro-element implementation has been optimised by @JordiManyer. The latest commit for branch adaptivity is now 6924086.

Three linearised bases are compared, each equipped with the same total integration points over the whole mesh. Trial order $k_U=2$ with $64\times64$ elements on the coarse mesh.

Matrix and vector assembly

test basis	evaluation count	memory (MiB)	time median (ms)	time mean (ms)	time min (ms)	time max (ms)
standard	56	28.342	179.852	179.068	174.260	187.039
refined	53	38.863	186.863	189.322	181.598	355.710
macro	56	34.006	179.661	178.736	173.257	186.880
q1-iso-qk	58	28.295	173.588	174.835	171.254	182.265

Linear system solver

test basis	evaluation count	memory (MiB)	time median (ms)	time mean (ms)	time min (ms)	time max (ms)
standard	205	39.198	49.013	48.988	47.953	50.106
refined	124	56.673	81.095	81.116	79.703	82.986
macro	208	39.198	49.172	48.210	45.824	50.609
q1-iso-qk	204	39.198	49.288	49.210	48.306	50.193

Gradient computation

test basis	evaluation count	memory (MiB)	time median (ms)	time mean (ms)	time min (ms)	time max (ms)
standard	1735	1.636	5.671	5.762	5.559	13.554
refined	2039	1.376	4.834	4.903	4.555	8.090
macro	737	15.465	12.520	13.556	12.294	17.329
q1-iso-qk	1764	1.420	5.583	5.667	5.444	9.338

Now for the macro-element-based linearised test space, only gradient computation is a bit slow.

Appendix

Script for the experiments

```julia using Gridap, Gridap.Adaptivity, BenchmarkTools import FillArrays: Fill import Printf: @printf import Random include("LinearMonomialBases.jl") order, ncell = parse(Int, ARGS[1]), parse(Int, ARGS[2]) u((x, y)) = sin(3.2x * (x - y))cos(x + 4.3y) + sin(4.6 * (x + 2y))cos(2.6(y - 2x)) f(x) = -Δ(u)(x) bc_tags = Dict(:dirichlet_tags => "boundary") model = CartesianDiscreteModel((0, 1, 0, 1), (ncell, ncell)) ref_model = Adaptivity.refine(model, order) reffe1 = ReferenceFE(lagrangian, Float64, order) U = TrialFESpace(FESpace(model, reffe1; bc_tags...), u) ΩH, Ωh = Triangulation(model), Triangulation(ref_model) dΩ⁺ = Measure(ΩH, 15) function Gridap.FESpaces._compute_cell_ids( uh, strian::BodyFittedTriangulation, ttrian::AdaptedTriangulation) bgmodel = get_background_model(strian) refmodel = get_adapted_model(ttrian) @assert bgmodel === get_parent(refmodel) refmodel.glue.n2o_faces_map[end] end function compute_l2_h1_norms(u, dΩ) l2normsqr = ∑(∫(u * u)dΩ) ∇u = ∇(u) sqrt(l2normsqr), sqrt(l2normsqr + ∑(∫(∇u ⋅ ∇u)dΩ)) end function extract_benchmark(bmark) memo = bmark.memory / (1024^2) tmid, tmean = median(bmark.times) / 1e6, mean(bmark.times) / 1e6 tmin, tmax = extrema(bmark.times) ./ 1e6 length(bmark.times), memo, tmid, tmean, tmin, tmax end function run_experiment(test_basis, target) if contains(test_basis, "refined") reffe2 = ReferenceFE(lagrangian, Float64, 1) V = FESpace(ref_model, reffe2; bc_tags...) dΩ = Measure(Ωh, order) else if contains(test_basis, "q1-iso-qk") reffe2 = Q1IsoQkRefFE(Float64, order, num_cell_dims(model)) dΩ = LinearMeasure(ΩH, order) elseif contains(test_basis, "macro") rrule = Adaptivity.RefinementRule(QUAD, (order, order)) poly = get_polytope(rrule) reffe = LagrangianRefFE(Float64, poly, 1) sub_reffes = Fill(reffe, Adaptivity.num_subcells(rrule)) reffe2 = Adaptivity.MacroReferenceFE(rrule, sub_reffes) quad = Quadrature(rrule, 2) dΩ = Measure(ΩH, quad) else reffe2 = reffe1 dΩ = Measure(ΩH, 3order + 1) end V = FESpace(model, reffe2; bc_tags...) end a(u, v) = ∫(∇(v) ⋅ ∇(u))dΩ l(v) = ∫(f * v)dΩ op = AffineFEOperator(a, l, U, V) eh = solve(op) - u fem_l2err, fem_h1err = compute_l2_h1_norms(eh, dΩ⁺) ipl_l2err, ipl_h1err = compute_l2_h1_norms(interpolate(u, U) - u, dΩ⁺) function err_msg(err_type, fem_err, ipl_err) "FEM $(err_type) error $(fem_err) is significantly larger than " * "FE interpolation $(err_type) error $(ipl_err)!" end @assert fem_l2err < ipl_l2err * 2 err_msg("L2", fem_l2err, ipl_l2err) @assert fem_h1err < ipl_h1err * 2 err_msg("H1", fem_h1err, ipl_h1err) rh = interpolate(x -> 1 + x[1] + x[2], V) function j(r) ∇r = ∇(r) ∫(r * r + ∇r ⋅ ∇r)dΩ end for _ in 1:3 solve(AffineFEOperator(a, l, U, V)) assemble_vector(Gridap.gradient(j, rh), V) end if contains(target, "assembly") bmark = @benchmark AffineFEOperator($a, $l, $U, $V) elseif contains(target, "solve") bmark = @benchmark solve($op) else bmark = @benchmark assemble_vector(Gridap.gradient($j, $rh), $V) end @printf "| %s | %d | %.3f | %.3f | %.3f | %.3f | %.3f |\n" test_basis extract_benchmark(bmark)... end BenchmarkTools.DEFAULT_PARAMETERS.seconds = 10 test_bases = ["standard", "refined", "macro", "q1-iso-qk"] targets = ["assembly", "solve", "gradient"] println("\norder $order $(ncell)x$(ncell) elements\n") for target in targets println("\n\n$(target) benchmark starting...\n") print("| test basis | evaluation count | memory (MiB) |") println(" time median (ms) | time mean (ms) | time min (ms) | time max (ms) |") println("|:---:|:---:|:---:|:---:|:---:|:---:|:---:|") for test_basis in test_bases run_experiment(test_basis, target) end end ```

JordiManyer commented 2 months ago

Thanks, @wei3li ! I'll do a second round of optimisations with emphasis on the jacobian. Most probably its some autodiff stuff causing issues.

JordiManyer commented 2 months ago

@wei3li I haven't fixed this yet, but I think I know what it is. Just to confirm: Could you run your benchmarks with performance mode enabled?

I.e start your REPL in the project where you run your benchmarks and do

using Gridap
Gridap.Helpers.set_performance_mode()

then restart julia and run your benchmarks again. If this solves the issue, it means some @check is causing this...

Edit: I am quite positive the culprit is this check. This particular check happens to make sure the operation between CellFields makes sense. Unfortunately, it evaluates all CellFields involved on Triangulation points, which in the case of MacroFEBasis triggers the slow evaluation that searches for the point inside the RefinementRule.

No straighforward way to avoid this, I'm afraid. Performance mode will disable this check, and everything should work great.

wei3li commented 2 months ago

Hi @JordiManyer, I reran the experiment with the Gridap.jl performance mode on Intel Xeon Scalable 'Sapphire Rapids' processors.

The following are the results:

Matrix and vector assembly

test basis	evaluation count	memory (MiB)	time median (ms)	time mean (ms)	time min (ms)	time max (ms)
standard	76	28.207	131.616	131.840	125.831	146.706
refined	74	38.591	135.618	135.179	129.152	143.705
macro	18	148.950	563.024	566.561	557.321	589.650
q1-iso-qk	78	28.186	128.555	128.499	122.037	140.049

Linear system solver

test basis	evaluation count	memory (MiB)	time median (ms)	time mean (ms)	time min (ms)	time max (ms)
standard	221	39.198	47.318	45.282	38.065	53.381
refined	136	56.673	73.666	73.957	70.592	79.824
macro	206	39.198	48.660	48.679	45.669	55.677
q1-iso-qk	226	39.198	46.366	44.319	37.565	51.116

Gradient computation

test basis	evaluation count	memory (MiB)	time median (ms)	time mean (ms)	time min (ms)	time max (ms)
standard	2034	0.840	4.832	4.914	4.590	8.879
refined	2839	0.659	3.470	3.520	3.264	13.074
macro	132	91.272	74.566	75.897	70.850	84.675
q1-iso-qk	2060	0.787	4.755	4.852	4.572	8.539

The AD part in Gridap is probably very involved...

gridap / Gridap.jl