wei3li opened this issue 3 months ago
Hi @wei3li , thank you very much for this!
The only comment I have is that you are not using the `CompositeQuadrature` (I think), which is the one that should make the `MacroFEBasis` efficient. Note that `CompositeMeasure` and `CompositeQuadrature` are quite different (we need to sort out the names hahaha)...
Regardless, let me go through a first round of optimisations to try to improve some of these results. There are a couple of things in the current macro elements that I had already flagged as not ideal. In particular, I had not optimised the autodiff. I think I can bring quite a lot of these times down.
I'll let you know when I think we can run the benchmarks again.
Update on Sep 10, 2024: the macro-element implementation has been optimised by @JordiManyer. The latest commit on the `adaptivity` branch is now `6924086`.
Three linearised bases (plus the standard high-order basis as a reference) are compared, each equipped with the same total number of integration points over the whole mesh. Trial order $k_U=2$ with $64\times64$ elements on the coarse mesh.
test basis | evaluation count | memory (MiB) | time median (ms) | time mean (ms) | time min (ms) | time max (ms) |
---|---|---|---|---|---|---|
standard | 56 | 28.342 | 179.852 | 179.068 | 174.260 | 187.039 |
refined | 53 | 38.863 | 186.863 | 189.322 | 181.598 | 355.710 |
macro | 56 | 34.006 | 179.661 | 178.736 | 173.257 | 186.880 |
q1-iso-qk | 58 | 28.295 | 173.588 | 174.835 | 171.254 | 182.265 |
test basis | evaluation count | memory (MiB) | time median (ms) | time mean (ms) | time min (ms) | time max (ms) |
---|---|---|---|---|---|---|
standard | 205 | 39.198 | 49.013 | 48.988 | 47.953 | 50.106 |
refined | 124 | 56.673 | 81.095 | 81.116 | 79.703 | 82.986 |
macro | 208 | 39.198 | 49.172 | 48.210 | 45.824 | 50.609 |
q1-iso-qk | 204 | 39.198 | 49.288 | 49.210 | 48.306 | 50.193 |
test basis | evaluation count | memory (MiB) | time median (ms) | time mean (ms) | time min (ms) | time max (ms) |
---|---|---|---|---|---|---|
standard | 1735 | 1.636 | 5.671 | 5.762 | 5.559 | 13.554 |
refined | 2039 | 1.376 | 4.834 | 4.903 | 4.555 | 8.090 |
macro | 737 | 15.465 | 12.520 | 13.556 | 12.294 | 17.329 |
q1-iso-qk | 1764 | 1.420 | 5.583 | 5.667 | 5.444 | 9.338 |
Now, for the macro-element-based linearised test space, only the gradient computation remains somewhat slow.
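As a sanity check on the "same total integration points" setup above: with a tensor-product rule of $n$ points per direction on each coarse cell, the mesh refined $k$ times per direction needs $n/k$ points per direction on each subcell to match. A minimal sketch, where $k=2$ and the $64\times64$ coarse mesh come from the tables above, while `n = 4` is an assumed illustrative value (the actual rule degrees are not stated here):

```julia
# Total quadrature points for the coarse mesh and the k-times-refined mesh.
# k and the 64×64 coarse mesh come from the tables above; n (points per
# direction per coarse cell) is an assumed illustrative value.
k, ncoarse = 2, 64
n = 4
coarse_total  = ncoarse^2 * n^2              # coarse cells × points per cell
refined_total = (k * ncoarse)^2 * (n ÷ k)^2  # k² times more cells, (n/k)² points each
# coarse_total == refined_total == 65536
```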
Thanks, @wei3li! I'll do a second round of optimisations with emphasis on the Jacobian. Most probably it's some autodiff stuff causing issues.
@wei3li I haven't fixed this yet, but I think I know what it is. Just to confirm: could you run your benchmarks with performance mode enabled? I.e., start your REPL in the project where you run your benchmarks and do

```julia
using Gridap
Gridap.Helpers.set_performance_mode()
```

then restart Julia and run your benchmarks again. If this solves the issue, it means some `@check` is causing this...
Edit: I am quite positive the culprit is this check. This particular check makes sure the operation between `CellField`s makes sense. Unfortunately, it evaluates all the `CellField`s involved at `Triangulation` points, which in the case of a `MacroFEBasis` triggers the slow evaluation that searches for the point inside the `RefinementRule`.
There is no straightforward way to avoid this, I'm afraid. Performance mode will disable this check, and everything should work great.
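The mechanism at play can be illustrated with a small standalone sketch (not Gridap's actual implementation): a compile-time flag decides, at macro-expansion time, whether a consistency check is emitted at all, so in performance mode the expensive evaluation never runs. The names `CHECKS_ENABLED`, `@maybe_check`, and `expensive_consistency_check` are all hypothetical.

```julia
# Illustrative sketch of how a performance-mode flag can compile checks away.
# When CHECKS_ENABLED is false, the macro expands to `nothing`, so the costly
# check is removed from the compiled code entirely (hence the Julia restart
# needed after toggling the real flag).
const CHECKS_ENABLED = false   # toggled by something like set_performance_mode()

macro maybe_check(cond, msg)
  if CHECKS_ENABLED
    return :( $(esc(cond)) || error($(esc(msg))) )
  else
    return :( nothing )
  end
end

# Stand-in for the slow CellField evaluation triggered by the real check.
expensive_consistency_check() = (sleep(0.1); true)

function combine_fields(a, b)
  @maybe_check expensive_consistency_check() "fields are not compatible"
  a + b
end
```

With `CHECKS_ENABLED = false`, `combine_fields(1, 2)` returns `3` without ever calling the expensive check.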
Hi @JordiManyer, I reran the experiment with the Gridap.jl performance mode enabled, on Intel Xeon Scalable 'Sapphire Rapids' processors. The following are the results:
test basis | evaluation count | memory (MiB) | time median (ms) | time mean (ms) | time min (ms) | time max (ms) |
---|---|---|---|---|---|---|
standard | 76 | 28.207 | 131.616 | 131.840 | 125.831 | 146.706 |
refined | 74 | 38.591 | 135.618 | 135.179 | 129.152 | 143.705 |
macro | 18 | 148.950 | 563.024 | 566.561 | 557.321 | 589.650 |
q1-iso-qk | 78 | 28.186 | 128.555 | 128.499 | 122.037 | 140.049 |
test basis | evaluation count | memory (MiB) | time median (ms) | time mean (ms) | time min (ms) | time max (ms) |
---|---|---|---|---|---|---|
standard | 221 | 39.198 | 47.318 | 45.282 | 38.065 | 53.381 |
refined | 136 | 56.673 | 73.666 | 73.957 | 70.592 | 79.824 |
macro | 206 | 39.198 | 48.660 | 48.679 | 45.669 | 55.677 |
q1-iso-qk | 226 | 39.198 | 46.366 | 44.319 | 37.565 | 51.116 |
test basis | evaluation count | memory (MiB) | time median (ms) | time mean (ms) | time min (ms) | time max (ms) |
---|---|---|---|---|---|---|
standard | 2034 | 0.840 | 4.832 | 4.914 | 4.590 | 8.879 |
refined | 2839 | 0.659 | 3.470 | 3.520 | 3.264 | 13.074 |
macro | 132 | 91.272 | 74.566 | 75.897 | 70.850 | 84.675 |
q1-iso-qk | 2060 | 0.787 | 4.755 | 4.852 | 4.572 | 8.539 |
The AD part in Gridap is probably very involved...
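For context on why the gradient benchmark behaves so differently from assembly: Gridap's autodiff builds on forward-mode dual-number arithmetic (via ForwardDiff.jl), which propagates derivatives through every cell-local operation. A dependency-free toy version of that mechanism, with a hypothetical `Dual` type:

```julia
# Minimal forward-mode dual number, sketching the kind of arithmetic that
# ForwardDiff-style autodiff performs inside cell-wise gradient assembly.
# This is an illustrative toy, not Gridap's implementation.
struct Dual
  val::Float64   # primal value
  der::Float64   # derivative carried alongside it
end
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)

# d/dx (x*x + x) at x = 3 is 2*3 + 1 = 7
x = Dual(3.0, 1.0)   # seed the derivative with 1
y = x * x + x
# y.val == 12.0, y.der == 7.0
```

Every extra check or indirection in the cell-field evaluation path is paid once per dual component, which is one reason the gradient benchmark amplifies overheads that assembly barely notices.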
Background
There are currently five implementations of linearised bases in Gridap: the linear bases on the refined mesh, the macro-element bases recently implemented by @JordiManyer, the hqk-iso-q1 and qk-iso-q1 bases that @amartinhuertas implemented recently, and the q1-iso-qk bases that @wei3li implemented. This issue presents the experimental results comparing the performance of these bases. As a reference, the standard high-order test basis is also included in the tests.
Experiment settings
- Branch `adaptivity` at the last commit `fedd16d`, and branch `refined-discrete-models-linearized-fe-space` at the last commit `3a1894b`.
- The measure is either a `GenericMeasure` on the refined triangulation or a `CompositeMeasure`.
Results
$k_U=2$, $100\times100$ elements
`GenericMeasure`
Matrix and vector assembly
Linear system solver
Gradient computation
`CompositeMeasure`
Matrix and vector assembly
Linear system solver
Gradient computation
$k_U=4$, $50\times50$ elements
`GenericMeasure`
Matrix and vector assembly
Linear system solver
Gradient computation
`CompositeMeasure`
Matrix and vector assembly
Linear system solver
Gradient computation
Notes & Comments
When the `GenericMeasure` is used, the refined-based linear bases have the most outstanding performance, followed by the qk-iso-q1 and q1-iso-qk bases, and then the hqk-iso-q1 and macro bases. When the `CompositeMeasure` is used, the ranking is the same, except that the refined-based bases cannot be used.
Appendix
Script for the experiments
```julia
using Gridap, Gridap.Adaptivity, BenchmarkTools
import FillArrays: Fill
import Printf: @printf
import Random

include("LinearMonomialBases.jl")

order, ncell = parse(Int, ARGS[1]), parse(Int, ARGS[2])

u((x, y)) = sin(3.2x * (x - y))cos(x + 4.3y) + sin(4.6 * (x + 2y))cos(2.6(y - 2x))
f(x) = -Δ(u)(x)

bc_tags = Dict(:dirichlet_tags => "boundary")

model = CartesianDiscreteModel((0, 1, 0, 1), (ncell, ncell))
ref_model = Adaptivity.refine(model, order)

reffe1 = ReferenceFE(lagrangian, Float64, order)
U = TrialFESpace(FESpace(model, reffe1; bc_tags...), u)

dΩ⁺ = Measure(Triangulation(model), 15)

function Gridap.FESpaces._compute_cell_ids(
    uh, strian::BodyFittedTriangulation, ttrian::AdaptedTriangulation)
  bgmodel = get_background_model(strian)
  refmodel = get_adapted_model(ttrian)
  @assert bgmodel === get_parent(refmodel)
  refmodel.glue.n2o_faces_map[end]
end

function compute_l2_h1_norms(u, dΩ)
  l2normsqr = ∑(∫(u * u)dΩ)
  ∇u = ∇(u)
  sqrt(l2normsqr), sqrt(l2normsqr + ∑(∫(∇u ⋅ ∇u)dΩ))
end

function extract_benchmark(bmark)
  memo = bmark.memory / (1024^2)
  tmid, tmean = median(bmark.times) / 1e6, mean(bmark.times) / 1e6
  tmin, tmax = extrema(bmark.times) ./ 1e6
  length(bmark.times), memo, tmid, tmean, tmin, tmax
end

function run_experiment(test_basis, target, dΩ)
  if contains(test_basis, "refined")
    isa(dΩ, Gridap.CellData.CompositeMeasure) && return
    reffe2 = ReferenceFE(lagrangian, Float64, 1)
    V = FESpace(ref_model, reffe2; bc_tags...)
  else
    if contains(test_basis, "q1-iso-qk")
      reffe2 = Q1IsoQkRefFE(Float64, order, num_cell_dims(model))
    elseif contains(test_basis, "hqk-iso-q1")
      reffe2 = Gridap.HQkIsoQ1(Float64, num_cell_dims(model), order)
    elseif contains(test_basis, "qk-iso-q1")
      reffe2 = Gridap.QkIsoQ1(Float64, num_cell_dims(model), order)
    elseif contains(test_basis, "macro")
      rrule = Adaptivity.RefinementRule(QUAD, (order, order))
      poly = get_polytope(rrule)
      reffe = LagrangianRefFE(Float64, poly, 1)
      sub_reffes = Fill(reffe, Adaptivity.num_subcells(rrule))
      reffe2 = Adaptivity.MacroReferenceFE(rrule, sub_reffes)
    else
      reffe2 = reffe1
    end
    V = FESpace(model, reffe2; bc_tags...)
  end

  a(u, v) = ∫(∇(v) ⋅ ∇(u))dΩ
  l(v) = ∫(f * v)dΩ
  op = AffineFEOperator(a, l, U, V)
  eh = solve(op) - u

  fem_l2err, fem_h1err = compute_l2_h1_norms(eh, dΩ⁺)
  ipl_l2err, ipl_h1err = compute_l2_h1_norms(interpolate(u, U) - u, dΩ⁺)

  function err_msg(err_type, fem_err, ipl_err)
    "FEM $(err_type) error $(fem_err) is significantly larger than " *
    "FE interpolation $(err_type) error $(ipl_err)!"
  end
  @assert fem_l2err < ipl_l2err * 2 err_msg("L2", fem_l2err, ipl_l2err)
  @assert fem_h1err < ipl_h1err * 2 err_msg("H1", fem_h1err, ipl_h1err)

  rh = interpolate(x -> 1 + x[1] + x[2], V)
  function j(r)
    ∇r = ∇(r)
    ∫(r * r + ∇r ⋅ ∇r)dΩ
  end

  # Warm-up runs before benchmarking
  for _ in 1:3
    solve(AffineFEOperator(a, l, U, V))
    assemble_vector(Gridap.gradient(j, rh), V)
  end

  if contains(target, "assembly")
    bmark = @benchmark AffineFEOperator($a, $l, $U, $V)
  elseif contains(target, "solve")
    bmark = @benchmark solve($op)
  else
    bmark = @benchmark assemble_vector(Gridap.gradient($j, $rh), $V)
  end

  @printf "| %s | %d | %.3f | %.3f | %.3f | %.3f | %.3f |\n" test_basis extract_benchmark(bmark)...
end

BenchmarkTools.DEFAULT_PARAMETERS.seconds = 50

test_bases = ["standard", "refined", "macro", "hqk-iso-q1", "qk-iso-q1", "q1-iso-qk"]
targets = ["assembly", "solve", "gradient"]

ΩH, Ωh = Triangulation(model), Triangulation(ref_model)
dΩs = [Measure(Ωh, order), Measure(ΩH, Ωh, order)]

println("order $order $(ncell)x$(ncell) elements")
for dΩ in dΩs
  println("\n\ntype of measure: $(typeof(dΩ).name.name)")
  for target in targets
    println("\n========== $target benchmark comparison starts ==========")
    print("| test basis | evaluation count | memory (MiB) |")
    println(" time median (ms) | time mean (ms) | time min (ms) | time max (ms) |")
    println("|:---:|:---:|:---:|:---:|:---:|:---:|:---:|")
    for test_basis in test_bases
      run_experiment(test_basis, target, dΩ)
    end
  end
end
```