gridap / Gridap.jl

Grid-based approximation of partial differential equations in Julia
MIT License
714 stars 97 forks source link

Comparison of performance among different linearised FE bases #1023

Open wei3li opened 3 months ago

wei3li commented 3 months ago

Background

There are currently five implementations of linearised bases in Gridap: the linear bases on the refined mesh, the macro-element bases lately implemented by @JordiManyer, the hqk-iso-q1 and qk-iso-q1 bases that @amartinhuertas implemented recently and the q1-iso-qk bases @wei3li implemented. This issue presents the experiment results comparing the performance of these bases. As a reference, the standard high-order test basis is also included in the tests.

Experiment settings

Results

$k_U=2$, $100\times100$ elements

GenericMeasure

Matrix and vector assembly
test basis evaluation count memory (MiB) time median (ms) time mean (ms) time min (ms) time max (ms)
standard 81 118.249 601.674 617.122 554.323 1508.448
refined 101 87.152 500.411 495.346 450.780 527.661
macro 74 228.233 673.310 683.508 605.975 1165.440
hqk-iso-q1 70 254.843 709.548 721.288 621.773 1155.562
qk-iso-q1 83 135.885 603.984 607.386 559.791 1095.971
q1-iso-qk 84 118.126 592.466 595.443 548.548 1066.729
Linear system solver
test basis evaluation count memory (MiB) time median (ms) time mean (ms) time min (ms) time max (ms)
standard 246 97.105 203.003 203.365 140.202 252.466
refined 143 159.734 356.640 350.272 240.478 839.505
macro 209 97.105 245.238 239.315 167.269 261.727
hqk-iso-q1 207 97.105 245.711 242.456 168.101 270.206
qk-iso-q1 209 97.105 243.747 239.700 168.415 262.383
q1-iso-qk 212 97.105 244.835 236.591 165.827 326.457
Gradient computation
test basis evaluation count memory (MiB) time median (ms) time mean (ms) time min (ms) time max (ms)
standard 1008 12.684 49.936 49.573 43.264 60.129
refined 4182 1.651 11.703 11.933 10.499 23.960
macro 112 471.928 464.568 449.553 340.901 493.479
hqk-iso-q1 88 570.564 599.284 570.029 417.078 629.738
qk-iso-q1 436 74.397 119.699 114.891 81.411 133.660
q1-iso-qk 985 12.211 50.504 50.764 43.212 60.455

CompositeMeasure

Matrix and vector assembly
test basis evaluation count memory (MiB) time median (ms) time mean (ms) time min (ms) time max (ms)
standard 100 76.268 501.499 501.428 471.969 551.721
macro 85 186.253 595.881 592.679 539.421 642.461
hqk-iso-q1 81 212.863 625.005 620.765 551.696 646.356
qk-iso-q1 93 93.905 543.263 541.403 497.973 565.248
q1-iso-qk 99 76.146 506.167 507.026 466.603 545.932
Linear system solver
test basis evaluation count memory (MiB) time median (ms) time mean (ms) time min (ms) time max (ms)
standard 208 97.105 243.940 240.931 166.541 260.470
macro 211 97.105 240.517 237.437 166.311 254.555
hqk-iso-q1 208 97.105 242.114 240.551 170.647 255.955
qk-iso-q1 209 97.105 240.733 239.253 167.338 258.390
q1-iso-qk 205 97.105 245.872 244.404 165.923 258.692
Gradient computation
test basis evaluation count memory (MiB) time median (ms) time mean (ms) time min (ms) time max (ms)
standard 896 15.500 55.823 55.787 46.525 66.228
macro 103 474.745 495.679 488.819 355.624 521.199
hqk-iso-q1 80 573.380 632.745 627.103 421.270 656.142
qk-iso-q1 421 77.214 121.660 118.903 90.275 140.181
q1-iso-qk 931 15.028 54.173 53.728 45.628 65.903

$k_U=4$, $50\times50$ elements

GenericMeasure

Matrix and vector assembly
test basis evaluation count memory (MiB) time median (ms) time mean (ms) time min (ms) time max (ms)
standard 23 511.088 2142.683 2182.792 2040.758 3218.346
refined 46 193.999 1096.248 1107.764 1029.602 1510.023
macro 11 2847.378 4569.994 4629.011 4396.836 5484.642
hqk-iso-q1 16 1479.299 3170.421 3184.379 2998.272 3872.120
qk-iso-q1 22 636.913 2287.823 2317.330 2182.037 3275.423
q1-iso-qk 28 510.276 1739.053 1816.646 1698.290 2814.544
Linear system solver
test basis evaluation count memory (MiB) time median (ms) time mean (ms) time min (ms) time max (ms)
standard 149 163.848 353.452 337.094 214.808 370.407
refined 110 380.753 454.400 456.432 338.113 1029.666
macro 144 163.848 356.063 348.772 219.289 391.376
hqk-iso-q1 146 163.848 358.913 342.579 221.435 383.034
qk-iso-q1 163 163.848 340.255 307.160 215.376 469.293
q1-iso-qk 163 163.848 339.354 306.980 216.203 367.015
Gradient computation
test basis evaluation count memory (MiB) time median (ms) time mean (ms) time min (ms) time max (ms)
standard 63 21.168 793.975 794.513 782.569 822.039
refined 2841 1.687 17.398 17.575 15.891 41.881
macro 6 8291.730 8871.615 8883.648 8835.629 8955.283
hqk-iso-q1 12 3373.315 4242.826 4271.691 4155.736 4401.237
qk-iso-q1 43 390.172 1182.750 1178.850 1061.146 1256.847
q1-iso-qk 64 18.584 790.719 791.845 780.108 818.857

CompositeMeasure

Matrix and vector assembly
test basis evaluation count memory (MiB) time median (ms) time mean (ms) time min (ms) time max (ms)
standard 33 160.284 1491.290 1519.152 1413.872 2552.225
macro 16 2496.574 3310.152 3338.588 3173.741 3827.743
hqk-iso-q1 21 1128.494 2352.375 2403.960 2208.001 3595.075
qk-iso-q1 33 286.109 1524.338 1561.950 1460.615 2726.133
q1-iso-qk 36 159.472 1406.520 1429.197 1386.492 2074.379
Linear system solver
test basis evaluation count memory (MiB) time median (ms) time mean (ms) time min (ms) time max (ms)
standard 140 163.848 360.311 357.244 223.392 378.598
macro 156 163.848 347.917 320.725 218.833 370.570
hqk-iso-q1 141 163.848 362.729 356.829 224.550 374.047
qk-iso-q1 143 163.848 356.111 351.339 221.776 369.621
q1-iso-qk 163 163.848 343.255 307.531 217.795 372.579
Gradient computation
test basis evaluation count memory (MiB) time median (ms) time mean (ms) time min (ms) time max (ms)
standard 64 27.512 782.116 781.978 769.104 800.036
macro 8 8298.074 6377.383 6365.425 6270.046 6489.458
hqk-iso-q1 15 3379.659 3357.935 3342.675 3167.876 3443.564
qk-iso-q1 53 396.516 946.360 952.732 843.435 1029.120
q1-iso-qk 76 24.928 644.555 658.625 624.246 726.009

Notes & Comments

  1. For the matrix and vector assembly tests, when the GenericMeasure is used, the refined-based linear bases have the most outstanding performance, followed by qk-iso-q1 and q1-iso-qk bases, and then hqk-iso-q1 and macro bases. When the CompositeMeasure is used, the ranking is the same except that the refined-based bases cannot be used.
  2. For the linear system solver tests, all the linear bases have similar performance, but the refined-based bases have a slightly worse performance than the others.
  3. For the gradient computation tests, the refined-based bases have the best performance, beating other bases by a large margin. The hqk-iso-q1 and macro bases are the slowest and consume significantly more memory than the others.

Appendix

Script for the experiments ```julia using Gridap, Gridap.Adaptivity, BenchmarkTools import FillArrays: Fill import Printf: @printf import Random include("LinearMonomialBases.jl") order, ncell = parse(Int, ARGS[1]), parse(Int, ARGS[2]) u((x, y)) = sin(3.2x * (x - y))cos(x + 4.3y) + sin(4.6 * (x + 2y))cos(2.6(y - 2x)) f(x) = -Δ(u)(x) bc_tags = Dict(:dirichlet_tags => "boundary") model = CartesianDiscreteModel((0, 1, 0, 1), (ncell, ncell)) ref_model = Adaptivity.refine(model, order) reffe1 = ReferenceFE(lagrangian, Float64, order) U = TrialFESpace(FESpace(model, reffe1; bc_tags...), u) dΩ⁺ = Measure(Triangulation(model), 15) function Gridap.FESpaces._compute_cell_ids( uh, strian::BodyFittedTriangulation, ttrian::AdaptedTriangulation) bgmodel = get_background_model(strian) refmodel = get_adapted_model(ttrian) @assert bgmodel === get_parent(refmodel) refmodel.glue.n2o_faces_map[end] end function compute_l2_h1_norms(u, dΩ) l2normsqr = ∑(∫(u * u)dΩ) ∇u = ∇(u) sqrt(l2normsqr), sqrt(l2normsqr + ∑(∫(∇u ⋅ ∇u)dΩ)) end function extract_benchmark(bmark) memo = bmark.memory / (1024^2) tmid, tmean = median(bmark.times) / 1e6, mean(bmark.times) / 1e6 tmin, tmax = extrema(bmark.times) ./ 1e6 length(bmark.times), memo, tmid, tmean, tmin, tmax end function run_experiment(test_basis, target, dΩ) if contains(test_basis, "refined") isa(dΩ, Gridap.CellData.CompositeMeasure) && return reffe2 = ReferenceFE(lagrangian, Float64, 1) V = FESpace(ref_model, reffe2; bc_tags...) else if contains(test_basis, "q1-iso-qk") reffe2 = Q1IsoQkRefFE(Float64, order, num_cell_dims(model)) elseif contains(test_basis, "hqk-iso-q1") reffe2 = Gridap.HQkIsoQ1(Float64, num_cell_dims(model), order) elseif contains(test_basis, "qk-iso-q1") reffe2 = Gridap.QkIsoQ1(Float64, num_cell_dims(model), order) elseif contains(test_basis, "macro") rrule = Adaptivity.RefinementRule(QUAD, (order, order)) poly = get_polytope(rrule) reffe = LagrangianRefFE(Float64, poly, 1) sub_reffes = Fill(reffe, Adaptivity.num_subcells(rrule)) reffe2 = Adaptivity.MacroReferenceFE(rrule, sub_reffes) else reffe2 = reffe1 end V = FESpace(model, reffe2; bc_tags...) end a(u, v) = ∫(∇(v) ⋅ ∇(u))dΩ l(v) = ∫(f * v)dΩ op = AffineFEOperator(a, l, U, V) eh = solve(op) - u fem_l2err, fem_h1err = compute_l2_h1_norms(eh, dΩ⁺) ipl_l2err, ipl_h1err = compute_l2_h1_norms(interpolate(u, U) - u, dΩ⁺) function err_msg(err_type, fem_err, ipl_err) "FEM $(err_type) error $(fem_err) is significantly larger than " * "FE interpolation $(err_type) error $(ipl_err)!" end @assert fem_l2err < ipl_l2err * 2 err_msg("L2", fem_l2err, ipl_l2err) @assert fem_h1err < ipl_h1err * 2 err_msg("H1", fem_h1err, ipl_h1err) rh = interpolate(x -> 1 + x[1] + x[2], V) function j(r) ∇r = ∇(r) ∫(r * r + ∇r ⋅ ∇r)dΩ end for _ in 1:3 solve(AffineFEOperator(a, l, U, V)) assemble_vector(Gridap.gradient(j, rh), V) end if contains(target, "assembly") bmark = @benchmark AffineFEOperator($a, $l, $U, $V) elseif contains(target, "solve") bmark = @benchmark solve($op) else bmark = @benchmark assemble_vector(Gridap.gradient($j, $rh), $V) end @printf "| %s | %d | %.3f | %.3f | %.3f | %.3f | %.3f |\n" test_basis extract_benchmark(bmark)... end BenchmarkTools.DEFAULT_PARAMETERS.seconds = 50 test_bases = ["standard", "refined", "macro", "hqk-iso-q1", "qk-iso-q1", "q1-iso-qk"] targets = ["assembly", "solve", "gradient"] ΩH, Ωh = Triangulation(model), Triangulation(ref_model) dΩs = [Measure(Ωh, order), Measure(ΩH, Ωh, order)] println("order $order $(ncell)x$(ncell) elements") for dΩ in dΩs println("\n\ntype of measure: $(typeof(dΩ).name.name)") for target in targets println("\n========== $target benchmark comparison starts ==========") print("| test basis | evaluation count | memory (MiB) |") println(" time median (ms) | time mean (ms) | time min (ms) | time max (ms) |") println("|:---:|:---:|:---:|:---:|:---:|:---:|:---:|") for test_basis in test_bases run_experiment(test_basis, target, dΩ) end end end ```
JordiManyer commented 3 months ago

Hi @wei3li , thank you very much for this!

The only comment I have is that you are not using the CompositeQuadrature (I think) which is the one that should make the MacroFEBasis efficient. Note CompositeMeasure and CompositeQuadrature are quite different (we need to sort out the names hahaha)...

Regardless, let me go through a first round of optimisations to get try to improve some of these results. There are a couple things in the current Macro elements that I already had flagged as not ideal. In particular, I had not optimised the autodiff. I think I can bring quite a lot of these times down.

I'll let you know when I think we can run the benchmarks again.

wei3li commented 2 months ago

Update on Sep 10, 2024: macro-element implementation has been optimised by @JordiManyer. The latest commit for branch adaptivity is now 6924086.

Three linearised bases are compared, each equipped with the same total integration points over the whole mesh. Trial order $k_U=2$ with $64\times64$ elements on the coarse mesh.

Matrix and vector assembly

test basis evaluation count memory (MiB) time median (ms) time mean (ms) time min (ms) time max (ms)
standard 56 28.342 179.852 179.068 174.260 187.039
refined 53 38.863 186.863 189.322 181.598 355.710
macro 56 34.006 179.661 178.736 173.257 186.880
q1-iso-qk 58 28.295 173.588 174.835 171.254 182.265

Linear system solver

test basis evaluation count memory (MiB) time median (ms) time mean (ms) time min (ms) time max (ms)
standard 205 39.198 49.013 48.988 47.953 50.106
refined 124 56.673 81.095 81.116 79.703 82.986
macro 208 39.198 49.172 48.210 45.824 50.609
q1-iso-qk 204 39.198 49.288 49.210 48.306 50.193

Gradient computation

test basis evaluation count memory (MiB) time median (ms) time mean (ms) time min (ms) time max (ms)
standard 1735 1.636 5.671 5.762 5.559 13.554
refined 2039 1.376 4.834 4.903 4.555 8.090
macro 737 15.465 12.520 13.556 12.294 17.329
q1-iso-qk 1764 1.420 5.583 5.667 5.444 9.338

Now for the macro-element-based linearised test space, only gradient computation is a bit slow.

Appendix

Script for the experiments ```julia using Gridap, Gridap.Adaptivity, BenchmarkTools import FillArrays: Fill import Printf: @printf import Random include("LinearMonomialBases.jl") order, ncell = parse(Int, ARGS[1]), parse(Int, ARGS[2]) u((x, y)) = sin(3.2x * (x - y))cos(x + 4.3y) + sin(4.6 * (x + 2y))cos(2.6(y - 2x)) f(x) = -Δ(u)(x) bc_tags = Dict(:dirichlet_tags => "boundary") model = CartesianDiscreteModel((0, 1, 0, 1), (ncell, ncell)) ref_model = Adaptivity.refine(model, order) reffe1 = ReferenceFE(lagrangian, Float64, order) U = TrialFESpace(FESpace(model, reffe1; bc_tags...), u) ΩH, Ωh = Triangulation(model), Triangulation(ref_model) dΩ⁺ = Measure(ΩH, 15) function Gridap.FESpaces._compute_cell_ids( uh, strian::BodyFittedTriangulation, ttrian::AdaptedTriangulation) bgmodel = get_background_model(strian) refmodel = get_adapted_model(ttrian) @assert bgmodel === get_parent(refmodel) refmodel.glue.n2o_faces_map[end] end function compute_l2_h1_norms(u, dΩ) l2normsqr = ∑(∫(u * u)dΩ) ∇u = ∇(u) sqrt(l2normsqr), sqrt(l2normsqr + ∑(∫(∇u ⋅ ∇u)dΩ)) end function extract_benchmark(bmark) memo = bmark.memory / (1024^2) tmid, tmean = median(bmark.times) / 1e6, mean(bmark.times) / 1e6 tmin, tmax = extrema(bmark.times) ./ 1e6 length(bmark.times), memo, tmid, tmean, tmin, tmax end function run_experiment(test_basis, target) if contains(test_basis, "refined") reffe2 = ReferenceFE(lagrangian, Float64, 1) V = FESpace(ref_model, reffe2; bc_tags...) dΩ = Measure(Ωh, order) else if contains(test_basis, "q1-iso-qk") reffe2 = Q1IsoQkRefFE(Float64, order, num_cell_dims(model)) dΩ = LinearMeasure(ΩH, order) elseif contains(test_basis, "macro") rrule = Adaptivity.RefinementRule(QUAD, (order, order)) poly = get_polytope(rrule) reffe = LagrangianRefFE(Float64, poly, 1) sub_reffes = Fill(reffe, Adaptivity.num_subcells(rrule)) reffe2 = Adaptivity.MacroReferenceFE(rrule, sub_reffes) quad = Quadrature(rrule, 2) dΩ = Measure(ΩH, quad) else reffe2 = reffe1 dΩ = Measure(ΩH, 3order + 1) end V = FESpace(model, reffe2; bc_tags...) end a(u, v) = ∫(∇(v) ⋅ ∇(u))dΩ l(v) = ∫(f * v)dΩ op = AffineFEOperator(a, l, U, V) eh = solve(op) - u fem_l2err, fem_h1err = compute_l2_h1_norms(eh, dΩ⁺) ipl_l2err, ipl_h1err = compute_l2_h1_norms(interpolate(u, U) - u, dΩ⁺) function err_msg(err_type, fem_err, ipl_err) "FEM $(err_type) error $(fem_err) is significantly larger than " * "FE interpolation $(err_type) error $(ipl_err)!" end @assert fem_l2err < ipl_l2err * 2 err_msg("L2", fem_l2err, ipl_l2err) @assert fem_h1err < ipl_h1err * 2 err_msg("H1", fem_h1err, ipl_h1err) rh = interpolate(x -> 1 + x[1] + x[2], V) function j(r) ∇r = ∇(r) ∫(r * r + ∇r ⋅ ∇r)dΩ end for _ in 1:3 solve(AffineFEOperator(a, l, U, V)) assemble_vector(Gridap.gradient(j, rh), V) end if contains(target, "assembly") bmark = @benchmark AffineFEOperator($a, $l, $U, $V) elseif contains(target, "solve") bmark = @benchmark solve($op) else bmark = @benchmark assemble_vector(Gridap.gradient($j, $rh), $V) end @printf "| %s | %d | %.3f | %.3f | %.3f | %.3f | %.3f |\n" test_basis extract_benchmark(bmark)... end BenchmarkTools.DEFAULT_PARAMETERS.seconds = 10 test_bases = ["standard", "refined", "macro", "q1-iso-qk"] targets = ["assembly", "solve", "gradient"] println("\norder $order $(ncell)x$(ncell) elements\n") for target in targets println("\n\n$(target) benchmark starting...\n") print("| test basis | evaluation count | memory (MiB) |") println(" time median (ms) | time mean (ms) | time min (ms) | time max (ms) |") println("|:---:|:---:|:---:|:---:|:---:|:---:|:---:|") for test_basis in test_bases run_experiment(test_basis, target) end end ```
JordiManyer commented 2 months ago

Thanks, @wei3li ! I'll do a second round of optimisations with emphasis on the jacobian. Most probably its some autodiff stuff causing issues.

JordiManyer commented 2 months ago

@wei3li I haven't fixed this yet, but I think I know what it is. Just to confirm: Could you run your benchmarks with performance mode enabled?

I.e start your REPL in the project where you run your benchmarks and do

using Gridap
Gridap.Helpers.set_performance_mode()

then restart julia and run your benchmarks again. If this solves the issue, it means some @check is causing this...


Edit: I am quite positive the culprit is this check. This particular check happens to make sure the operation between CellFields makes sense. Unfortunately, it evaluates all CellFields involved on Triangulation points, which in the case of MacroFEBasis triggers the slow evaluation that searches for the point inside the RefinementRule.

No straighforward way to avoid this, I'm afraid. Performance mode will disable this check, and everything should work great.

wei3li commented 2 months ago

Hi @JordiManyer, I reran the experiment with the Gridap.jl performance mode on Intel Xeon Scalable 'Sapphire Rapids' processors.

The following are the results:

Matrix and vector assembly

test basis evaluation count memory (MiB) time median (ms) time mean (ms) time min (ms) time max (ms)
standard 76 28.207 131.616 131.840 125.831 146.706
refined 74 38.591 135.618 135.179 129.152 143.705
macro 18 148.950 563.024 566.561 557.321 589.650
q1-iso-qk 78 28.186 128.555 128.499 122.037 140.049

Linear system solver

test basis evaluation count memory (MiB) time median (ms) time mean (ms) time min (ms) time max (ms)
standard 221 39.198 47.318 45.282 38.065 53.381
refined 136 56.673 73.666 73.957 70.592 79.824
macro 206 39.198 48.660 48.679 45.669 55.677
q1-iso-qk 226 39.198 46.366 44.319 37.565 51.116

Gradient computation

test basis evaluation count memory (MiB) time median (ms) time mean (ms) time min (ms) time max (ms)
standard 2034 0.840 4.832 4.914 4.590 8.879
refined 2839 0.659 3.470 3.520 3.264 13.074
macro 132 91.272 74.566 75.897 70.850 84.675
q1-iso-qk 2060 0.787 4.755 4.852 4.572 8.539

The AD part in Gridap is probably very involved...