trixi-framework / Trixi.jl

Trixi.jl: Adaptive high-order numerical simulations of conservation laws in Julia
https://trixi-framework.github.io/Trixi.jl
MIT License

Evaluate and implement `EllipsisNotation.jl` for dimension-agnostic code #179

Closed: sloede closed this issue 4 years ago

sloede commented 4 years ago

The package EllipsisNotation.jl allows using `..` in place of an arbitrary number of `:` slice operators in a multi-dimensional array; the `..` means "all of the dimensions before (or after) the given indices". This could be very helpful in simplifying code like this

      if ndims(dg) == 2
        dg.elements.u[v, :, :, :] = read(file["variables_$v"])
      elseif ndims(dg) == 3
        dg.elements.u[v, :, :, :, :] = read(file["variables_$v"])
      else

to this

        dg.elements.u[v, ..] = read(file["variables_$v"])

However, before using this, we should first benchmark it for performance (CPU time and memory usage), as usually there is no free lunch.
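For readers unfamiliar with the notation, the expansion can be spelled out in plain Base Julia. This is a minimal sketch (not Trixi code; array sizes are arbitrary) of what `u[v, ..]` resolves to:

```julia
# Minimal sketch (plain Base Julia, not Trixi code): what `u[v, ..]` from
# EllipsisNotation.jl expands to, written out with an explicit colon tuple.
u = rand(3, 4, 5, 6)   # e.g. a (variable, node, node, element) layout
v = 1

# `u[v, ..]` fills all remaining dimensions with colons:
colons = ntuple(_ -> Colon(), ndims(u) - 1)
slice = u[v, colons...]   # identical to u[v, :, :, :]

println(size(slice))   # (4, 5, 6)
```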

ranocha commented 4 years ago

A typical REPL session for monitoring the performance of some code in Trixi while modifying it could look like this:

julia> using Revise, BenchmarkTools; using Trixi

julia> Trixi.init_parameters("examples/2d/parameters_sedov_blast_wave_shockcapturing_amr.toml");

julia> mesh, dg, time_parameters, time_integration_function = Trixi.init_simulation();

julia> @benchmark Trixi.prolong2mortars!(
           $(dg),
           $(dg.mortar_type))

One could also call Trixi.run_simulation(mesh, dg, time_parameters, time_integration_function) first, so that the benchmark does not operate only on the initial condition.

efaulhaber commented 4 years ago

It seems like code that can be simplified using EllipsisNotation does not appear in performance-critical parts. That's why I first tested basic examples of EllipsisNotation without Trixi. I could not find any performance differences (time or memory) between classical notation and EllipsisNotation.

In Trixi, code that can be simplified using EllipsisNotation appears in run.jl (only once, to check the steady-state integration residual), io.jl (for restart files), and timedisc_euler_gravity.jl. The latter seems to me like the only candidate worth performance testing.

I tested the function update_gravity! where

if ndims(solver) == 2
  if maximum(abs, @views solver.elements.u_t[1, :, :, :]) <= solver.equations.resid_tol
    # println("  Gravity solution tolerance ", solver.equations.resid_tol,
    #         " reached in iterations ",iteration)
    finalstep = true
  end
else
  if maximum(abs, @views solver.elements.u_t[1, :, :, :, :]) <= solver.equations.resid_tol
    # println("  Gravity solution tolerance ", solver.equations.resid_tol,
    #         " reached in iterations ",iteration)
    finalstep = true
  end
end

can be simplified to

if maximum(abs, @views solver.elements.u_t[1, ..]) <= solver.equations.resid_tol
  # println("  Gravity solution tolerance ", solver.equations.resid_tol,
  #         " reached in iterations ",iteration)
  finalstep = true
end

with BenchmarkTools. I extracted the line

update_gravity!(solver_gravity, solver_euler.elements.u, gravity_parameters)

in run_euler_gravity.jl to benchmark with

b = @benchmarkable update_gravity!($solver_gravity, $solver_euler.elements.u, $gravity_parameters)

and ran the benchmark in the REPL.

The result with classical notation looks like this:

julia> run(b)
BenchmarkTools.Trial:
  memory estimate:  742.61 KiB
  allocs estimate:  4462
  --------------
  minimum time:     981.200 μs (0.00% GC)
  median time:      1.067 ms (0.00% GC)
  mean time:        1.338 ms (5.17% GC)
  maximum time:     65.855 ms (0.00% GC)
  --------------
  samples:          3733
  evals/sample:     1

The result with EllipsisNotation looks like this:

julia> run(b)
BenchmarkTools.Trial: 
  memory estimate:  743.80 KiB
  allocs estimate:  4509      
  --------------
  minimum time:     989.801 μs (0.00% GC)
  median time:      1.078 ms (0.00% GC)
  mean time:        1.296 ms (5.46% GC)
  maximum time:     35.556 ms (0.00% GC)
  --------------
  samples:          3861
  evals/sample:     1

While the times are similar, the version with EllipsisNotation causes a few more allocations. Should we do further testing? If so, how exactly would we do it?

sloede commented 4 years ago

This already looks very promising, thanks! Have you also tried an "artificial" test where the compiler does not know the type of the array being indexed? I am thinking of something along the following lines:

using EllipsisNotation

struct Wrapper
  wrapped::Any
end

function foo()
  u = rand(5, 5, 5, 5, 100)
  w = Wrapper(u)

  bar(w)
end

function bar(w)
  @views maximum(abs, w.wrapped[1, ..])
end

I am not sure if I constructed the example correctly, but I would find it interesting to see if the unknown type of w.wrapped causes an additional and measurable performance impact for the ellipsis notation.
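As a side note on where the instability in this example comes from: the `::Any` field annotation is what hides the array type from the compiler, and a parametric wrapper would keep it concrete. A small illustrative sketch (`AnyWrapper` and `TypedWrapper` are hypothetical names used only for this comparison, not Trixi code):

```julia
# Illustrative sketch (not Trixi code): the `::Any` field hides the array
# type from the compiler, while a parametric wrapper keeps it concrete.
struct AnyWrapper
    wrapped::Any
end

struct TypedWrapper{T}
    wrapped::T
end

w  = AnyWrapper(rand(5, 5))
tw = TypedWrapper(rand(5, 5))

# Only in the parametric case does the field type carry the concrete array type:
println(fieldtype(AnyWrapper, :wrapped))    # Any
println(fieldtype(typeof(tw), :wrapped))    # Matrix{Float64}
```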

efaulhaber commented 4 years ago

With your example I get

julia> @benchmark foo()
BenchmarkTools.Trial:
  memory estimate:  488.75 KiB
  allocs estimate:  11
  --------------
  minimum time:     42.300 μs (0.00% GC)
  median time:      50.600 μs (0.00% GC)
  mean time:        93.486 μs (21.82% GC)
  maximum time:     4.911 ms (96.74% GC)
  --------------
  samples:          10000
  evals/sample:     1

with classical notation and

julia> @benchmark foo()
BenchmarkTools.Trial:
  memory estimate:  489.55 KiB
  allocs estimate:  50
  --------------
  minimum time:     45.500 μs (0.00% GC)
  median time:      53.099 μs (0.00% GC)
  mean time:        100.579 μs (21.32% GC)
  maximum time:     5.009 ms (96.54% GC)
  --------------
  samples:          10000
  evals/sample:     1

with EllipsisNotation.

Classical notation is about 5% faster and makes fewer allocations, with about the same memory estimate.

sloede commented 4 years ago

Hm. Thinking about it, my example code was not very good... now you're benchmarking not just the ellipsis notation but also general allocations, which can hide a lot. Maybe something more along these lines:

using BenchmarkTools, EllipsisNotation

struct Wrapper
  wrapped::Any
end

function foo()
  u = rand(5, 5, 5, 5, 100)
  w = Wrapper(u)

  @benchmark bar1($w)
  @benchmark bar2($w)
end

function bar1(w)
  @views maximum(abs, w.wrapped[1, ..])
end

function bar2(w)
  @views maximum(abs, w.wrapped[1, :, :, :, :])
end

Or am I still missing something here, @ranocha ?

ranocha commented 4 years ago

Needs to be

function foo()
  u = rand(5, 5, 5, 5, 100)
  w = Wrapper(u)

  display(@benchmark bar1($w))
  display(@benchmark bar2($w))
end

(Without display, the results of @benchmark inside a function are not shown.)

efaulhaber commented 4 years ago

Okay, now there is quite a difference:

bar1
BenchmarkTools.Trial: 
  memory estimate:  1.11 KiB
  allocs estimate:  47
  --------------
  minimum time:     8.800 μs (0.00% GC) 
  median time:      9.100 μs (0.00% GC) 
  mean time:        9.218 μs (0.00% GC) 
  maximum time:     22.333 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     3

bar2
BenchmarkTools.Trial: 
  memory estimate:  320 bytes
  allocs estimate:  8
  --------------
  minimum time:     5.817 μs (0.00% GC)
  median time:      5.950 μs (0.00% GC)
  mean time:        5.974 μs (0.00% GC)
  maximum time:     10.967 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     6

ranocha commented 4 years ago

But why do you want to know what's going on when the compiler doesn't know the type of the array? That case should hopefully not occur in any performance-critical part...

ranocha commented 4 years ago

Anyway, there is certainly a difference between bar1 and bar2, even for a type-stable wrapper.

ranocha commented 4 years ago

And that difference seems mainly due to the creation of the view.
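The point that both forms reduce over the same view can be sketched in plain Base Julia (illustrative only; array sizes are arbitrary, and the generic colon tuple stands in for what `..` constructs):

```julia
# Sketch (plain Base, arbitrary sizes): the reduction itself is identical
# either way; any cost difference lies in how the view is constructed.
u = rand(4, 5, 5, 5, 10)

m1 = maximum(abs, @view(u[1, :, :, :, :]))                  # explicit colons
m2 = maximum(abs, view(u, 1, ntuple(_ -> Colon(), 4)...))   # colon tuple built generically

println(m1 == m2)   # true
```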

sloede commented 4 years ago

But why do you want to know what's going on when the compiler doesn't know the type of the array?

Mainly because my intuition about performance in Julia has been wrong so many times already that I cannot count ;-)

OK, so I also did some additional checks, and it seems like the memory usage is ~3x higher for the ellipsis version than for the colon version. This seems to be independent of the dimension lengths (as expected) and of the number of dimensions of the arrays (which I was not sure about). Similarly, the compute time is always slightly higher for the ellipsis version, but not by much.

Could you maybe do one last check, @erik-f: add a @timeit timer() ... macro around the steady-state check in run.jl, and then test the performance for a hypdiff setup in 2D and 3D? Since, as you correctly noted, the steady-state check is the only potentially performance-critical part, if there is no problem there, I think we should be OK overall.

@erik-f Just make sure that the problem size (number of elements and/or polydeg) is sufficiently large to get accurate measurements from the timers. According to https://github.com/KristofferC/TimerOutputs.jl#overhead, the overhead is ~0.25 microseconds, so if a single measurement on average takes more than 50 microseconds, you should be ok.
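A minimal sketch of such a timed check, assuming TimerOutputs.jl is installed; `steady_state_check`, the array layout, and the local `to` (standing in for Trixi's `timer()`) are hypothetical, not the actual code in run.jl:

```julia
# Hedged sketch (assumes TimerOutputs.jl is installed): timing a
# steady-state residual check. Names here are illustrative stand-ins.
using TimerOutputs

const to = TimerOutput()

function steady_state_check(u_t, resid_tol)
    # @timeit records time and allocations for the named section
    @timeit to "steady-state check" begin
        maximum(abs, @view(u_t[1, :, :, :])) <= resid_tol
    end
end

u_t = rand(4, 5, 5, 20) .* 1e-12   # tiny residuals for this demo
println(steady_state_check(u_t, 1e-8))   # true

# print_timer(to) would then show the accumulated measurements
```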

efaulhaber commented 4 years ago

With classical notation I get:

julia> Trixi.run("examples/3d/parameters_hyp_diff_llf.toml", initial_refinement_level=4, t_end=0.2)
[...]
 --------------------------------------------------------------------------------------------
                   Trixi.jl                          Time                   Allocations
                                             ----------------------   -----------------------
              Tot / % measured:                   5.24s / 90.4%            428MiB / 51.8%

 Section                             ncalls     time   %tot     avg     alloc   %tot      avg
 --------------------------------------------------------------------------------------------
 main loop                                1    4.70s  99.4%   4.70s    196MiB  88.3%   196MiB
   rhs                                  505    3.41s  72.1%  6.76ms   66.8MiB  30.1%   136KiB
     interface flux                     505    717ms  15.2%  1.42ms   6.67MiB  3.01%  13.5KiB
     volume integral                    505    643ms  13.6%  1.27ms   6.74MiB  3.04%  13.7KiB
     source terms                       505    584ms  12.3%  1.16ms   6.69MiB  3.01%  13.6KiB
     prolong2interfaces                 505    448ms  9.47%   888μs   6.67MiB  3.00%  13.5KiB
     surface integral                   505    415ms  8.76%   821μs   6.71MiB  3.02%  13.6KiB
     reset ∂u/∂t                        505    402ms  8.49%   796μs     0.00B  0.00%    0.00B
     Jacobian                           505    100ms  2.10%   197μs   6.66MiB  3.00%  13.5KiB
     prolong2boundaries                 505   36.8ms  0.78%  72.9μs   6.64MiB  2.99%  13.5KiB
     boundary flux                      505   26.9ms  0.57%  53.2μs   6.67MiB  3.00%  13.5KiB
     prolong2mortars                    505   18.0ms  0.38%  35.6μs   6.65MiB  2.99%  13.5KiB
     mortar flux                        505   16.8ms  0.36%  33.3μs   6.72MiB  3.02%  13.6KiB
     positivity preserving limiter      505    247μs  0.01%   489ns     0.00B  0.00%    0.00B
   Runge-Kutta step                     505    661ms  14.0%  1.31ms   6.68MiB  3.01%  13.5KiB
   analyze solution                       1    362ms  7.66%   362ms   62.6MiB  28.2%  62.6MiB
   I/O                                    2   53.1ms  1.12%  26.5ms   60.0MiB  27.0%  30.0MiB
-  steady-state check                   101   51.0ms  1.08%   505μs     0.00B  0.00%    0.00B
   calc_dt                              101    981μs  0.02%  9.71μs     0.00B  0.00%    0.00B
 mesh creation                            1   29.2ms  0.62%  29.2ms   26.0MiB  11.7%  26.0MiB
   creation                               1   19.3ms  0.41%  19.3ms   15.3MiB  6.87%  15.3MiB
   initial refinement                     1   9.87ms  0.21%  9.87ms   10.7MiB  4.82%  10.7MiB
   refinement patches                     1   3.60μs  0.00%  3.60μs     80.0B  0.00%    80.0B
   coarsening patches                     1    800ns  0.00%   800ns     80.0B  0.00%    80.0B
 read parameter file                      1    212μs  0.00%   212μs   17.8KiB  0.01%  17.8KiB
 --------------------------------------------------------------------------------------------

With EllipsisNotation I get:

julia> Trixi.run("examples/3d/parameters_hyp_diff_llf.toml", initial_refinement_level=4, t_end=0.2)
[...]
 --------------------------------------------------------------------------------------------
                   Trixi.jl                          Time                   Allocations
                                             ----------------------   -----------------------
              Tot / % measured:                   5.15s / 93.2%            407MiB / 54.6%

 Section                             ncalls     time   %tot     avg     alloc   %tot      avg
 --------------------------------------------------------------------------------------------
 main loop                                1    4.73s  98.6%   4.73s    196MiB  88.3%   196MiB
   rhs                                  505    3.53s  73.6%  6.99ms   66.8MiB  30.1%   136KiB
     interface flux                     505    720ms  15.0%  1.43ms   6.68MiB  3.01%  13.5KiB
     volume integral                    505    658ms  13.7%  1.30ms   6.74MiB  3.03%  13.7KiB
     source terms                       505    596ms  12.4%  1.18ms   6.69MiB  3.01%  13.6KiB
     prolong2interfaces                 505    470ms  9.81%   931μs   6.67MiB  3.00%  13.5KiB
     surface integral                   505    448ms  9.35%   888μs   6.71MiB  3.02%  13.6KiB
     reset ∂u/∂t                        505    410ms  8.56%   812μs     0.00B  0.00%    0.00B
     Jacobian                           505   91.9ms  1.92%   182μs   6.66MiB  3.00%  13.5KiB
     prolong2boundaries                 505   48.2ms  1.01%  95.4μs   6.64MiB  2.99%  13.5KiB
     boundary flux                      505   36.9ms  0.77%  73.2μs   6.67MiB  3.00%  13.5KiB
     prolong2mortars                    505   20.4ms  0.43%  40.5μs   6.65MiB  2.99%  13.5KiB
     mortar flux                        505   20.3ms  0.42%  40.2μs   6.72MiB  3.02%  13.6KiB
     positivity preserving limiter      505    253μs  0.01%   502ns     0.00B  0.00%    0.00B
   Runge-Kutta step                     505    676ms  14.1%  1.34ms   6.68MiB  3.00%  13.5KiB
   analyze solution                       1    247ms  5.15%   247ms   62.6MiB  28.2%  62.6MiB
   I/O                                    2   68.4ms  1.43%  34.2ms   60.0MiB  27.0%  30.0MiB
-  steady-state check                   101   52.6ms  1.10%   521μs    117KiB  0.05%  1.16KiB
   calc_dt                              101    976μs  0.02%  9.66μs     0.00B  0.00%    0.00B
 mesh creation                            1   66.1ms  1.38%  66.1ms   26.0MiB  11.7%  26.0MiB
   creation                               1   56.5ms  1.18%  56.5ms   15.3MiB  6.87%  15.3MiB
   initial refinement                     1   9.58ms  0.20%  9.58ms   10.7MiB  4.82%  10.7MiB
   refinement patches                     1   4.00μs  0.00%  4.00μs     80.0B  0.00%    80.0B
   coarsening patches                     1    799ns  0.00%   799ns     80.0B  0.00%    80.0B
 read parameter file                      1    785μs  0.02%   785μs   17.8KiB  0.01%  17.8KiB
 --------------------------------------------------------------------------------------------

So there is no time difference, but EllipsisNotation makes a few allocations (compared to none at all with classical notation). The difference here is probably due to the creation of the view, as @ranocha pointed out.

sloede commented 4 years ago

Sounds good! However, I'm really wondering: where do the >1 KiB of allocations per iteration come from? This seems like a ridiculously high number for a single view, so there must be a lot going on behind the scenes that cannot be optimized away with the ellipsis notation. Any ideas?

ranocha commented 4 years ago

These basically come from EllipsisNotation.jl and to_indices for arrays with sufficiently many dimensions (at least 4). There seems to be a cutoff for some inlining/tuple magic that works for 3 indices but not for 4 or more. Might be good to open an issue there.
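To see where `to_indices` enters the picture, plain-colon indexing can be normalized by hand in Base Julia (a sketch of the mechanism, not of the package internals):

```julia
# Sketch (plain Base): every indexing expression passes through `to_indices`,
# which normalizes indices before the actual access. EllipsisNotation.jl hooks
# into this mechanism to expand `..`; the reported extra allocations appeared
# in its methods for arrays with four or more dimensions.
u = rand(2, 3, 4, 5)

inds = to_indices(u, (1, :, :, :))
println(length(inds))   # 4 normalized indices
println(inds[1])        # 1
```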

ranocha commented 4 years ago

Xref https://github.com/ChrisRackauckas/EllipsisNotation.jl/issues/20

sloede commented 4 years ago

@ranocha I'd favor including it right now, as I think the performance impact is sufficiently small, and since this would allow me to advance with #167. Would you be OK with that?

ranocha commented 4 years ago

Sure, go ahead, please.

sloede commented 4 years ago

@erik-f can you please go ahead and create a PR to introduce the new notation in the non-dimension-specific locations as discussed?

efaulhaber commented 4 years ago

Alright.