johnmyleswhite / Benchmarks.jl

A new benchmarking library for Julia

Strange effects with inlining and/or name resolution? #16

Closed mbauman closed 9 years ago

mbauman commented 9 years ago

This has bitten me more than once, and it's really tough to figure out what's going on. This package is really awesome, but I still sometimes run into this snag. Instead of splicing the expression to benchmark directly into the benchmarking function, could we wrap it in a @noinline function first? I ran into this while testing out the new checkbounds API at https://github.com/JuliaLang/julia/pull/11895#issuecomment-130335729. You won't see these sorts of crazy effects with the stock checkbounds, but it makes for a really good demonstration:

julia> A = zeros(10,10);
       const B = zeros(10,10);

julia> @benchmark checkbounds(A, 1)
================ Benchmark Results ========================
   Average elapsed time: 386.45 ns
     95% CI for average: [376.83 ns, 396.06 ns]
   Minimum elapsed time: 393.15 ns
                GC time: 0.00%
       Memory allocated: 16 bytes
  Number of allocations: 1 allocations
      Number of samples: 2601
        R² of OLS model: 0.957
Time used for benchmark: 0.12s
            Precompiled: true
       Multiple samples: true
       Search performed: true

julia> @benchmark checkbounds(B, 1)
================ Benchmark Results ========================
   Average elapsed time: 1.12 ns
     95% CI for average: [1.10 ns, 1.14 ns]
   Minimum elapsed time: 1.52 ns
                GC time: 0.00%
       Memory allocated: 0 bytes
  Number of allocations: 0 allocations
      Number of samples: 6201
        R² of OLS model: 0.956
Time used for benchmark: 0.05s
            Precompiled: true
       Multiple samples: true
       Search performed: true

julia> @noinline f(A) = checkbounds(A, 1)
f (generic function with 1 method)

julia> @benchmark f(A)
================ Benchmark Results ========================
   Average elapsed time: 21.45 ns
     95% CI for average: [21.06 ns, 21.83 ns]
   Minimum elapsed time: 22.36 ns
                GC time: 0.00%
       Memory allocated: 0 bytes
  Number of allocations: 0 allocations
      Number of samples: 5801
        R² of OLS model: 0.950
Time used for benchmark: 0.06s
            Precompiled: true
       Multiple samples: true
       Search performed: true

julia> @benchmark f(B)
================ Benchmark Results ========================
   Average elapsed time: 3.74 ns
     95% CI for average: [3.68 ns, 3.80 ns]
   Minimum elapsed time: 3.99 ns
                GC time: 0.00%
       Memory allocated: 0 bytes
  Number of allocations: 0 allocations
      Number of samples: 6701
        R² of OLS model: 0.952
Time used for benchmark: 0.06s
            Precompiled: true
       Multiple samples: true
       Search performed: true

Whoa now, will the real result please stand up? As far as I can figure out, this is happening due to a crazy confluence of things:

So… how do we resolve this? Putting expressions into a @noinline'ed function may help resolve #5 by hiding a bit more from LLVM's optimizations. Is it possible to determine non-constant bindings at macro expansion time? It seems like that's what we'd need to mitigate the name-resolution craziness.
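For concreteness, here's the kind of check I have in mind. This is just a sketch, not anything in this package, and it assumes the macro can see the calling module (spelled __module__ here; older Julia would need a different spelling):

    # Rough sketch only (not part of Benchmarks.jl): ask at macro-expansion
    # time whether a symbol is a constant binding in the calling module.
    macro isconstbinding(sym::Symbol)
        return isdefined(__module__, sym) && isconst(__module__, sym)
    end

    A = zeros(10, 10)
    const B = zeros(10, 10)

    @isconstbinding(A)   # false -- plain global, not a constant binding
    @isconstbinding(B)   # true  -- const global, the compiler knows its type

A heuristic along those lines could decide which bindings get hoisted into arguments of the generated wrapper instead of being spliced in as globals.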

yuyichao commented 9 years ago

This is what I used when I realized this problem. (Incidentally, it is also what I used to benchmark bounds checking.)

Basically the point is to evaluate the arguments first and only benchmark the outermost function call. Would this be too restrictive? (Also, with another wrapper it may not be necessary to evaluate the arguments at macro expansion time.)
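Roughly, the shape of it is something like this (just an illustration of the idea, not my actual code, and nothing here stops the compiler from deleting the call entirely if it can prove it has no effect, which is exactly the kind of problem this issue is about):

    # Sketch only: arguments are evaluated once, up front, and the timed loop
    # contains nothing but the outermost call through a @noinline barrier, so
    # the arguments are well-typed locals rather than global lookups.
    @noinline function time_outer_call(f, args...)
        n = 1_000_000
        t0 = time_ns()
        for _ in 1:n
            f(args...)                  # the only thing inside the timed region
        end
        return (time_ns() - t0) / n     # rough average, in ns per call
    end

    A = zeros(10, 10)
    time_outer_call(checkbounds, A, 1)  # A and 1 were evaluated before timing began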

Another issue with this approach is that it adds extra overhead if the function you want to benchmark is not inlined by the compiler. IIRC, when I was playing with rand, this overhead completely shadowed everything else. A call-site inlining annotation would be nice…
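Something along these lines shows the problem (illustrative only; the exact numbers depend on all of the effects above):

    # The wrapper's cost is roughly what an empty non-inlined call costs,
    # and that can swamp a very cheap candidate like rand().
    @noinline empty_call() = nothing
    @noinline wrapped_rand() = rand()

    @benchmark empty_call()     # roughly the cost of one non-inlined call
    @benchmark wrapped_rand()   # rand() plus that call overhead
    @benchmark rand()           # spliced directly, subject to the effects above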

mbauman commented 9 years ago

Yes, I thought about doing that, too, but I wanted to retain the freedom of arbitrary expressions that the @benchmark macro currently provides. My pull request over at #17 could easily be modified to only accept a single function call and evaluate all arguments within the "outer" function. Perhaps that could be provided by a different macro (@bench_function or somesuch). Or JMW could decide that it's best to only allow benchmarking of functions and we could then change the API.

As for the additional function call through the @noinline'ed inner function, I actually like that it is always there. It allows comparing functions that inline against those that don't (since the latter pay the cost of two function calls instead of just one).

After using it for a little while, I actually really like the non-constant-binding heuristic in #17. It almost always just does what I want. And it's really easy to change just by using a temporary variable or declaring something as const.
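For example (hand-wavy, the exact rules live in #17), flipping between the two behaviors is just a matter of how you bind the value: declare it const if you want the binding treated as a constant, or copy it into a plain temporary if you don't.

    # Illustrative only; the precise heuristic is in #17.
    A = zeros(10, 10)

    const A_const = A                    # const binding: can be treated as a known constant
    @benchmark checkbounds(A_const, 1)

    A_tmp = copy(A_const)                # plain non-const global: treated as a runtime value
    @benchmark checkbounds(A_tmp, 1)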

johnmyleswhite commented 9 years ago

Should be resolved now.