The following benchmark gives a very unexpected result.
julia> using Benchmarks
julia> @benchmark rand()
================ Benchmark Results ========================
Time per evaluation: 5.93 ns [5.86 ns, 6.01 ns]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
Memory allocated: 0.00 bytes
Number of allocations: 0 allocations
Number of samples: 11001
Number of evaluations: 71490001
R² of OLS model: 0.951
Time spent benchmarking: 0.65 s
julia> @benchmark rand(Float32)
================ Benchmark Results ========================
Time per evaluation: 51.57 ns [50.89 ns, 52.24 ns]
Proportion of time in GC: 0.18% [0.00%, 0.38%]
Memory allocated: 16.00 bytes
Number of allocations: 1 allocations
Number of samples: 10601
Number of evaluations: 48829501
R² of OLS model: 0.952
Time spent benchmarking: 2.91 s
The reason is that type widening causes the inner function for rand(Float32) to be specialized on Tuple{DataType} rather than Tuple{Type{Float32}}, which introduces all kinds of overhead.
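To see the effect in isolation, here is a minimal sketch (hypothetical helper names, written in current Julia syntax, not Base's actual code) contrasting a widened helper with a properly specialized one. The widened version loses the concrete type, so its result is boxed, which likely accounts for the 16 bytes / 1 allocation above:

# The DataType-typed argument gets a single shared specialization,
# so one(T) is a dynamic dispatch and the return type is Any.
widened(T::DataType) = one(T)

# The Type{T} signature gets one specialization per concrete type,
# so the return type is inferred exactly.
specialized(::Type{T}) where {T} = one(T)

widened(Float32)      # inferred as Any; boxed result, allocates
specialized(Float32)  # inferred as Float32; no allocation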
The simplest trick I can think of to avoid this issue is to use a staged inner function (which will turn off type widening...). IIUC, the call site of the inner function should always have the concrete type inferred, so this shouldn't introduce additional overhead.
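For concreteness, a hedged sketch of that trick (hypothetical name, not the actual Base change; staged functions are spelled @generated in current Julia):

@generated function inner_rand(::Type{T}) where {T<:AbstractFloat}
    # This body runs at compile time with the concrete T; the
    # expression it returns becomes the method body compiled for
    # Tuple{Type{T}}, so the widening heuristic does not apply.
    :(convert(T, rand()))  # stand-in for the real type-specific sampler
end

inner_rand(Float32)  # compiled and inferred for Tuple{Type{Float32}}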