[Open] ruben-laso opened this issue 2 months ago
the multiplier is applied when we've run for less than min time to figure out how many iterations we need to get over the minimum time. we could go with a smaller multiplier, but it will take longer to get up to the minimum time, which (i believe) nets out at the same amount of computation at the end of the day.
The confusing point is that the library will (internally) aim to run the benchmark for "min_time" + 40%. Users might expect the benchmark to run just enough iterations to exceed the 'min_time' threshold.
What would happen if the estimation is wrong? Imagine an extreme situation where the first iteration(s) have very low performance due to a cold cache, but subsequent iterations are very fast.
What would be the implications of setting this number to 1.0? Would the function be called several times until the "min_time" is reached?
if it was set to 1.0 then we would never hit mintime.
is it true we'd run for min_time + 40%? i think if the first run we try runs longer than mintime then we'd be done. the only time the multiplier would kick in is if an attempt runs for shorter than that time, at which point we'd increase the time of the next run by 40%. the only time we'd run for a time mintime * 1.4 is if the initial run is almost-but-not-quite mintime.
As https://github.com/google/benchmark/blob/08fdf6eb84cc8a5b65d84041257c908de5879bf5/src/benchmark_runner.cc#L317 notes, we:
So yes, we do need that fudge factor, and yes, we clearly can run for 1.4x of the min time. In reality even more because of those initial runs with exponentially larger iteration times.
I'm guessing the factor of 1.4 is "derived" from the statistical distribution of the iteration timings. If we underestimate it, and the actual iteration times happen to be faster than the previous ones, we may fall slightly short of the min time (say 0.9x of min time), so we'll need to re-run again, and that will end up being ~1.9x overshoot, which is more wasteful than a 1.4x overshoot.
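For context, the prediction step being discussed boils down to roughly the following (a simplified sketch in the spirit of `PredictNumItersNeeded()`, not verbatim library code; names follow the thread):

```cpp
// Simplified sketch of the prediction step being discussed
// (not verbatim benchmark_runner.cc; names follow the thread).
#include <algorithm>
#include <cstdint>

int64_t PredictNumItersNeededSketch(double min_time_s, double last_run_s,
                                    int64_t last_run_iters) {
  // Aim 40% past min_time so that the next attempt very likely crosses it.
  double multiplier = min_time_s * 1.4 / std::max(last_run_s, 1e-9);
  // If the last attempt covered 10% of min_time or less, its timing is not
  // trusted yet: just grow the iteration count by 10x.
  const bool is_significant = (last_run_s / min_time_s) > 0.1;
  if (!is_significant) multiplier = 10.0;
  return static_cast<int64_t>(static_cast<double>(last_run_iters) * multiplier) + 1;
}
```

With `last_run_s` just below `min_time_s`, `multiplier` ends up just above 1.4, which is the "almost-but-not-quite" case where the full 1.4x overshoot happens.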
According to my calculations, in step 4 you would run somewhere between $1.4\times N$ and $14\times N$.

Since `GetMinTimeToApply()` > `i.seconds`, we always have `GetMinTimeToApply() / i.seconds` > 1. Given that `i.seconds / GetMinTimeToApply()` > 0.1, we have `GetMinTimeToApply() / i.seconds` < 10. Without any correction factor, that means that the `multiplier` would satisfy 1 < `multiplier` < 10. When a correction factor is applied, 1.4 in this case, the range shifts to 1.4 < `multiplier` < 14.

Let's follow the example where we fall slightly short of the min time, at 0.9x of it. Without any correction factor, we could just hit 1.0x of the min time with (roughly) `multiplier` $= 1/0.9 \simeq 1.11 \ll 1.4$.
> According to my calculations, in step 4 you would run somewhere between $1.4\times N$ and $14\times N$.
>
> Since `GetMinTimeToApply()` > `i.seconds`, we always have `GetMinTimeToApply() / i.seconds` > 1. Given that `i.seconds / GetMinTimeToApply()` > 0.1, we have `GetMinTimeToApply() / i.seconds` < 10. Without any correction factor, that means that the `multiplier` would satisfy 1 < `multiplier` < 10. When a correction factor is applied, 1.4 in this case, the range shifts to 1.4 < `multiplier` < 14.
Right.
> Let's follow the example where we fall slightly short of the min time, at 0.9x of it. Without any correction factor, we could just hit 1.0x of the min time with (roughly) `multiplier` $= 1/0.9 \simeq 1.11 \ll 1.4$.
But you do see the problem, right? We will then have spent 1.9x min_time.
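Spelling the arithmetic out: the failed attempt is discarded, and the re-run must itself reach the min time, so

$$\underbrace{0.9\,t_{\min}}_{\text{discarded attempt}} + \underbrace{\tfrac{1}{0.9}\cdot 0.9\,t_{\min}}_{\text{mandatory re-run}} \approx 1.9\,t_{\min} > 1.4\,t_{\min}.$$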
Very quick-n-dirty back of the napkin proof:
```
multiplier = 13.672021839106499
run_test(TIME_FUDJE = 1.4) = (1479, 1.5175746366315466 s)
multiplier = 9.686340899121179
multiplier = 1.0067496719348774
run_test(TIME_FUDJE = 1.0) = (2056, 2.1082144456986898 s)
```
So a `1.0` fudge factor ends up being slower.
NOTE: that model is not meant to be authoritative. I'm not sure if modelling system background load as a Bernoulli distribution is the right approach, but i do believe it is a reasonable first-order approximation.
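For anyone who wants to poke at this without the original script, here is a rough re-creation of the experiment in C++ (not the original model; the iteration cost, jitter, and Bernoulli "background load" parameters below are all made up):

```cpp
// Rough re-creation of the "back of the napkin" experiment (not the original
// script; the iteration cost, jitter and Bernoulli "background load" numbers
// below are made up). It simulates the retry loop and reports how much more
// than min_time we end up running for, for different fudge factors.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <random>

double SimulateOneBenchmark(double fudge, double min_time, std::mt19937& rng) {
  std::bernoulli_distribution loaded(0.1);             // 10% chance the system is busy
  std::normal_distribution<double> jitter(1.0, 0.05);  // per-attempt timing noise
  const double base_iter_time = 1e-3;                  // assumed 1 ms per iteration

  int64_t iters = 1;
  double total_spent = 0.0;
  for (;;) {
    double per_iter = base_iter_time * std::max(0.5, jitter(rng));
    if (loaded(rng)) per_iter *= 1.5;                   // background load slows us down
    const double run_time = per_iter * static_cast<double>(iters);
    total_spent += run_time;
    if (run_time >= min_time) break;                    // this attempt is long enough

    // Same shape as the prediction step: aim at fudge*min_time, but only
    // grow 10x while the attempt is at or below 10% of min_time.
    double multiplier = fudge * min_time / std::max(run_time, 1e-9);
    if (run_time / min_time <= 0.1) multiplier = 10.0;
    iters = static_cast<int64_t>(iters * multiplier) + 1;
  }
  return total_spent / min_time;                        // "time overspent" factor
}

int main() {
  std::mt19937 rng(42);
  const double min_time = 1.0;
  const int trials = 10000;
  for (double fudge : {1.0, 1.1, 1.4}) {
    double sum = 0.0;
    for (int t = 0; t < trials; ++t) sum += SimulateOneBenchmark(fudge, min_time, rng);
    std::cout << "fudge " << fudge << ": mean time overspent " << sum / trials << "x\n";
  }
  return 0;
}
```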
Some plotting:

Without touching the rest of the params there ^, with 10ms iterations (x 100 repetitions), we get:

*(plot not shown)*

but with 1ms iterations (x 1000 repetitions), we get:

*(plot not shown)*
Maybe the factor of 1.4 is too much, but i'm certain that 1.0 is just wrong.
And some more fancy-ness (most things are now sampled from distributions instead of being hardcoded; again, not 100% confident in the particular choices):

*(plot not shown)*

Again, i'm not quite confident that my "statistical model" is sufficiently precise for anything, but maybe `1.4` could be lowered somewhat, perhaps to `1.10`, theoretically.
I agree that the value 1.0 falls into the theoretical (non-real) world.
Looking at your plots, I also agree that something in the 1.02 - 1.10 range might be more sensible.
In any case, I think the best option could be to have an environment variable (or a program argument like `--benchmark_min_time_thres`) so the user can tweak this if needed, instead of a hardcoded constant value.
@dmah42 question: does google retain benchmark `.json` data for long-term performance tracking?
If it does, we could maybe at least derive a (with-noise) `dist_ITERATION_MEAN_TIME` from that real-world data.
We don't encode nearly enough info into that json (we are missing `min_time` and `use_real_time`), so it would be somewhat lossy, but i think we could side-step that issue.
I think only the following parts of said json output would be necessary:

```
{
  "run_name": <?>,
  "run_type": <?>,
  "threads": <?>,        % we were scaling by this factor until recently
  "iterations": <?>,
  "aggregate_name": <?>,
  "real_time": <?>,
  "cpu_time": <?>,
  "time_unit": <?>
}
```
I'm guessing you'll want to censor `"run_name"`; i think it would be fine to drop everything but `/min_time:<>` and `/use_real_time` from that string.
Out of all the benchmarks that are tracked, have all of them been run at least once since #1836? I think it would be least confusing to only look at the benchmarks *with* that change (or without it).
Is that possible?
@dmah42 (consider my previous question to be a theoretical question, not a request.)
@ruben-laso
> I agree that the value 1.0 falls into the theoretical (non-real) world.
> Looking at your plots, I also agree that something in the 1.02 - 1.10 range might be more sensible.
> In any case, I think the best option could be to have an environment variable (or a program argument like `--benchmark_min_time_thres`) so the user can tweak this if needed, instead of a hardcoded constant value.
That feels very much placebo-y. As far as i can tell, even with that fudge at 1.1, we still spend 1.5x the time we ideally need to (as per the final iteration time), which is not significantly better than what we'd spend at 1.4:
```
first(df, 100) = 3×4 DataFrame
 Row │ ITERATION_COUNT_MULTIPLIER  AUTHORITATIVE_TIME  TIME_FUDJE  time_overspent
     │ Int64                       Float64             Float64     Measurement…
─────┼────────────────────────────────────────────────────────────────────────────
   1 │                         10                 0.1         1.1       1.53±0.28
   2 │                         10                 0.1         1.4       1.82±0.28
   3 │                         10                 0.1         1.0       2.04±0.85
```
We could also lower `AUTHORITATIVE_TIME` and maybe change `ITERATION_COUNT_MULTIPLIER`, and lower the total time spent to as low as 1.2x of the ideal, but it really does feel ad-hoc to change anything without a better statistical model...
the right solution to this is to run until we reduce the confidence interval for the iterations within a run to a particular threshold (or max time to avoid infinite runs). anything other than this is just fine-tuning and is unlikely to scale across all the various use-cases.
if someone wants to reduce 1.4 to 1.2 or 1.1 or whatever then create a PR but i really don't think it matters all that much (and thank you @LebedevRI for basically proving that :D ).
> the right solution to this is to run until we reduce the confidence interval for the iterations within a run to a particular threshold (or max time to avoid infinite runs). anything other than this is just fine-tuning and is unlikely to scale across all the various use-cases.
> if someone wants to reduce 1.4 to 1.2 or 1.1 or whatever then create a PR but i really don't think it matters all that much (and thank you @LebedevRI for basically proving that :D ).
Yeah, that's my general understanding, too. I haven't yet tried modelling said different convergence approach.
Trivia bit: this `time_overspent` factor for the current implementation seems to be best described by an inverse Gaussian distribution.
Hello again,
I was playing with the internals of the library, and I think I now understand how it works.
From what I understood, every time we get a new prediction from `PredictNumItersNeeded()`, we "start from scratch": whatever measurements we had before are forgotten.
Let's use an example. Imagine we want to benchmark the function `foo()` for 1s. At some point, `PredictNumItersNeeded()` says that we should run the function 900 times. Unfortunately, that only gets us to 0.9s, so `PredictNumItersNeeded()` is called again and says we should have run `foo()` 1000 times to get to 1s.
Now, instead of doing 1000-900 = 100 extra iterations to fill the gap, we run the 1000 iterations in a single go, so by the end of the process we have run `foo()` more than 1900 times.
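Schematically, the flow I am describing looks something like this (my own paraphrase, not actual library code):

```cpp
// Paraphrase of the repetition loop as I understand it (not verbatim
// benchmark_runner.cc): every retry starts over, and only the final
// attempt's measurements are reported.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <functional>
#include <iostream>

struct Attempt {
  int64_t iters;
  double seconds;
};

Attempt RunNIterations(const std::function<void()>& fn, int64_t iters) {
  const auto start = std::chrono::steady_clock::now();
  for (int64_t i = 0; i < iters; ++i) fn();
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return {iters, elapsed.count()};
}

Attempt MeasureWithRestarts(const std::function<void()>& fn, double min_time) {
  int64_t iters = 1;
  for (;;) {
    Attempt a = RunNIterations(fn, iters);   // fresh attempt, nothing carried over
    if (a.seconds >= min_time) return a;     // only this attempt gets reported
    // The attempt is discarded and a larger iteration count is predicted
    // (same shape as PredictNumItersNeeded(): 1.4 fudge, 10x growth cap).
    const double multiplier = (a.seconds / min_time > 0.1)
                                  ? min_time * 1.4 / std::max(a.seconds, 1e-9)
                                  : 10.0;
    iters = static_cast<int64_t>(iters * multiplier) + 1;
  }
}

int main() {
  volatile double sink = 0.0;
  const Attempt a = MeasureWithRestarts(
      [&] { for (int i = 0; i < 1000; ++i) sink += i; }, /*min_time=*/0.5);
  std::cout << "reported: " << a.iters << " iterations over " << a.seconds << " s\n";
  return 0;
}
```

In the `foo()` example above, this loop would do the 900-iteration attempt, throw it away, and then do a full 1000-iteration attempt, for more than 1900 iterations in total.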
@LebedevRI That fits with what you mentioned here.
Is this correct?
If so, isn't this approach very inefficient? We have to choose between overestimating or starting from scratch many times.
yes that is correct, we start from scratch. i suppose we could change it to do an incremental run and check the running total of iterations and time instead. it would be a reasonably complex change but would improve the efficiency.
efficiency (in terms of how much we run) hasn't been a concern as we've focused on making the inner timing loop as fast as possible to get out of the way of the code under test.
> @LebedevRI That fits with what you mentioned here.
> Is this correct?
That is what i said, yes.
I'm still looking at that model, i need better estimates, but i think we might be able to get closer to the efficiency sweet-spot.
> yes that is correct, we start from scratch. i suppose we could change it to do an incremental run and check the running total of iterations and time instead. it would be a reasonably complex change but would improve the efficiency.
Can we, though? I was under the distinct impression that the current way it's done is intentional, and that it is a design guarantee that the benchmark-under-test will run for at least `min_time` consecutive seconds, and that we will report the measurements from that specific run.
As an idea, why not estimate the number of iterations needed to get just past the 10% threshold of the min time?
The way it is now, you multiply the iterations by 10 until surpassing the 10% threshold. Let's say you are between 2-9% of your min time. Then the multiplier will be 10, and you will move to the range of 20 to 90% of your min time.
Instead of multiplying by 10 "blindly", what about making an estimate of the iterations needed to get to, let's say, 15%? In that case, you have "wasted" only ~15% of the CPU time, but it is big enough to be considered "significant" by the current algorithm and to make a better prediction later (reducing the 40% overestimation to something smaller).
Maybe not the best solution, but I think this would alleviate the problem and should be easy to implement.
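A sketch of what I have in mind (illustrative only, not a patch against the real runner; the 15% probe target and the reduced final margin are just example values):

```cpp
// Sketch of the proposal above (illustrative, not a patch against the real
// runner). While the last attempt is at or below 10% of min_time, aim the
// next probe at ~15% of min_time using the measured per-iteration cost,
// instead of blindly multiplying the iteration count by 10.
#include <algorithm>
#include <cstdint>

int64_t PredictNumItersProposal(double min_time_s, double last_run_s,
                                int64_t last_run_iters) {
  const double per_iter_s =
      std::max(last_run_s, 1e-9) / static_cast<double>(last_run_iters);
  double target_s;
  if (last_run_s / min_time_s <= 0.1) {
    // Probe phase: ~15% of min_time is cheap, but already "significant",
    // so the next prediction can be based on a trustworthy measurement.
    target_s = 0.15 * min_time_s;
  } else {
    // Final phase: aim past min_time with a smaller safety margin than 1.4
    // (the exact value here is just an example).
    target_s = 1.1 * min_time_s;
  }
  return static_cast<int64_t>(target_s / per_iter_s) + 1;
}
```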
> Can we, though? I was under the distinct impression that the current way it's done is intentional, and that it is a design guarantee that the benchmark-under-test will run for at least `min_time` consecutive seconds, and that we will report the measurements from that specific run.
it's intentional in that we built it that way. but like i said above, if efficiency had been a concern we probably would have built it differently. this was the simplest approach at the time that worked.
why do you think the consecutive aspect matters, or reporting from a single run matters? it shouldn't, right?
I don't think it is really relevant in terms of the report, but it is in terms of computation time.
It is not only about the CPU time (and energy) spent without a real outcome, but also about the human time you need to wait for the full report. In my case, I have to benchmark a lot of cases and functions, and the full set of benchmarks might take several hours to complete. Saving time would be significant.
For example, below you can see the output of a little example to prove my point. Here I store:

- `LastRunTime`: execution time of the last batch (after the last call to `PredictNumItersNeeded()`). This number is approximately `Time * Iterations`.
- `TimesCalled`: the number of times the benchmark was executed. E.g., 8 would mean that we tried 8 different values of `iters` within the library.
- `TotalRunTime`: the total amount of time spent running each function. This is the sum over all the calls to `VectorLookup`, not only the reported ones.

You can see that to get a 1s benchmark (actually ~1.4s), we had to run each function for up to 2.50s. That's a 150% overhead! Several times we had to run each function for more than 2s. In my opinion, this overhead is far from negligible.
```
------------------------------------------------------------------------------------------------
Benchmark                 Time          CPU   Iterations   LastRunTime   TimesCalled   TotalRunTime
------------------------------------------------------------------------------------------------
VectorLookup/10         504 ns       504 ns      2752537       1.38822             8        1.96039
VectorLookup/20        3217 ns      3217 ns       453119       1.45753             7        1.80181
VectorLookup/30       10233 ns     10227 ns       140387       1.43665             7        2.54488
VectorLookup/40       22553 ns     22548 ns        61905       1.39615             6        1.64711
VectorLookup/50       42935 ns     42933 ns        31745       1.36298             6        1.85481
VectorLookup/60       73800 ns     73790 ns        19437       1.43445             6        2.23461
VectorLookup/70      119281 ns    119281 ns        11924       1.42231             5        1.55267
VectorLookup/80      173258 ns    173253 ns         8347       1.44619             5        1.63244
VectorLookup/90      244639 ns    244630 ns         5709       1.39665             5        1.66971
VectorLookup/100     336135 ns    336133 ns         4196       1.41043             5        1.78169
VectorLookup/110     429649 ns    429649 ns         3255       1.39851             5        1.8789
VectorLookup/120     575516 ns    575507 ns         2514       1.44685             5        2.06573
VectorLookup/130     720009 ns    719941 ns         1956       1.40834             5        2.20416
VectorLookup/140     965566 ns    965355 ns         1560       1.50629             5        2.50385
VectorLookup/150    1517023 ns   1516741 ns          911       1.38201             4        1.55215
VectorLookup/160    1603279 ns   1603062 ns          856       1.37241             4        1.55395
VectorLookup/170    1655967 ns   1655238 ns          837       1.38605             4        1.57171
VectorLookup/180    1932482 ns   1932454 ns          720       1.39139             4        1.60685
VectorLookup/190    2259932 ns   2259914 ns          621       1.40342             4        1.65399
VectorLookup/200    2631596 ns   2631435 ns          532       1.40001             4        1.69203
```
As i have said, i'm still looking at the current approach; we may be able to do better without any drastic changes.
(We are currently at the top-right corner of the bottom graph.)
One thing that i haven't thought about: there may be costly setup code outside of the benchmarked section, and its timings are not at all modelled here. Without accounting for that, decreasing `ITERATION_COUNT_MULTIPLIER` is perhaps a footgun.
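For example, something like this (illustrative only; `BuildLargeIndex` is a made-up expensive setup): everything before the `for (auto _ : state)` loop is re-executed on every restarted attempt, yet none of it shows up in the iteration timings that the prediction sees.

```cpp
// Illustrative only: BuildLargeIndex is a made-up expensive setup helper.
// The setup runs again on every restarted attempt, but only the timed loop
// below feeds into the iteration-count prediction.
#include <benchmark/benchmark.h>
#include <map>

static std::map<int, int> BuildLargeIndex(int n) {
  std::map<int, int> index;
  for (int i = 0; i < n; ++i) index[i] = i * i;
  return index;
}

static void BM_LookupWithCostlySetup(benchmark::State& state) {
  // Costly setup, outside of the benchmarked section.
  const auto index = BuildLargeIndex(static_cast<int>(state.range(0)));
  for (auto _ : state) {
    benchmark::DoNotOptimize(index.find(42));  // the only part that is timed
  }
}
BENCHMARK(BM_LookupWithCostlySetup)->Arg(1 << 20);
BENCHMARK_MAIN();
```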
> One thing that i haven't thought about: there may be costly setup code outside of the benchmarked section, and its timings are not at all modelled here. Without accounting for that, decreasing `ITERATION_COUNT_MULTIPLIER` is perhaps a footgun.
@LebedevRI that's a very good point; in that case the aim should be to reduce the number of times the function `PredictNumItersNeeded()` is called, right?
I implemented this idea:
> Instead of multiplying by 10 "blindly", what about making an estimate of the iterations needed to get to, let's say, 15%? In that case, you have "wasted" only ~15% of the CPU time, but it is big enough to be considered "significant" by the current algorithm and to make a better prediction later (reducing the 40% overestimation to something smaller).
It is a 3-phase approach to determine the number of iterations needed.
With the previous benchmark, it seems that the overhead is reduced from 50-150% to ~20%:
```
------------------------------------------------------------------------------------------------
Benchmark                 Time          CPU   Iterations   LastRunTime   TimesCalled   TotalRunTime
------------------------------------------------------------------------------------------------
VectorLookup/10         379 ns       379 ns      2755924       1.04467             8        1.2279
VectorLookup/20        2783 ns      2783 ns       361921       1.00727             7        1.1888
VectorLookup/30        8494 ns      8494 ns       123341       1.04762             7        1.29101
VectorLookup/40       20736 ns     20736 ns        49884       1.0344              6        1.20582
VectorLookup/50       39235 ns     39235 ns        27109       1.06364             6        1.25163
VectorLookup/60       68139 ns     68136 ns        15541       1.05895             6        1.28128
VectorLookup/70      104222 ns    104219 ns         9835       1.02502             5        1.19017
VectorLookup/80      160129 ns    160127 ns         6548       1.04853             5        1.21012
VectorLookup/90      218037 ns    218035 ns         4742       1.03394             5        1.20603
VectorLookup/100     305099 ns    305098 ns         3424       1.04466             5        1.22355
VectorLookup/110     401033 ns    401023 ns         2644       1.06033             5        1.25001
VectorLookup/120     540107 ns    540099 ns         2010       1.08562             5        1.28743
VectorLookup/130     652292 ns    652282 ns         1595       1.04041             5        1.26183
VectorLookup/140     835440 ns    835427 ns         1268       1.05934             5        1.29917
VectorLookup/150    1262388 ns   1262372 ns          808       1.02001             4        1.18841
VectorLookup/160    1380183 ns   1380157 ns          745       1.02824             4        1.18951
VectorLookup/170    1454011 ns   1453913 ns          707       1.02799             4        1.19032
VectorLookup/180    1740274 ns   1740252 ns          589       1.02502             4        1.19431
VectorLookup/190    2029587 ns   2029546 ns          510       1.03509             4        1.2063
VectorLookup/200    2393218 ns   2393183 ns          432       1.03387             4        1.20711
```
I would even say that step 1 could use the same kind of approach as the other steps, to reduce the number of times that the iterations have to be recalculated.
**Problem description**

In the function `PredictNumItersNeeded()` there is this `1.4` correction factor. This causes the time running the experiment to exceed the time specified by `--benchmark_min_time` by ~40%. Of course, `--benchmark_min_time` denotes the minimum amount of time to run the benchmark, but an overrun of 40% seems excessive. This is particularly relevant in supercomputers, where CPU time is expensive.

**Suggested solution**

I suggest either removing the correction factor or making it configurable (with a default value of 1.0).

**Example**

As shown in the following output (executed with `--benchmark_min_time=1s`), the real execution time is ~1.4s: $7039 \times 198481 = 1397107759$ ns, $8775 \times 160984 = 1412634600$ ns, ...

When executing the same code with `--benchmark_min_time=0.71s` ($1/1.4 \simeq 0.71$), the execution times are much closer to 1s: $7681 \times 134530 = 1033324930$ ns, $9265 \times 103410 = 958093650$ ns, ...