[Open] ruben-laso opened this issue 2 months ago
the multiplier is applied when we've run for less than min time to figure out how many iterations we need to get over the minimum time. we could go with a smaller multiplier, but it will take longer to get up to the minimum time, which (i believe) nets out at the same amount of computation at the end of the day.
The confusing point is that the library will (internally) aim to run the benchmark for "min_time" + 40%. Users might expect the benchmark to run just enough iterations to exceed the 'min_time' threshold.
What would happen if the estimation is wrong? Imagine an extreme situation where the first iteration(s) have very low performance due to a cold cache, but subsequent iterations are very fast.
What would be the implications of setting this number to 1.0? Would the function be called several times until the "min_time" is reached?
if it was set to 1.0 then we would never hit mintime.
is it true we'd run for min_time + 40%? i think if the first run we try runs longer than mintime then we'd be done. the only time the multiplier would kick in is if an attempt runs for shorter than that time, at which point we'd increase the time of the next run by 40%. the only time we'd run for a time mintime * 1.4 is if the initial run is almost-but-not-quite mintime.
As https://github.com/google/benchmark/blob/08fdf6eb84cc8a5b65d84041257c908de5879bf5/src/benchmark_runner.cc#L317 notes, we:
So yes, we do need that fudge factor, and yes, we clearly can run for 1.4x of the min time. In reality even more because of those initial runs with exponentially larger iteration times.
I'm guessing the factor of 1.4 is "derived" from the statistical distribution of the iteration timings. If we underestimate it, and the actual iteration times happen to be faster than the previous ones, we may fall slightly short of the min time (say 0.9x of min time), so we'll need to re-run again, and that will end up being ~1.9x overshoot, which is more wasteful than a 1.4x overshoot.
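For context, the prediction step being discussed boils down to roughly the following (a simplified sketch in the spirit of `PredictNumItersNeeded()`, not verbatim library code; names follow the thread):

```cpp
// Simplified sketch of the prediction step being discussed
// (not verbatim benchmark_runner.cc; names follow the thread).
#include <algorithm>
#include <cstdint>

int64_t PredictNumItersNeededSketch(double min_time_s, double last_run_s,
                                    int64_t last_run_iters) {
  // Aim 40% past min_time so that the next attempt very likely crosses it.
  double multiplier = min_time_s * 1.4 / std::max(last_run_s, 1e-9);
  // If the last attempt covered 10% of min_time or less, its timing is not
  // trusted yet: just grow the iteration count by 10x.
  const bool is_significant = (last_run_s / min_time_s) > 0.1;
  if (!is_significant) multiplier = 10.0;
  return static_cast<int64_t>(static_cast<double>(last_run_iters) * multiplier) + 1;
}
```

With `last_run_s` just below `min_time_s`, `multiplier` ends up just above 1.4, which is the "almost-but-not-quite" case where the full 1.4x overshoot happens.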
According to my calculations, in step 4 you would run somewhere between $1.4\times N$ and $14\times N$.

Since `GetMinTimeToApply()` > `i.seconds`, we always have `GetMinTimeToApply() / i.seconds` > 1. Given that `i.seconds / GetMinTimeToApply()` > 0.1, we have `GetMinTimeToApply() / i.seconds` < 10. Without any correction factor, that means that the `multiplier` would satisfy 1 < `multiplier` < 10. When a correction factor is applied, 1.4 in this case, the range shifts to 1.4 < `multiplier` < 14.

Let's follow the example where we fall slightly short of the min time, at 0.9x of it. Without any correction factor, we could just hit 1.0x of the min time with (roughly) `multiplier` $= 1/0.9 \simeq 1.11 \ll 1.4$.
> According to my calculations, in step 4 you would run somewhere between $1.4\times N$ and $14\times N$.
>
> Since `GetMinTimeToApply()` > `i.seconds`, we always have `GetMinTimeToApply() / i.seconds` > 1. Given that `i.seconds / GetMinTimeToApply()` > 0.1, we have `GetMinTimeToApply() / i.seconds` < 10. Without any correction factor, that means that the `multiplier` would satisfy 1 < `multiplier` < 10. When a correction factor is applied, 1.4 in this case, the range shifts to 1.4 < `multiplier` < 14.
Right.
> Let's follow the example where we fall slightly short of the min time, at 0.9x of it. Without any correction factor, we could just hit 1.0x of the min time with (roughly) `multiplier` $= 1/0.9 \simeq 1.11 \ll 1.4$.
But you do see the problem, right? We will then have spent 1.9x min_time.
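Spelling the arithmetic out: the failed attempt is discarded, and the re-run must itself reach the min time, so

$$\underbrace{0.9\,t_{\min}}_{\text{discarded attempt}} + \underbrace{\tfrac{1}{0.9}\cdot 0.9\,t_{\min}}_{\text{mandatory re-run}} \approx 1.9\,t_{\min} > 1.4\,t_{\min}.$$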
Very quick-n-dirty back of the napkin proof:
```
multiplier = 13.672021839106499
run_test(TIME_FUDJE = 1.4) = (1479, 1.5175746366315466 s)
multiplier = 9.686340899121179
multiplier = 1.0067496719348774
run_test(TIME_FUDJE = 1.0) = (2056, 2.1082144456986898 s)
```
So a `1.0` fudge factor ends up being slower.
NOTE: that model is not meant to be authoritative. I'm not sure if modelling system background load as a Bernoulli distribution is the right approach, but i do believe it is a reasonable first-order approximation.
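For anyone who wants to poke at this without the original script, here is a rough re-creation of the experiment in C++ (not the original model; the iteration cost, jitter, and Bernoulli "background load" parameters below are all made up):

```cpp
// Rough re-creation of the "back of the napkin" experiment (not the original
// script; the iteration cost, jitter and Bernoulli "background load" numbers
// below are made up). It simulates the retry loop and reports how much more
// than min_time we end up running for, for different fudge factors.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <random>

double SimulateOneBenchmark(double fudge, double min_time, std::mt19937& rng) {
  std::bernoulli_distribution loaded(0.1);             // 10% chance the system is busy
  std::normal_distribution<double> jitter(1.0, 0.05);  // per-attempt timing noise
  const double base_iter_time = 1e-3;                  // assumed 1 ms per iteration

  int64_t iters = 1;
  double total_spent = 0.0;
  for (;;) {
    double per_iter = base_iter_time * std::max(0.5, jitter(rng));
    if (loaded(rng)) per_iter *= 1.5;                   // background load slows us down
    const double run_time = per_iter * static_cast<double>(iters);
    total_spent += run_time;
    if (run_time >= min_time) break;                    // this attempt is long enough

    // Same shape as the prediction step: aim at fudge*min_time, but only
    // grow 10x while the attempt is at or below 10% of min_time.
    double multiplier = fudge * min_time / std::max(run_time, 1e-9);
    if (run_time / min_time <= 0.1) multiplier = 10.0;
    iters = static_cast<int64_t>(iters * multiplier) + 1;
  }
  return total_spent / min_time;                        // "time overspent" factor
}

int main() {
  std::mt19937 rng(42);
  const double min_time = 1.0;
  const int trials = 10000;
  for (double fudge : {1.0, 1.1, 1.4}) {
    double sum = 0.0;
    for (int t = 0; t < trials; ++t) sum += SimulateOneBenchmark(fudge, min_time, rng);
    std::cout << "fudge " << fudge << ": mean time overspent " << sum / trials << "x\n";
  }
  return 0;
}
```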
Some plotting:

Without touching the rest of the params there ^, with 10ms iterations (x 100 repetitions), we get:

*(plot not shown)*

but with 1ms iterations (x 1000 repetitions), we get:

*(plot not shown)*
Maybe the factor of 1.4 is too much, but i'm certain that 1.0 is just wrong.
And some more fancy-ness (most things are now sampled from distributions instead of being hardcoded; again, not 100% confident in the particular choices):

*(plot not shown)*

Again, i'm not quite confident that my "statistical model" is sufficiently precise for anything, but maybe `1.4` could be lowered somewhat, perhaps to `1.10`, theoretically.
I agree that the value 1.0 falls into the theoretical (non-real) world.
Looking at your plots, I also agree that something in the 1.02 - 1.10 range might be more sensible.
In any case, I think the best option could be to have an environment variable (or a program argument like `--benchmark_min_time_thres`) so the user can tweak this if needed, instead of a hardcoded constant value.
@dmah42 question: does google retain benchmark `.json` data for long-term performance tracking?
If it does, we could maybe at least derive a (with-noise) `dist_ITERATION_MEAN_TIME` from that real-world data.
We don't encode nearly enough info into that json (we are missing `min_time` and `use_real_time`), so it would be somewhat lossy, but i think we could side-step that issue.
I think only the following parts of said json output would be necessary:

```
{
  "run_name": <?>,
  "run_type": <?>,
  "threads": <?>,        % we were scaling by this factor until recently
  "iterations": <?>,
  "aggregate_name": <?>,
  "real_time": <?>,
  "cpu_time": <?>,
  "time_unit": <?>
}
```
I'm guessing you'll want to censor `"run_name"`; i think it would be fine to drop everything but `/min_time:<>` and `/use_real_time` from that string.
Out of all the benchmarks that are tracked, have all of them been run at least once since #1836? I think it would be least confusing to only look at the benchmarks *with* that change (or without it).
Is that possible?
@dmah42 (consider my previous question to be a theoretical question, not a request.)
@ruben-laso
> I agree that the value 1.0 falls into the theoretical (non-real) world.
> Looking at your plots, I also agree that something in the 1.02 - 1.10 range might be more sensible.
> In any case, I think the best option could be to have an environment variable (or a program argument like `--benchmark_min_time_thres`) so the user can tweak this if needed, instead of a hardcoded constant value.
That feels very much placebo-y. As far as i can tell, even with that fudge at 1.1, we still spend 1.5x the time we ideally need to (as per the final iteration time), which is not significantly better than what we'd spend at 1.4:
```
first(df, 100) = 3×4 DataFrame
 Row │ ITERATION_COUNT_MULTIPLIER  AUTHORITATIVE_TIME  TIME_FUDJE  time_overspent
     │ Int64                       Float64             Float64     Measurement…
─────┼────────────────────────────────────────────────────────────────────────────
   1 │                         10                 0.1         1.1       1.53±0.28
   2 │                         10                 0.1         1.4       1.82±0.28
   3 │                         10                 0.1         1.0       2.04±0.85
```
We could also lower `AUTHORITATIVE_TIME` and maybe change `ITERATION_COUNT_MULTIPLIER`, and lower the total time spent to as low as 1.2x of the ideal, but it really does feel ad-hoc to change anything without a better statistical model...
the right solution to this is to run until we reduce the confidence interval for the iterations within a run to a particular threshold (or max time to avoid infinite runs). anything other than this is just fine-tuning and is unlikely to scale across all the various use-cases.
if someone wants to reduce 1.4 to 1.2 or 1.1 or whatever then create a PR but i really don't think it matters all that much (and thank you @LebedevRI for basically proving that :D ).
> the right solution to this is to run until we reduce the confidence interval for the iterations within a run to a particular threshold (or max time to avoid infinite runs). anything other than this is just fine-tuning and is unlikely to scale across all the various use-cases.
> if someone wants to reduce 1.4 to 1.2 or 1.1 or whatever then create a PR but i really don't think it matters all that much (and thank you @LebedevRI for basically proving that :D ).
Yeah, that's my general understanding, too. I haven't yet tried modelling said different convergence approach.
Trivia bit: this `time_overspent` factor for the current implementation seems to be best described by an inverse Gaussian distribution.
Hello again,
I was playing with the internals of the library, and I think I now understand how it works.
From what I understood, every time we get a new prediction from `PredictNumItersNeeded()`, we "start from scratch": whatever measurements we had before are forgotten.
Let's use an example. Imagine we want to benchmark the function `foo()` for 1s. At some point, `PredictNumItersNeeded()` says that we should run the function 900 times. Unfortunately, that only gets us to 0.9s, so `PredictNumItersNeeded()` is called again and says we should have run `foo()` 1000 times to get to 1s.
Now, instead of doing 1000-900 = 100 extra iterations to fill the gap, we run the 1000 iterations in a single go, so by the end of the process we have run `foo()` more than 1900 times.
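Schematically, the flow I am describing looks something like this (my own paraphrase, not actual library code):

```cpp
// Paraphrase of the repetition loop as I understand it (not verbatim
// benchmark_runner.cc): every retry starts over, and only the final
// attempt's measurements are reported.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <functional>
#include <iostream>

struct Attempt {
  int64_t iters;
  double seconds;
};

Attempt RunNIterations(const std::function<void()>& fn, int64_t iters) {
  const auto start = std::chrono::steady_clock::now();
  for (int64_t i = 0; i < iters; ++i) fn();
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return {iters, elapsed.count()};
}

Attempt MeasureWithRestarts(const std::function<void()>& fn, double min_time) {
  int64_t iters = 1;
  for (;;) {
    Attempt a = RunNIterations(fn, iters);   // fresh attempt, nothing carried over
    if (a.seconds >= min_time) return a;     // only this attempt gets reported
    // The attempt is discarded and a larger iteration count is predicted
    // (same shape as PredictNumItersNeeded(): 1.4 fudge, 10x growth cap).
    const double multiplier = (a.seconds / min_time > 0.1)
                                  ? min_time * 1.4 / std::max(a.seconds, 1e-9)
                                  : 10.0;
    iters = static_cast<int64_t>(iters * multiplier) + 1;
  }
}

int main() {
  volatile double sink = 0.0;
  const Attempt a = MeasureWithRestarts(
      [&] { for (int i = 0; i < 1000; ++i) sink += i; }, /*min_time=*/0.5);
  std::cout << "reported: " << a.iters << " iterations over " << a.seconds << " s\n";
  return 0;
}
```

In the `foo()` example above, this loop would do the 900-iteration attempt, throw it away, and then do a full 1000-iteration attempt, for more than 1900 iterations in total.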
@LebedevRI That fits with what you mentioned here.
Is this correct?
If so, isn't this approach very inefficient? We have to choose between overestimating or starting from scratch many times.
yes that is correct, we start from scratch. i suppose we could change it to do an incremental run and check the running total of iterations and time instead. it would be a reasonably complex change but would improve the efficiency.
efficiency (in terms of how much we run) hasn't been a concern as we've focused on making the inner timing loop as fast as possible to get out of the way of the code under test.
> @LebedevRI That fits with what you mentioned here.
> Is this correct?
That is what i said, yes.
I'm still looking at that model, i need better estimates, but i think we might be able to get closer to the efficiency sweet-spot.
> yes that is correct, we start from scratch. i suppose we could change it to do an incremental run and check the running total of iterations and time instead. it would be a reasonably complex change but would improve the efficiency.
Can we, though? I was under the distinct impression that the current way it's done is intentional, and that it is a design guarantee that the benchmark-under-test will run for at least `min_time` consecutive seconds, and that we will report the measurements from that specific run.
As an idea, why not estimate the number of iterations needed to get just past the 10% threshold of the min time?
The way it is now, you multiply the iterations by 10 until surpassing the 10% threshold. Let's say you are between 2-9% of your min time. Then the multiplier will be 10, and you will move to the range of 20 to 90% of your min time.
Instead of multiplying by 10 "blindly", what about making an estimate of the iterations needed to get to, let's say, 15%? In that case, you have "wasted" only ~15% of the CPU time, but it is big enough to be considered "significant" by the current algorithm and to make a better prediction later (reducing the 40% overestimation to something smaller).
Maybe not the best solution, but I think this would alleviate the problem and should be easy to implement.
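A sketch of what I have in mind (illustrative only, not a patch against the real runner; the 15% probe target and the reduced final margin are just example values):

```cpp
// Sketch of the proposal above (illustrative, not a patch against the real
// runner). While the last attempt is at or below 10% of min_time, aim the
// next probe at ~15% of min_time using the measured per-iteration cost,
// instead of blindly multiplying the iteration count by 10.
#include <algorithm>
#include <cstdint>

int64_t PredictNumItersProposal(double min_time_s, double last_run_s,
                                int64_t last_run_iters) {
  const double per_iter_s =
      std::max(last_run_s, 1e-9) / static_cast<double>(last_run_iters);
  double target_s;
  if (last_run_s / min_time_s <= 0.1) {
    // Probe phase: ~15% of min_time is cheap, but already "significant",
    // so the next prediction can be based on a trustworthy measurement.
    target_s = 0.15 * min_time_s;
  } else {
    // Final phase: aim past min_time with a smaller safety margin than 1.4
    // (the exact value here is just an example).
    target_s = 1.1 * min_time_s;
  }
  return static_cast<int64_t>(target_s / per_iter_s) + 1;
}
```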
> Can we, though? I was under the distinct impression that the current way it's done is intentional, and that it is a design guarantee that the benchmark-under-test will run for at least `min_time` consecutive seconds, and that we will report the measurements from that specific run.
it's intentional in that we built it that way. but like i said above, if efficiency had been a concern we probably would have built it differently. this was the simplest approach at the time that worked.
why do you think the consecutive aspect matters, or reporting from a single run matters? it shouldn't, right?
I don't think it is really relevant in terms of the report, but it is in terms of computation time.
It is not only about the CPU time (and energy) spent without a real outcome, but also about the human time you need to wait for the full report. In my case, I have to benchmark a lot of cases and functions, and the full set of benchmarks might take several hours to complete. Saving time would be significant.
For example, below you can see the output of a little example to prove my point. Here I store:

- `LastRunTime`: execution time of the last batch (after the last call to `PredictNumItersNeeded()`). This number is approximately `Time * Iterations`.
- `TimesCalled`: the number of times the benchmark was executed. E.g., 8 would mean that we tried 8 different values of `iters` within the library.
- `TotalRunTime`: the total amount of time spent running each function. This is the sum over all the calls to `VectorLookup`, not only the reported ones.

You can see that to get a 1s benchmark (actually ~1.4s), we had to run each function for up to 2.50s. That's a 150% overhead! Several times we had to run each function for more than 2s. In my opinion, this overhead is far from negligible.
```
------------------------------------------------------------------------------------------------
Benchmark                 Time          CPU   Iterations   LastRunTime   TimesCalled   TotalRunTime
------------------------------------------------------------------------------------------------
VectorLookup/10         504 ns       504 ns      2752537       1.38822             8        1.96039
VectorLookup/20        3217 ns      3217 ns       453119       1.45753             7        1.80181
VectorLookup/30       10233 ns     10227 ns       140387       1.43665             7        2.54488
VectorLookup/40       22553 ns     22548 ns        61905       1.39615             6        1.64711
VectorLookup/50       42935 ns     42933 ns        31745       1.36298             6        1.85481
VectorLookup/60       73800 ns     73790 ns        19437       1.43445             6        2.23461
VectorLookup/70      119281 ns    119281 ns        11924       1.42231             5        1.55267
VectorLookup/80      173258 ns    173253 ns         8347       1.44619             5        1.63244
VectorLookup/90      244639 ns    244630 ns         5709       1.39665             5        1.66971
VectorLookup/100     336135 ns    336133 ns         4196       1.41043             5        1.78169
VectorLookup/110     429649 ns    429649 ns         3255       1.39851             5        1.8789
VectorLookup/120     575516 ns    575507 ns         2514       1.44685             5        2.06573
VectorLookup/130     720009 ns    719941 ns         1956       1.40834             5        2.20416
VectorLookup/140     965566 ns    965355 ns         1560       1.50629             5        2.50385
VectorLookup/150    1517023 ns   1516741 ns          911       1.38201             4        1.55215
VectorLookup/160    1603279 ns   1603062 ns          856       1.37241             4        1.55395
VectorLookup/170    1655967 ns   1655238 ns          837       1.38605             4        1.57171
VectorLookup/180    1932482 ns   1932454 ns          720       1.39139             4        1.60685
VectorLookup/190    2259932 ns   2259914 ns          621       1.40342             4        1.65399
VectorLookup/200    2631596 ns   2631435 ns          532       1.40001             4        1.69203
```
As i have said, i'm still looking at the current approach; we may be able to do better without any drastic changes.
(We are currently at the top-right corner of the bottom graph.)
One thing that i haven't thought about: there may be costly setup code outside of the benchmarked section, and its timings are not at all modelled here. Without accounting for that, decreasing `ITERATION_COUNT_MULTIPLIER` is perhaps a footgun.
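For example, something like this (illustrative only; `BuildLargeIndex` is a made-up expensive setup): everything before the `for (auto _ : state)` loop is re-executed on every restarted attempt, yet none of it shows up in the iteration timings that the prediction sees.

```cpp
// Illustrative only: BuildLargeIndex is a made-up expensive setup helper.
// The setup runs again on every restarted attempt, but only the timed loop
// below feeds into the iteration-count prediction.
#include <benchmark/benchmark.h>
#include <map>

static std::map<int, int> BuildLargeIndex(int n) {
  std::map<int, int> index;
  for (int i = 0; i < n; ++i) index[i] = i * i;
  return index;
}

static void BM_LookupWithCostlySetup(benchmark::State& state) {
  // Costly setup, outside of the benchmarked section.
  const auto index = BuildLargeIndex(static_cast<int>(state.range(0)));
  for (auto _ : state) {
    benchmark::DoNotOptimize(index.find(42));  // the only part that is timed
  }
}
BENCHMARK(BM_LookupWithCostlySetup)->Arg(1 << 20);
BENCHMARK_MAIN();
```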
> One thing that i haven't thought about: there may be costly setup code outside of the benchmarked section, and its timings are not at all modelled here. Without accounting for that, decreasing `ITERATION_COUNT_MULTIPLIER` is perhaps a footgun.
@LebedevRI that's a very good point; in that case the aim should be to reduce the number of times the function `PredictNumItersNeeded()` is called, right?
I implemented this idea:
> Instead of multiplying by 10 "blindly", what about making an estimate of the iterations needed to get to, let's say, 15%? In that case, you have "wasted" only ~15% of the CPU time, but it is big enough to be considered "significant" by the current algorithm and to make a better prediction later (reducing the 40% overestimation to something smaller).
It is a 3-phase approach to determine the number of iterations needed.
With the previous benchmark, it seems that the overhead is reduced from 50-150% to ~20%:
```
------------------------------------------------------------------------------------------------
Benchmark                 Time          CPU   Iterations   LastRunTime   TimesCalled   TotalRunTime
------------------------------------------------------------------------------------------------
VectorLookup/10         379 ns       379 ns      2755924       1.04467             8        1.2279
VectorLookup/20        2783 ns      2783 ns       361921       1.00727             7        1.1888
VectorLookup/30        8494 ns      8494 ns       123341       1.04762             7        1.29101
VectorLookup/40       20736 ns     20736 ns        49884       1.0344              6        1.20582
VectorLookup/50       39235 ns     39235 ns        27109       1.06364             6        1.25163
VectorLookup/60       68139 ns     68136 ns        15541       1.05895             6        1.28128
VectorLookup/70      104222 ns    104219 ns         9835       1.02502             5        1.19017
VectorLookup/80      160129 ns    160127 ns         6548       1.04853             5        1.21012
VectorLookup/90      218037 ns    218035 ns         4742       1.03394             5        1.20603
VectorLookup/100     305099 ns    305098 ns         3424       1.04466             5        1.22355
VectorLookup/110     401033 ns    401023 ns         2644       1.06033             5        1.25001
VectorLookup/120     540107 ns    540099 ns         2010       1.08562             5        1.28743
VectorLookup/130     652292 ns    652282 ns         1595       1.04041             5        1.26183
VectorLookup/140     835440 ns    835427 ns         1268       1.05934             5        1.29917
VectorLookup/150    1262388 ns   1262372 ns          808       1.02001             4        1.18841
VectorLookup/160    1380183 ns   1380157 ns          745       1.02824             4        1.18951
VectorLookup/170    1454011 ns   1453913 ns          707       1.02799             4        1.19032
VectorLookup/180    1740274 ns   1740252 ns          589       1.02502             4        1.19431
VectorLookup/190    2029587 ns   2029546 ns          510       1.03509             4        1.2063
VectorLookup/200    2393218 ns   2393183 ns          432       1.03387             4        1.20711
```
I would even say that step 1 could use the same kind of approach as the other steps, to reduce the number of times that the iterations have to be recalculated.
**Problem description**

In the function `PredictNumItersNeeded()` there is this `1.4` correction factor. This causes the time running the experiment to exceed the time specified by `--benchmark_min_time` by ~40%. Of course, `--benchmark_min_time` denotes the minimum amount of time to run the benchmark, but an overrun of 40% seems excessive. This is particularly relevant in supercomputers, where CPU time is expensive.

**Suggested solution**

I suggest either removing the correction factor or making it configurable (with a default value of 1.0).

**Example**

As shown in the following output (executed with `--benchmark_min_time=1s`), the real execution time is ~1.4s: $7039 \times 198481 = 1397107759$ ns, $8775 \times 160984 = 1412634600$ ns, ...

When executing the same code with `--benchmark_min_time=0.71s` ($1/1.4 \simeq 0.71$), the execution times are much closer to 1s: $7681 \times 134530 = 1033324930$ ns, $9265 \times 103410 = 958093650$ ns, ...