llbit opened 8 years ago
I believe the time measurement when using --meter time is broken. I took a look at the rbench.py script and I believe there is an error here:
elif meter == 'time':
    metrics = {}
    metrics['time'] = (bench_t - warmup_t) * 1000 / bench_rep
Note the bench_t - warmup_t expression. I think it should actually be just bench_t, because neither of those variables is a timestamp. Both bench_t and warmup_t are computed using this pattern:
start_t = time.time()
...
end_t = time.time()
bench_t = end_t - start_t
The reported time measurement is thus just the difference between the running time for the measured benchmark runs and the warmup runs.
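For reference, the elapsed-time pattern above can be reproduced in a self-contained sketch. This is not the real harness: the sleep is a hypothetical stand-in for one benchmark run, and the repetition counts are made up; only the final expression is the one from rbench.py.

```python
import time

def timed_invocation(runs, per_run=0.02):
    # Stand-in for one harness invocation: time `runs` iterations of a
    # workload (here a short sleep; hypothetical, not the real benchmark).
    start_t = time.time()
    for _ in range(runs):
        time.sleep(per_run)
    end_t = time.time()
    return end_t - start_t

warmup_rep, bench_rep = 2, 5
warmup_t = timed_invocation(warmup_rep)             # warmup runs only
bench_t = timed_invocation(warmup_rep + bench_rep)  # warmup + benchmark runs

# The expression from rbench.py: per-run benchmark time in milliseconds.
reported_ms = (bench_t - warmup_t) * 1000 / bench_rep
print(reported_ms)
```

Both timings are elapsed durations from separate invocations, which is why the subtraction looked suspicious at first sight.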
The reported times with --meter perf may also be incorrect, since the same expression is used in that case too. I did not take a closer look at the perf case, though; it may be correct.
Looks like the perf metrics are discarded. Line 217 seems to overwrite the perf metrics with the faulty --meter time metrics.
if meter == 'perf':
    lines = [line.strip() for line in open(perf_tmp)]
    metrics = process_perf_lines(lines)
    #os.remove(perf_tmp)
    metrics['time'] = (bench_t - warmup_t) * 1000 / bench_rep # <-- Line 217
EDIT: this is not an error; since perf does not record the 'time' metric, line 217 just adds the 'time' metric in addition to the perf metrics.
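The correction above can be seen with a plain dictionary: assigning a new key does not discard the existing perf counters. The counter values here are invented for illustration.

```python
# metrics as returned by process_perf_lines() -- hypothetical values.
metrics = {'page.faults': 10.6, 'cpu.migrations': 0.0}

# Line 217 then adds one more key; the perf counters survive.
metrics['time'] = 87.37

print(sorted(metrics))  # keys: cpu.migrations, page.faults, time
```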
I tried running with --meter perf and can confirm that the same error exists in this case too. I get negative metrics when running the scalar benchmarks. Here is part of my output:
> read.csv('rbench.csv')[c('benchmark', 'time', 'page.faults', 'cpu.migrations')]
benchmark time page.faults cpu.migrations
1 crt.R 19.56 0.4 -1.2
2 fib.R 87.37 10.6 0.0
3 fib_rec.R 1684.49 0.0 0.0
4 ForLoopAdd.R 466.24 567.0 3.2
5 gcd.R 461.76 129.6 0.2
6 gcd_rec.R -0.59 0.0 1.6
7 prime.R 1030.33 -24.2 0.0
I misunderstood the purpose of diffing the bench_t and warmup_t times. The warmup_t variable measures N warmup runs and the bench_t variable measures N warmup runs plus M benchmark runs, so bench_t - warmup_t is intended to give the time for the M benchmark runs only. However, this does not work as intended, as the negative statistics in my examples above show. The reason is that the running time is not the same for each run: N warmup runs take different times in different invocations of the R benchmark. This seems to be a fundamental problem in the benchmarking method.
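A minimal numerical sketch of the failure mode. The timings and repetition count are invented, but the final expression is the one from rbench.py:

```python
bench_rep = 8   # M benchmark runs (hypothetical)

# Invocation 1 times the N warmup runs on their own.
warmup_t = 0.250                 # seconds

# Invocation 2 times the same N warmup runs plus the M benchmark runs,
# but this time the warmup happened to run slightly faster.
bench_t = 0.244 + 0.004          # ~0.248 s total

# rbench.py's reported per-run time: negative, as in the table above.
reported_ms = (bench_t - warmup_t) * 1000 / bench_rep
print(reported_ms)  # ~ -0.25
```

Whenever the warmup variance between the two invocations exceeds the total benchmark time, the subtraction goes negative.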
The pull request I submitted does not fix this problem, but at least it does not give negative measurements for --meter {time,perf}. A similar fix should be done for --meter system.time.

A workaround for this issue is to run with --warmup_rep 0.
Since this issue is only a problem for short-running benchmarks, I think the preferred solution is to adjust --warmup_rep and --bench_rep for the short-running benchmarks to reduce their error margins. My previous pull request removed the warmup runs from the main benchmarking run, but that means the benchmark is more strongly affected by the warmup behaviour of the benchmarked RVM.

I think the fix for this issue is simply to document the negative-metrics problem in the README so that users of the benchmark can understand the cause and the possible solutions (either setting --warmup_rep 0, or increasing both --warmup_rep and --bench_rep for short-running benchmarks).
The recursive GCD benchmark is so ridiculously fast that the RVM startup/shutdown time dominates the benchmark running time. I had to increase bench_rep to over 1000 before I started seeing consistent results.
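A back-of-the-envelope error model shows why bench_rep has to grow so large for a sub-millisecond benchmark. The jitter and per-run figures below are assumptions for illustration, not measurements:

```python
# If the warmup phase varies by roughly +/- `jitter` seconds between the
# two invocations, the error on the reported per-run time is about
# jitter / bench_rep, so the relative error shrinks as bench_rep grows.
jitter = 0.006         # warmup variance in seconds (hypothetical)
true_per_run = 0.0005  # a 0.5 ms benchmark, in gcd_rec.R's ballpark (hypothetical)

for bench_rep in (8, 100, 1000):
    rel_error = (jitter / bench_rep) / true_per_run
    print(bench_rep, f'{rel_error:.1%}')
```

Under these assumptions the relative error only drops to around one percent once bench_rep reaches about 1000, which matches the behaviour I observed.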
I ran the scalar/gcd/gcd_rec.R benchmark, and it gave a negative running time using --rvm R. Is this an error in the benchmark harness or in the benchmark itself?