llbit opened 8 years ago
I believe the time measurement when using --meter time is broken. I took a look at the rbench.py script and I believe there is an error here:
elif meter == 'time':
    metrics = {}
    metrics['time'] = (bench_t - warmup_t) * 1000 / bench_rep
Note the bench_t - warmup_t expression. I think it should actually be just bench_t, because neither of those variables is a timestamp. Both bench_t and warmup_t are computed using this pattern:
start_t = time.time()
...
end_t = time.time()
bench_t = end_t - start_t
The reported time measurement is thus just the difference between the running time for the measured benchmark runs and the warmup runs.
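For reference, the elapsed-time pattern above can be reproduced in a self-contained sketch. This is not the real harness: the sleep is a hypothetical stand-in for one benchmark run, and the repetition counts are made up; only the final expression is the one from rbench.py.

```python
import time

def timed_invocation(runs, per_run=0.02):
    # Stand-in for one harness invocation: time `runs` iterations of a
    # workload (here a short sleep; hypothetical, not the real benchmark).
    start_t = time.time()
    for _ in range(runs):
        time.sleep(per_run)
    end_t = time.time()
    return end_t - start_t

warmup_rep, bench_rep = 2, 5
warmup_t = timed_invocation(warmup_rep)             # warmup runs only
bench_t = timed_invocation(warmup_rep + bench_rep)  # warmup + benchmark runs

# The expression from rbench.py: per-run benchmark time in milliseconds.
reported_ms = (bench_t - warmup_t) * 1000 / bench_rep
print(reported_ms)
```

Both timings are elapsed durations from separate invocations, which is why the subtraction looked suspicious at first sight.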
The reported times with --meter perf may also be incorrect, since the same expression is used in that case too. I did not take a closer look at the perf case, though; it may be correct.
Looks like the perf metrics are discarded. Line 217 seems to overwrite the perf metrics with the faulty --meter time metrics.
if meter == 'perf':
    lines = [line.strip() for line in open(perf_tmp)]
    metrics = process_perf_lines(lines)
    #os.remove(perf_tmp)
    metrics['time'] = (bench_t - warmup_t) * 1000 / bench_rep # <-- Line 217
EDIT: this is not an error; since perf does not record the 'time' metric, line 217 just adds the 'time' metric in addition to the perf metrics.
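The correction above can be seen with a plain dictionary: assigning a new key does not discard the existing perf counters. The counter values here are invented for illustration.

```python
# metrics as returned by process_perf_lines() -- hypothetical values.
metrics = {'page.faults': 10.6, 'cpu.migrations': 0.0}

# Line 217 then adds one more key; the perf counters survive.
metrics['time'] = 87.37

print(sorted(metrics))  # keys: cpu.migrations, page.faults, time
```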
I tried running with --meter perf and can confirm that the same error exists in this case too. I get negative metrics when running the scalar benchmarks. Here is part of my output:
> read.csv('rbench.csv')[c('benchmark', 'time', 'page.faults', 'cpu.migrations')]
benchmark time page.faults cpu.migrations
1 crt.R 19.56 0.4 -1.2
2 fib.R 87.37 10.6 0.0
3 fib_rec.R 1684.49 0.0 0.0
4 ForLoopAdd.R 466.24 567.0 3.2
5 gcd.R 461.76 129.6 0.2
6 gcd_rec.R -0.59 0.0 1.6
7 prime.R 1030.33 -24.2 0.0
I misunderstood the purpose of diffing the bench_t and warmup_t times. The warmup_t variable measures N warmup runs and the bench_t variable measures N warmup runs plus M benchmark runs, so bench_t - warmup_t is intended to give the time for the M benchmark runs only. However, this does not work as intended, as the negative statistics in my examples above show. The reason is that the running time is not the same for each run: N warmup runs take different times in different invocations of the R benchmark. This seems to be a fundamental problem in the benchmarking method.
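A minimal numerical sketch of the failure mode. The timings and repetition count are invented, but the final expression is the one from rbench.py:

```python
bench_rep = 8   # M benchmark runs (hypothetical)

# Invocation 1 times the N warmup runs on their own.
warmup_t = 0.250                 # seconds

# Invocation 2 times the same N warmup runs plus the M benchmark runs,
# but this time the warmup happened to run slightly faster.
bench_t = 0.244 + 0.004          # ~0.248 s total

# rbench.py's reported per-run time: negative, as in the table above.
reported_ms = (bench_t - warmup_t) * 1000 / bench_rep
print(reported_ms)  # ~ -0.25
```

Whenever the warmup variance between the two invocations exceeds the total benchmark time, the subtraction goes negative.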
The pull request I submitted does not fix this problem, but at least it does not give negative measurements for --meter {time,perf}. A similar fix should be done for --meter system.time.

A workaround for this issue is to run with --warmup_rep 0.
Since this issue is only a problem for short-running benchmarks, I think the preferred solution is to adjust --warmup_rep and --bench_rep for the short-running benchmarks to reduce their error margins. My previous pull request removed the warmup runs from the main benchmarking run, but that means the benchmark is more strongly affected by the warmup behaviour of the benchmarked RVM.

I think the fix for this issue is simply to document the negative-metrics problem in the README so that users of the benchmark can understand the cause and the possible solutions (either setting --warmup_rep 0, or increasing both --warmup_rep and --bench_rep for short-running benchmarks).
The recursive GCD benchmark is so ridiculously fast that the RVM startup/shutdown time dominates the benchmark running time. I had to increase bench_rep to over 1000 before I started seeing consistent results.
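A back-of-the-envelope error model shows why bench_rep has to grow so large for a sub-millisecond benchmark. The jitter and per-run figures below are assumptions for illustration, not measurements:

```python
# If the warmup phase varies by roughly +/- `jitter` seconds between the
# two invocations, the error on the reported per-run time is about
# jitter / bench_rep, so the relative error shrinks as bench_rep grows.
jitter = 0.006         # warmup variance in seconds (hypothetical)
true_per_run = 0.0005  # a 0.5 ms benchmark, in gcd_rec.R's ballpark (hypothetical)

for bench_rep in (8, 100, 1000):
    rel_error = (jitter / bench_rep) / true_per_run
    print(bench_rep, f'{rel_error:.1%}')
```

Under these assumptions the relative error only drops to around one percent once bench_rep reaches about 1000, which matches the behaviour I observed.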
I ran the scalar/gcd/gcd_rec.R benchmark, and it gave a negative running time using --rvm R. Is this an error in the benchmark harness or in the benchmark itself?