mageec / beebs

A set of benchmarks chosen to show the energy consumption of embedded devices under different conditions
GNU General Public License v3.0

Some benchmarks should be deleted #7

Open jpallister opened 10 years ago

jpallister commented 10 years ago

Some of the benchmarks don't do much and should be deleted.

Candidates for deletion:

fibcall
gdb-*
lcdnum
prime
ns
T-J-Teru commented 10 years ago

I wonder about the criteria for deletion... As a starting point, on x86-64 I ran each test with instruction counting, and got the following graph: [image: icounts]

This shows the number of instructions executed for each test. The gdb-* tests are all low, but then so are a lot of others.

What this graph does not currently take into account is the number of iterations.
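A sketch of one way to gather such counts is below, using Linux `perf stat` as a stand-in for whatever counter was actually used; the tool choice and the `./bin` layout are assumptions for illustration, not details from this thread.

```python
#!/usr/bin/env python3
"""Collect a total instruction count for each test binary.

Assumptions (not from the thread): Linux `perf stat` is available, and the
built benchmark binaries live in ./bin with nothing else in that directory.
"""
import re
import subprocess
from pathlib import Path

TEST_DIR = Path("./bin")  # hypothetical location of the built benchmarks

def count_instructions(binary: Path) -> int:
    # perf stat writes its counter summary to stderr.
    result = subprocess.run(
        ["perf", "stat", "-e", "instructions", str(binary)],
        capture_output=True, text=True, check=True,
    )
    match = re.search(r"([\d,]+)\s+instructions", result.stderr)
    return int(match.group(1).replace(",", ""))

if __name__ == "__main__":
    counts = {b.name: count_instructions(b) for b in sorted(TEST_DIR.iterdir())}
    for name, total in sorted(counts.items(), key=lambda kv: kv[1]):
        print(f"{name:20s} {total:>15d}")
```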

jpallister commented 10 years ago

Nice graph - there is some really large variation there...

My main thought was that these benchmarks don't seem to be representative of what would be done on an embedded system. The GDB tests especially seem to be aimed at finding bugs in GDB rather than at benchmarking.

Also, some of these tests are so short that we end up benchmarking more of the loop overhead than the benchmark itself, e.g. lcdnum. The ns benchmark seems to be microbenchmarking multidimensional arrays, too.

Prime and fibcall - perhaps they should stay?

T-J-Teru commented 10 years ago

Following on from my previous graph, I have two new ones. These new graphs also take into account the number of times the benchmark function is run. The instruction count is still the total over the whole execution; my hope is that the startup and shutdown overhead is minimal compared to the benchmark loop.

The first graph shows instructions per iteration of the benchmark function: [image: ipi]
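To be concrete, the figure plotted here is just the whole-run total divided by the iteration count, so startup and shutdown instructions are folded into it. A trivial sketch, with made-up numbers rather than measured data:

```python
# Per-iteration figure used in this graph: whole-run instruction total divided
# by the number of benchmark iterations.  Startup and shutdown instructions
# are folded in, which is why it is only an approximation.
def instructions_per_iteration(total_instructions: int, iterations: int) -> float:
    return total_instructions / iterations

# Illustrative numbers only: a test that executed 8,192,000 instructions over
# 4096 iterations averages 2000 instructions per iteration.
print(instructions_per_iteration(8_192_000, 4096))  # 2000.0
```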

The next graph is really the same as the original graph I posted, except that the order of the tests is the same as in the instructions per iteration graph: [image: icounts]

As for how this helps with selecting deletion candidates, I agree that the gdb-* tests are particularly artificial and should probably go; I'll get that done. After that, if we are concerned with a low work:loop ratio, the first graph suggests an obvious set of candidates for deletion.

jpallister commented 10 years ago

Looks good, what are you using to plot the graphs? Gnuplot?

It would be interesting to see if the same work:loop ratio applies for other architectures too.

T-J-Teru commented 10 years ago

I'm using R (ggplot2 library) for drawing the graphs.

My initial interest was trying to "balance" out the benchmarks based on instructions executed. It's not going to be perfect, but things are so unbalanced at the moment that just getting the instruction totals closer together feels like it should make things a little better.

Currently my world looks like this: [image: balance]

The anomaly is bubblesort; the only difference between bubblesort and bubblesort_quick is the number of iterations performed. I think we should remove bubblesort_quick and balance bubblesort with the rest of the tests.

There's also a long tail of tests that are running the current maximum of 4096 iterations, but are nowhere near the average instruction count.
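Roughly, the balancing calculation amounts to the sketch below: pick an iteration count per test so that the total instructions land near a common target. The target value and the rounding are illustrative placeholders; only the 4096-iteration cap comes from the current suite.

```python
# Minimal sketch of the balancing idea.  TARGET_TOTAL is a hypothetical
# placeholder, not a value from the suite; only the 4096-iteration cap is
# taken from the discussion above.
MAX_ITERATIONS = 4096
TARGET_TOTAL = 10_000_000  # hypothetical target instruction count per test

def balanced_iterations(instructions_per_iteration: float) -> int:
    wanted = round(TARGET_TOTAL / instructions_per_iteration)
    return max(1, min(MAX_ITERATIONS, wanted))

# Very cheap tests hit the cap and still fall well short of the target,
# which is the "long tail" mentioned above.
print(balanced_iterations(50.0))    # -> 4096 (capped)
print(balanced_iterations(5000.0))  # -> 2000
```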

T-J-Teru commented 10 years ago

Last look at this instruction count data for a while. This time I wanted to get a better feel for the per-iteration instruction count, and to try to estimate the startup instruction count. I modified each test and ran each test binary three times: first with 5,000 benchmark iterations, then with 10,000, and finally with 15,000.

Then I compared the instruction counts for 5k, 10k, and 15k iterations to establish an estimate of the instructions per iteration, and used this figure to estimate the startup cost.

This graph shows the instructions per iteration. It is broadly in line with the previous instructions-per-iteration graph I posted, even though that graph did not try to factor out the startup cost: [image: periteration]

The next graph shows the startup cost for each test: [image: startupcost]

Notice that compress and whetstone seem to have unusually high startup costs, and that sha appears to have a negative startup cost. I believe the negative number reflects the fact that the figures in both graphs are just estimates, based on increasing the number of iterations by 5,000 at a time. The instructions per iteration is (obviously) not constant, so in the case of sha we are overestimating the cost per iteration. Getting more reliable figures would require substantially more data collection.
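For reference, a minimal sketch of the kind of fit described above is shown below; the sample numbers are invented to illustrate how a growing per-iteration cost pushes the intercept negative, as seen for sha, and are not measurements from BEEBS.

```python
# Model total instructions as: total(N) = startup + per_iteration * N,
# and fit a straight line through counts measured at N = 5,000, 10,000
# and 15,000 iterations.

def fit_startup_and_per_iteration(samples):
    """samples: list of (iterations, total_instructions) pairs."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    per_iteration = (
        sum((x - mean_x) * (y - mean_y) for x, y in samples)
        / sum((x - mean_x) ** 2 for x, _ in samples)
    )
    startup = mean_y - per_iteration * mean_x
    return per_iteration, startup

# Invented sha-like data: the per-iteration cost creeps up with N, so the
# fitted slope overshoots the early iterations and the "startup" intercept
# comes out below zero.
samples = [(5_000, 5_200_000), (10_000, 10_600_000), (15_000, 16_200_000)]
per_iteration, startup = fit_startup_and_per_iteration(samples)
print(per_iteration, startup)  # roughly 1100.0 and -333333.3
```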

EDIT: The strange behaviour of compress is a bug, fixed in 3490aa4dfe7d01d593cf167c78528ce0fd075ecf.