jpallister opened 10 years ago
I wonder about the criteria for deletion... As a starting point, on x86-64 I ran each test with instruction counting, and got the following graph:
This shows the number of instructions executed for each test. The gdb-* tests are all low, but then so are a lot of others.
What this graph does not currently take into account is the number of iterations.
Nice graph - there is some really large variation there...
My main thought was that these benchmarks don't seem to be representative of what would be done on an embedded system. The GDB tests especially seem intended for finding bugs in GDB rather than for benchmarking.
Also, some of these tests are so short that we end up benchmarking more of the loop overhead than the benchmark itself, e.g. lcdnum. The ns benchmark also seems to be microbenchmarking multidimensional arrays.
Prime and fibcall - perhaps they should stay?
Following on from my previous graph, I have two new ones. These also include information on the number of times the benchmark function is run. The instruction count is still the total over the whole execution; my hope is that the startup and shutdown overhead is small compared to the benchmark loop.
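The per-iteration figure can be sketched as a simple division of the measured total by the iteration count. The test names and numbers below are invented for illustration; the real data came from instruction counting on x86-64:

```python
# Estimate instructions per iteration as total / iterations.
# Note: startup/shutdown instructions are included in the totals,
# so this slightly over-estimates the per-iteration cost.
totals = {"crc32": 1_200_000, "fibcall": 45_000, "lcdnum": 12_000}   # hypothetical
iterations = {"crc32": 4096, "fibcall": 4096, "lcdnum": 4096}        # hypothetical

per_iteration = {name: totals[name] / iterations[name] for name in totals}
for name, ipi in sorted(per_iteration.items(), key=lambda kv: kv[1]):
    print(f"{name}: {ipi:.1f} instructions/iteration")
```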
The first graph shows instructions per iteration of the benchmark function:
The next graph is essentially the same as the original graph I posted, except that the tests are ordered as in the instructions-per-iteration graph.
As for how this helps with selecting deletion candidates: I agree that the gdb-* tests are particularly artificial and should probably go, and I'll get that done. After that, if we are concerned with a low work:loop ratio, then the first graph suggests an obvious set of candidates for deletion.
Looks good, what are you using to plot the graphs? Gnuplot?
It would be interesting to see if the same work:loop ratio applies for other architectures too.
I'm using R (ggplot2 library) for drawing the graphs.
My initial interest was in trying to "balance" the benchmarks based on instructions executed. It won't be perfect, but things are so unbalanced at the moment that just bringing the instruction totals closer together should make things a little better.
Currently my world looks like this:
The anomaly is bubblesort: the only difference between bubblesort and bubblesort_quick is the number of iterations performed. I think we should remove bubblesort_quick and balance bubblesort with the rest of the tests.
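Balancing along these lines amounts to picking a per-test iteration count so the totals come out roughly equal. A minimal sketch, with invented per-iteration figures and an invented target total:

```python
# Pick an iteration count per test so that the total executed
# instructions come out roughly equal across tests.
# All figures here are hypothetical.
per_iteration = {"bubblesort": 3000.0, "prime": 150.0, "lcdnum": 3.0}
target_total = 10_000_000  # desired total instructions per test

iterations = {name: max(1, round(target_total / ipi))
              for name, ipi in per_iteration.items()}
for name, n in iterations.items():
    print(f"{name}: run {n} iterations (~{n * per_iteration[name]:.0f} instructions)")
```

In practice a maximum iteration cap (like the current 4096) would also have to be raised or removed for the cheapest tests to catch up.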
There's also a long tail of tests that are running the current maximum of 4096 iterations, but are nowhere near the average instruction count.
Last look at this instruction count data for a while. This time I wanted to get a better feel for the per-iteration instruction count, and to estimate the startup instruction count. I modified each test and ran each test binary three times: first with 5,000 benchmark iterations, then with 10,000, and finally with 15,000.
Then I compared the instruction counts for the 5k, 10k, and 15k runs to establish an estimate of the instructions per iteration, and used this figure to estimate the startup cost.
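Under the model total(n) ≈ startup + n × per_iteration, two of the runs pin down both figures and the third gives a consistency check. A sketch of the arithmetic, with invented counts:

```python
# Model: total(n) = startup + n * per_iteration.
# The counts below are invented; the real ones came from three runs
# at 5,000 / 10,000 / 15,000 iterations.
runs = {5_000: 2_550_000, 10_000: 5_050_000, 15_000: 7_550_000}

n_lo, n_hi = 5_000, 15_000
per_iteration = (runs[n_hi] - runs[n_lo]) / (n_hi - n_lo)
startup = runs[n_lo] - n_lo * per_iteration

print(f"per iteration: {per_iteration:.1f}")   # 500.0
print(f"startup cost:  {startup:.0f}")         # 50000
# Sanity check against the middle run:
predicted = startup + 10_000 * per_iteration
print(f"10k run: predicted {predicted:.0f}, measured {runs[10_000]}")
```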
This graph shows the instructions per iteration. It is broadly in line with the previous instructions-per-iteration graph I posted, even though that graph did not try to factor out the startup cost:
The next graph shows the startup cost for each test:
Notice that compress and whetstone seem to have unusually high startup costs, and that sha appears to have a negative startup cost. I believe the negative number reflects that the figures in both graphs are only estimates, based on increasing the number of iterations in steps of 5,000. The instructions per iteration is (obviously) not constant, so in the case of sha we are over-estimating the cost per iteration. Getting more reliable figures would require substantially more data collection.
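A tiny numeric illustration of how a non-constant per-iteration cost can drive the estimate negative (all numbers invented): if the true cost per iteration grows with n, the slope measured between the 5k and 15k runs over-estimates the cost of the early iterations, and extrapolating back to n = 0 lands below the true startup cost.

```python
# True model (invented): startup of 100 instructions, base cost of
# 10 per iteration, plus a small super-linear term (n*n/10000).
# Integer math keeps the arithmetic exact.
def total(n):
    return 100 + 10 * n + n * n // 10_000

t5, t15 = total(5_000), total(15_000)
slope = (t15 - t5) / 10_000          # over-estimates the per-iteration cost
intercept = t5 - 5_000 * slope       # estimated "startup"

print(f"estimated per iteration: {slope}")   # 12.0 (true base cost is 10)
print(f"estimated startup: {intercept}")     # -7400.0, despite a true startup of 100
```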
EDIT: The strange behaviour of compress is a bug, fixed in 3490aa4dfe7d01d593cf167c78528ce0fd075ecf.
Some of the benchmarks don't do much and should be deleted.
Candidates for deletion: