olajep opened this issue 9 years ago

Variance between runs is way too high.

I guess we could wrap the function in a for loop and use the lowest measurement; that would certainly improve things.

But I think we'd better use performance counters instead. Epiphany isn't affected by this since we already use CTIMERs there.

PAPI seems to have the cross-platform support we need: http://icl.cs.utk.edu/papi/
@olajep What do you think about timing several runs of the functions? Most of them run in the nanosecond range, so maybe running them 1000 times (for example) would improve the precision of the measurements. Also, running the benchmark a few times is a good idea to get some statistics. Maybe report minimum, median, and maximum run times. What I mean is something like:
for (i = 0; i < 100; i++) {
    item_preface(&data[i], ...);   /* start the timer for run i */
    for (j = 0; j < 1000; j++) {
        fun();                     /* function under test */
    }
    item_done(&data[i], ...);      /* stop the timer for run i */
}
PAPI looks cool... Maybe we could take only the ARM/x86 parts (I see they don't support Windows anymore, not sure if that would be an issue).
@lchamon
On 2015-07-22 02:51, Chamon wrote:
> @olajep What do you think about timing several runs of the functions? Most of them run in the nanosecond range, so maybe running them 1000 times (for example) would improve the precision of the measurements.
Yes, that will certainly improve things. I did some testing now, and even with 32000 iterations there can be a 20 percent difference between two runs (most are pretty close, however). That is a lot better than before (>100%!). But 32000 iterations takes way too long for several functions: benchmarking one of the image functions takes almost two minutes.
> Also, running the benchmark a few times is a good idea to get some statistics. Maybe report minimum, median and maximum run times.
I'm not so sure about that; we should only care about the lowest measurement (sketched at the end of this reply). Everything else is noise (e.g., context switches, some other process evicting our data from the L2 cache, ...).
> What I mean is something like:
>
>     for (i = 0; i < 100; i++) {
>         item_preface(&data[i], ...);
>         for (j = 0; j < 1000; j++) {
>             fun();
>         }
>         item_done(&data[i], ...);
>     }

Yup, we need the loop, no question about that. The question is how many iterations.
> PAPI looks cool... Maybe we could take only the ARM/x86 parts (I see they don't support Windows anymore, not sure if that would be an issue).
I believe that benchmarking using clock cycles instead will give more stable results. We also need to size the benchmarks so that: i) they fit in cache (we don't want to benchmark the memory subsystem), and ii) one function call is short enough that it isn't always preempted by the kernel.
I think that should do it.
Cheers, Ola
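A minimal sketch of the lowest-measurement idea from the reply above; time_one_run() is a hypothetical helper that times one batch of calls to the function under test:

#include <stdint.h>

/* Hypothetical helper: run one batch of calls to the function under
 * test and return the elapsed time in nanoseconds. */
uint64_t time_one_run(void);

/* Keep only the minimum over nruns measurements; everything above the
 * minimum is treated as noise (context switches, cache evictions, ...). */
uint64_t benchmark_min(int nruns)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < nruns; i++) {
        uint64_t t = time_one_run();
        if (t < best)
            best = t;
    }
    return best;
}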
@olajep Hmmm... I thought you could maybe limit the time instead of the number of iterations, but it might complicate things more than it solves, I don't know. My idea was (in retrospect, maybe not a really good one):
volatile int i = 0; /* volatile so the compiler doesn't optimize the loops away */
volatile int j = 0;

item_preface(&data, item);
while (data->end - data->start < MAX_TIME) {
    item->benchmark(&spec);
    item_done(&data, &spec, item->name);
    i++;
}

/* Replay an empty loop with the same trip count to measure the loop
   overhead, then subtract it from the measurement. */
loop_time = platform_clock();
while (i - j > 0)
    j++;
loop_time = platform_clock() - loop_time;
data->end -= loop_time;
Maybe cycle counts are the way to go.
The higher the resolution of the timing, the fewer measurements you need to make. The Parallella has performance counters, right? And PAPI supports Linux on ARM. So it sounds like the right solution. :+1:
On ARM, PAPI uses the perf subsystem of the Linux kernel. If you want, you can use perf directly. The SUPERCOP benchmark software does this (look in the file supercop-20141124/cpucycles/perfevent.c). But the perf API is not as nice as PAPI.
The Linux/ARM timers from PAPI appear to use clock_gettime or gettimeofday (depending on what's available). Clock cycles are estimated by multiplying the time in usec by the clock frequency (viz. linux_timer.c, lines 288 and 260). I could be wrong (someone should check), but at least in this case it doesn't seem to help much. Reading the PMU on ARM can't be done from user space. x86 provides a time-stamp-reading instruction (rdtsc) and that's what PAPI uses.

I guess for ARM there is no easy way around gettimeofday/clock_gettime or perf_event. I guess we can only test to see if there's a big difference...
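For reference, a sketch of the rdtsc path on x86 (this is the standard inline-assembly idiom, not code taken from PAPI):

#include <stdint.h>

/* Read the x86 time stamp counter (raw cycle count). */
static inline uint64_t read_tsc(void)
{
#if defined(__x86_64__)
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
#elif defined(__i386__)
    uint64_t tsc;
    __asm__ __volatile__ ("rdtsc" : "=A" (tsc));
    return tsc;
#else
#error "rdtsc is only available on x86"
#endif
}

Measuring a function is then: cycles = read_tsc(); fun(); cycles = read_tsc() - cycles;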
I haven't used PAPI yet, but it should be possible to check at run time whether it supports hardware counters, by checking if PAPI_num_counters returns a negative value or zero. But first you need perf support in the kernel; on Debian you get that with the linux-tools-$(uname -r) package. I'll try it tomorrow.
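A sketch of that run-time check, assuming the high-level PAPI API (link with -lpapi; PAPI_num_counters initializes the library if needed):

#include <stdio.h>
#include <papi.h>

int main(void)
{
    /* Returns the number of hardware counters, or an error code. */
    int n = PAPI_num_counters();
    if (n <= 0) {
        fprintf(stderr, "no hardware counter support (code %d)\n", n);
        return 1;
    }
    printf("%d hardware counters available\n", n);
    return 0;
}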
@eliteraspberries I haven't used it either, so take what I say with a grain of salt. I was checking out the source to see how they would access the ARM performance counters from user space, but it seems they don't (like you said, they only compile and run num_counters to be sure).
To me, perf appears to be the way to go on ARM/Linux (using perf_event_open as per http://web.eece.maine.edu/~vweaver/projects/perf_events/programming.html) and rdtsc inline assembly for x86. I'll see if I can contribute something more concrete, but I'm completely swamped for the next 2 weeks.
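A minimal sketch of the perf_event_open approach, following the pattern from the tutorial linked above (error handling kept minimal):

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* perf_event_open has no glibc wrapper; call it via syscall(2). */
static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES; /* count CPU cycles */
    attr.disabled = 1;                      /* start disabled */
    attr.exclude_kernel = 1;                /* user-space cycles only */
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    uint64_t cycles;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);  /* enable only around the code under test */
    /* fun();  <- function under test goes here */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &cycles, sizeof(cycles));
    printf("%llu cycles\n", (unsigned long long)cycles);

    close(fd);
    return 0;
}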
I measure my code's performance, such as speed, and the timing is occasionally off. I believe 10 iterations is the minimum, 100 is enough, and 1000 is a bit too much. A mean with a plus/minus deviation will suffice to have one column of data for speed. Below is example C++ code I used to time my code:

start = std::clock();
float answer1 = erff(w); // standard algorithm here
duration = (std::clock() - start) / (double) CLOCKS_PER_SEC;
I have tested an inline assembly solution for x86 processors that have rdtsc. To get some statistics, I timed 3 functions (10 ms, 100 ms, and 1 s) using clock(), rdtsc, and QueryPerformanceCounter() from the Windows API (I'm on Windows, so no gettime()). All statistics are calculated over 200 independent measurements, compiled with -O3, and are summarized in the table and plots below.
All methods hit the same average measurement (some with more precision than others). For fast functions, the variance is considerably lower using rdtsc or QueryPerformanceCounter(), which suggests that with the perf_event_open method on Linux we could maybe avoid the inline assembly altogether. I'll try to get back to you guys in a few weeks with tests on that (if no one has done it before). When functions hit the 1 s mark, though, the measurement variances are basically the same. The precision of sub-microsecond measurements is not great, but I haven't tested tens of microseconds yet.
@olajep I pushed the solution using rdtsc (the method used by PAPI) to a branch on my fork (benchmark.c), but I'm still new to Autotools and have no idea how to use autoconf macros to define __x86_64__ (64-bit processor), __i386__ (32-bit processor), and CPU_FREQ (CPU clock frequency). (See the note after the table.)
Length | Method | Mean (s) | Variance (ratio to rdtsc)
---|---|---|---
1000 ms | clock() | 1.60234500 | 1.520198e-05 (1.748116)
1000 ms | rdtsc | 1.60229626 | 8.696210e-06 (1.0)
1000 ms | WinQPC | 1.60208964 | 1.735461e-05 (1.9956522)
100 ms | clock() | 0.16021000 | 3.433065e-06 (6.507617)
100 ms | rdtsc | 0.16018014 | 5.275457e-07 (1.0)
100 ms | WinQPC | 0.16020998 | 6.624749e-07 (1.2557677)
10 ms | clock() | 0.01605000 | 4.168342e-06 (37.581361)
10 ms | rdtsc | 0.01610825 | 1.109151e-07 (1.0)
10 ms | WinQPC | 0.01608080 | 6.404925e-08 (0.5774617)
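A note on the macros mentioned above: __x86_64__ and __i386__ are predefined by GCC and Clang, so no autoconf test is needed for them; only CPU_FREQ has to come from the build system. A hedged sketch of how a build-supplied CPU_FREQ could be used (seconds_to_cycles is a hypothetical helper, and the frequency shown is only an example):

#include <stdint.h>

/* CPU_FREQ in Hz, supplied by the build system, e.g.
 * CPPFLAGS="-DCPU_FREQ=1000000000ULL" for a 1 GHz part. */
#ifndef CPU_FREQ
#error "define CPU_FREQ (CPU clock frequency in Hz)"
#endif

/* Estimate clock cycles from a wall-clock time in seconds
 * (the same time-times-frequency estimate PAPI's Linux timers use). */
static inline uint64_t seconds_to_cycles(double seconds)
{
    return (uint64_t)(seconds * (double)CPU_FREQ);
}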