FIR-Filter cycle counts: C-version faster than NEON-version

mdupuy commented 8 years ago

Adding issue from this thread: https://community.arm.com/thread/9328

I'm currently trying to measure cycle counts for FIR filtering with the NE10 library. My target is a Raspberry Pi 2 (ARM Cortex-A7) running Raspbian. I enabled the Cortex-A7 performance counter register so that I can read the cycle count before and after the filter executes.
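
A minimal sketch of how such a measurement could look, not the code actually used in this report. It assumes that user-mode access to the ARMv7 cycle counter (PMCCNTR) has already been enabled from kernel space and that the counter is running. The names `read_cycles`, `average_cycles`, `spin` and the `USE_PMCCNTR` macro are hypothetical. Without `USE_PMCCNTR` the sketch falls back to the standard `clock()` function so that it still builds and runs on any machine.

```c
#include <stdint.h>
#include <time.h>

/* Return a timestamp. With USE_PMCCNTR defined on 32-bit ARM this reads the
 * PMCCNTR cycle counter; that instruction traps unless user-mode access was
 * enabled beforehand (assumption: done by a kernel module).
 * Otherwise fall back to clock(), which counts CPU-time ticks, not cycles. */
static inline uint64_t read_cycles(void)
{
#if defined(__arm__) && defined(USE_PMCCNTR)
    uint32_t cc;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(cc));
    return cc;
#else
    return (uint64_t)clock();
#endif
}

/* Hypothetical callback type wrapping the code under test. */
typedef void (*kernel_fn)(void *ctx);

/* Run fn once as a warm-up, then return the average tick count over `runs`
 * timed calls. Wraparound of the 32-bit counter is ignored in this sketch. */
static uint64_t average_cycles(kernel_fn fn, void *ctx, int runs)
{
    uint64_t total = 0;
    fn(ctx); /* warm caches and branch predictors before timing */
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = read_cycles();
        fn(ctx);
        uint64_t t1 = read_cycles();
        total += t1 - t0;
    }
    return total / (uint64_t)runs;
}

/* Placeholder workload; a real benchmark would call the FIR routine here. */
static void spin(void *ctx)
{
    volatile int sink = 0;
    int n = *(int *)ctx;
    for (int i = 0; i < n; i++)
        sink += i;
}
```

To time a filter, the body of `spin` would be replaced by one call to the routine under test, with all buffers set up outside the timed region.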

I tested both functions, "ne10_fir_float_neon()" and "ne10_fir_float_c()", and expected the NEON assembly version to be faster than the C version. To my surprise, I seem to get better results with the plain C version. I checked with different block sizes and filter lengths, but in all my tests the C-only version has a smaller cycle count.
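
For orientation, the work being timed is one multiply-accumulate per tap for every output sample. The following is a plain direct-form FIR sketch, not NE10's implementation; the function name `fir_reference` and its separate `history` argument are hypothetical and do not match NE10's state-buffer layout.

```c
#include <stddef.h>

/* Direct-form FIR: y[n] = sum over k of h[k] * x[n-k], k = 0 .. num_taps-1.
 * `history` holds the num_taps-1 input samples that precede the block,
 * oldest first, so history[num_taps-2] is x[-1]. */
static void fir_reference(const float *coeffs, size_t num_taps,
                          const float *history, const float *src,
                          float *dst, size_t block_size)
{
    for (size_t n = 0; n < block_size; n++) {
        float acc = 0.0f;
        for (size_t k = 0; k < num_taps; k++) {
            /* x[n-k] comes from the current block when n >= k,
             * otherwise from the samples before the block. */
            float x = (n >= k) ? src[n - k]
                               : history[num_taps - 1 - (k - n)];
            acc += coeffs[k] * x;
        }
        dst[n] = acc;
    }
}
```

The inner statement executes block_size * num_taps times per call, which is the denominator behind a "cycles per sample per tap" figure.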

For example, using a block size of 128 and 21 filter taps, I get these results:

using ne10_fir_float_neon(): average of 10212 cycles which is ~3.8 cycles per sample per tap

using ne10_fir_float_c(): average of 8436 cycles which is ~3.1 cycles per sample per tap
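
The per-sample-per-tap figures follow from dividing the cycle count by block size times tap count (128 * 21 = 2688). A small sketch of that arithmetic, with hypothetical helper names:

```c
/* Cycles divided by the number of multiply-accumulates in one call. */
static double cycles_per_sample_per_tap(double cycles,
                                        unsigned block_size,
                                        unsigned num_taps)
{
    return cycles / ((double)block_size * (double)num_taps);
}

/* Ratio of two cycle counts; values above 1.0 mean `a` needed more cycles. */
static double cycle_ratio(double a, double b)
{
    return a / b;
}
```

With the numbers above, 10212 / 2688 is about 3.80 and 8436 / 2688 is about 3.14, so the NEON version takes roughly 21% more cycles than the C version in this measurement.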

Is there a reason why the NEON version is slower than the C version on the Cortex-A7, and could that be different on another target, say the Cortex-A9? Or could there be something wrong with my measurements, and the NEON version should always be faster? Or is it only faster for specific block sizes and filter lengths?

Or maybe I did something wrong and have to activate NEON correctly? I called "ne10_init()", and "ne10_HasNEON()" returns "NE10_OK", so this should be fine...

---- More info ----

I'm using gcc version 4.9.2 on Raspbian Jessie. I tried several different compiler flags as well as the defaults. In all combinations the C version was faster than the NEON version. With some help from "ARM Cortex-A Processors and GCC Command Lines", I got the best results by using

"-mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard -ffast-math -funsafe-math-optimizations -funroll-loops -O3"

My measurements from this morning with these flags (block size 128 and 21 filter taps) are:

using ne10_fir_float_neon(): average of 10130 cycles which is ~3.8 cycles per sample per tap

using ne10_fir_float_c(): average of 8364 cycles which is ~3.1 cycles per sample per tap

SudarshanRaj commented 8 years ago

Hi Matthew,

We just started working on SIMD implementations of a few DSP algorithm blocks using ARM NEON (NE10 lib) and came across this post. Any idea on the status of this issue? Was it reproducible? If so, has it been resolved yet?

Thanks, Sud

ghost commented 8 years ago

After seeing this issue I benchmarked this on our development board. With an i.MX 7, I see the NEON implementation performing about 15% slower than the C implementation. We are sourcing the library from https://layers.openembedded.org/layerindex/recipe/46479/

Tom

projectNe10 / Ne10

FIR-Filter cycle counts: C-version faster than NEON-version #127