Open ricovangenugten opened 8 years ago
After some more investigation I found that the NEON implementation outputs increasingly large values, then starts outputting -Inf and +Inf, and eventually outputs NaN after some time.
Hi Rico,
Thanks for finding that! That's definitely a bug. The FIR filter should be stable. I will check it on my RPi when I'll have time.
73!
Andras
Hi Andras,
The RPi will use PARAMS_RASPI and therefore not the NEON implementation, right? Does the RPi even support NEON?
73, Rico PA3RVG
I developed those routines on the RPi 2. It does have NEON.
Let's suppose that the problem happens when we write zero to the accumulators at the beginning of the loop:
asm volatile(
" vmov.f32 q4, #0.0\n\t" //another way to null the accumulators
" vmov.f32 q5, #0.0\n\t"
I've just read that using the VMOV.F32 this way is only supported if the ARM SoC has VFPv3.
What is the output of cat /proc/cpuinfo
on your system? Do you have vpfv3
in the flags on the Odroid-U3?
Jup, it does.
If the Raspi has NEON then why isn't it enabled for raspi in the Makefile?
On do 29 sep. 2016 at 21:27, András Retzler notifications@github.com wrote:
I developed those routines on the RPi 2. It does have NEON. Maybe something is different from yours.
Let's suppose that the problem happens when we write zero to the accumulators at the beginning of the loop:
asm volatile( " vmov.f32 q4, #0.0\n\t" //another way to null the accumulators " vmov.f32 q5, #0.0\n\t"
I've just read http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CJAEFGHE.html that using the VMOV.F32 this way is only supported if the ARM SoC has VFPv3.
What is the output of cat /proc/cpuinfo on your system? Do you have vpfv3 in the flags?
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/simonyiszk/csdr/issues/15#issuecomment-250566592, or mute the thread https://github.com/notifications/unsubscribe-auth/AAonxCkKSaZg-xY_T0IYOOm5Lg86bLOYks5qvBEOgaJpZM4KJrGe .
Could it have sonething to do with the different amount of taps between the two implementations?
If the Raspi has NEON then why isn't it enabled for raspi in the Makefile?
It is. PARAMS_RASPI
is for the Raspberry Pi 1.
NEON is available in Raspberry Pi 2.
Ah ok, my bad. I'm curious if it works on your Pi using a higher sample rate. Let me know if you want me to test anything on the Odroid.
BTW, this is my cpuinfo, so the Oroid also has vfp and vfpv3.
`rico@odroid:~$ cat /proc/cpuinfo processor : 0 model name : ARMv7 Processor rev 0 (v7l) BogoMIPS : 3394.86 Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls CPU implementer : 0x41 CPU architecture: 7 CPU variant : 0x3 CPU part : 0xc09 CPU revision : 0
processor : 1 model name : ARMv7 Processor rev 0 (v7l) BogoMIPS : 3394.86 Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls CPU implementer : 0x41 CPU architecture: 7 CPU variant : 0x3 CPU part : 0xc09 CPU revision : 0
processor : 2 model name : ARMv7 Processor rev 0 (v7l) BogoMIPS : 3394.86 Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls CPU implementer : 0x41 CPU architecture: 7 CPU variant : 0x3 CPU part : 0xc09 CPU revision : 0
processor : 3 model name : ARMv7 Processor rev 0 (v7l) BogoMIPS : 3394.86 Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls CPU implementer : 0x41 CPU architecture: 7 CPU variant : 0x3 CPU part : 0xc09 CPU revision : 0
Hardware : ODROID-U2/U3 Revision : 0000 Serial : 0000000000000000`
I'm going to have a go at fixing it myself, should be a fun assembly exercise. Andras, could you explain why the NEON impl has a different amount of taps than the non-NEON impl? Is it because of alignment restrictions, and if so, what are they?
OK, I'm quite sure it has something to do with the zeroing of the accumulators quad_acciq
. When I memset them to 0 before the assembly code it works. Lower CPU usage now, but I bet zeroing in assembly is still faster.
OK, I'm quite sure it has something to do with the zeroing of the accumulators quadacciq.
Great!
could you explain why the NEON impl has a different amount of taps than the non-NEON impl?
The NEON instructions used here work on 4 pieces of 32 bit floats at once (called a quad register), so both the number of taps and the number of input samples have to be multiples of 4.
See my explanation of the algorithm over here: http://blog.sdr.hu/repos/Friedrichshafen-SDRA-2016-Talk-And-Paper/slides/index.html#/21 NEON starts from slide 26.
When I zero the accumulators in another (very indirect) way as suggested here using veor q4 q4
and veor q5 q5
it works perfectly for some reason.
Does this have any drawbacks as far as you can see?
I think it is a very good solution.
Disregard that solution, it doesn't work. Still silent audio. Same when memset()
ing quad_acciq
to zero. Note to self: actually listen to the audio before drawing conclusions. sigh
So, the zeroing is not the problem, but the filter is unstable for some reason.
Next clue: When I use the same taps calculated for the NEON implementation with the non-NEON implementation the problem also occurs. So it seems that the NEON implementation itself is innocent, it is the filter design for the NEON impl that breaks.
OK, I finally found and fixed the issue, see commit 12d7db8. I've created a pull request.
I encountered this issue when trying to run OpenWebRX on an Odroid-U3. When running at a sample rate of 0.25 Msps everything works fine, while at a sample rate of 2.048 Msps the audio would drop out, even though the CPU usage was very low.
After some investigation this seems to be an issue with fir_decimate_cc, it started outputting zeros with higher decimation rates (185 at 2.048 Msps) after about 1 second of operation while it works fine with lower rates (22 at 0.25 Msps). When not using the NEON optimized implementation of fir_decimate_cc by removing -DNEON_OPTS in PARAMS_NEON this issue does not arise.
Without the optimized implementation the Odroid-U3 uses 22% CPU when sampling at 2.048 Msps with 1 client connected, so this workaround is (pretty much) usable.
This issue could of course be specific to the Odroid-U3, please report if anyone has the current NEON-implementation working using another board/cpu with NEON FPU, then the solution is to use the workaround when using an Odroid-U3. :)