ha7ilm / csdr

A simple DSP library and command-line tool for Software Defined Radio.
523 stars 171 forks source link

Bug in fir_decimate_cc NEON implementation at higher decimation rates #15

Open ricovangenugten opened 8 years ago

ricovangenugten commented 8 years ago

I encountered this issue when trying to run OpenWebRX on an Odroid-U3. When running at a sample rate of 0.25 Msps everything works fine, while at a sample rate of 2.048 Msps the audio would drop out, even though the CPU usage was very low.

After some investigation this seems to be an issue with fir_decimate_cc, it started outputting zeros with higher decimation rates (185 at 2.048 Msps) after about 1 second of operation while it works fine with lower rates (22 at 0.25 Msps). When not using the NEON optimized implementation of fir_decimate_cc by removing -DNEON_OPTS in PARAMS_NEON this issue does not arise.

Without the optimized implementation the Odroid-U3 uses 22% CPU when sampling at 2.048 Msps with 1 client connected, so this workaround is (pretty much) usable.

This issue could of course be specific to the Odroid-U3, please report if anyone has the current NEON-implementation working using another board/cpu with NEON FPU, then the solution is to use the workaround when using an Odroid-U3. :)

ricovangenugten commented 8 years ago

After some more investigation I found that the NEON implementation outputs increasingly large values, then starts outputting -Inf and +Inf, and eventually outputs NaN after some time.

ha7ilm commented 8 years ago

Hi Rico,

Thanks for finding that! That's definitely a bug. The FIR filter should be stable. I will check it on my RPi when I'll have time.

73!

Andras

ricovangenugten commented 8 years ago

Hi Andras,

The RPi will use PARAMS_RASPI and therefore not the NEON implementation, right? Does the RPi even support NEON?

73, Rico PA3RVG

ha7ilm commented 8 years ago

I developed those routines on the RPi 2. It does have NEON.

Let's suppose that the problem happens when we write zero to the accumulators at the beginning of the loop:

        asm volatile(
            "       vmov.f32 q4, #0.0\n\t" //another way to null the accumulators
            "       vmov.f32 q5, #0.0\n\t"

I've just read that using the VMOV.F32 this way is only supported if the ARM SoC has VFPv3.

What is the output of cat /proc/cpuinfo on your system? Do you have vpfv3 in the flags on the Odroid-U3?

ricovangenugten commented 8 years ago

Jup, it does.

If the Raspi has NEON then why isn't it enabled for raspi in the Makefile?

On do 29 sep. 2016 at 21:27, András Retzler notifications@github.com wrote:

I developed those routines on the RPi 2. It does have NEON. Maybe something is different from yours.

Let's suppose that the problem happens when we write zero to the accumulators at the beginning of the loop:

    asm volatile(
        "       vmov.f32 q4, #0.0\n\t" //another way to null the accumulators
        "       vmov.f32 q5, #0.0\n\t"

I've just read http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CJAEFGHE.html that using the VMOV.F32 this way is only supported if the ARM SoC has VFPv3.

What is the output of cat /proc/cpuinfo on your system? Do you have vpfv3 in the flags?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/simonyiszk/csdr/issues/15#issuecomment-250566592, or mute the thread https://github.com/notifications/unsubscribe-auth/AAonxCkKSaZg-xY_T0IYOOm5Lg86bLOYks5qvBEOgaJpZM4KJrGe .

ricovangenugten commented 8 years ago

Could it have sonething to do with the different amount of taps between the two implementations?

ha7ilm commented 8 years ago

If the Raspi has NEON then why isn't it enabled for raspi in the Makefile?

It is. PARAMS_RASPI is for the Raspberry Pi 1.

NEON is available in Raspberry Pi 2.

ricovangenugten commented 8 years ago

Ah ok, my bad. I'm curious if it works on your Pi using a higher sample rate. Let me know if you want me to test anything on the Odroid.

ricovangenugten commented 8 years ago

BTW, this is my cpuinfo, so the Oroid also has vfp and vfpv3.

`rico@odroid:~$ cat /proc/cpuinfo processor : 0 model name : ARMv7 Processor rev 0 (v7l) BogoMIPS : 3394.86 Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls CPU implementer : 0x41 CPU architecture: 7 CPU variant : 0x3 CPU part : 0xc09 CPU revision : 0

processor : 1 model name : ARMv7 Processor rev 0 (v7l) BogoMIPS : 3394.86 Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls CPU implementer : 0x41 CPU architecture: 7 CPU variant : 0x3 CPU part : 0xc09 CPU revision : 0

processor : 2 model name : ARMv7 Processor rev 0 (v7l) BogoMIPS : 3394.86 Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls CPU implementer : 0x41 CPU architecture: 7 CPU variant : 0x3 CPU part : 0xc09 CPU revision : 0

processor : 3 model name : ARMv7 Processor rev 0 (v7l) BogoMIPS : 3394.86 Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls CPU implementer : 0x41 CPU architecture: 7 CPU variant : 0x3 CPU part : 0xc09 CPU revision : 0

Hardware : ODROID-U2/U3 Revision : 0000 Serial : 0000000000000000`

ricovangenugten commented 8 years ago

I'm going to have a go at fixing it myself, should be a fun assembly exercise. Andras, could you explain why the NEON impl has a different amount of taps than the non-NEON impl? Is it because of alignment restrictions, and if so, what are they?

ricovangenugten commented 8 years ago

OK, I'm quite sure it has something to do with the zeroing of the accumulators quad_acciq. When I memset them to 0 before the assembly code it works. Lower CPU usage now, but I bet zeroing in assembly is still faster.

ha7ilm commented 8 years ago

OK, I'm quite sure it has something to do with the zeroing of the accumulators quadacciq.

Great!

could you explain why the NEON impl has a different amount of taps than the non-NEON impl?

The NEON instructions used here work on 4 pieces of 32 bit floats at once (called a quad register), so both the number of taps and the number of input samples have to be multiples of 4.

See my explanation of the algorithm over here: http://blog.sdr.hu/repos/Friedrichshafen-SDRA-2016-Talk-And-Paper/slides/index.html#/21 NEON starts from slide 26.

ricovangenugten commented 8 years ago

When I zero the accumulators in another (very indirect) way as suggested here using veor q4 q4 and veor q5 q5 it works perfectly for some reason.

Does this have any drawbacks as far as you can see?

ha7ilm commented 8 years ago

I think it is a very good solution.

ricovangenugten commented 8 years ago

Disregard that solution, it doesn't work. Still silent audio. Same when memset()ing quad_acciq to zero. Note to self: actually listen to the audio before drawing conclusions. sigh

So, the zeroing is not the problem, but the filter is unstable for some reason.

ricovangenugten commented 8 years ago

Next clue: When I use the same taps calculated for the NEON implementation with the non-NEON implementation the problem also occurs. So it seems that the NEON implementation itself is innocent, it is the filter design for the NEON impl that breaks.

ricovangenugten commented 8 years ago

OK, I finally found and fixed the issue, see commit 12d7db8. I've created a pull request.