fast_sin() is super slow

GoogleCodeExporter commented 9 years ago

fast_sin() uses a modified version of the algorithm by Nicolas Capens, but
the original version performs better!

I ran 10000000 iterations, and found:
fast_sin: 2.44414s
original: 0.939018s

Generated code:
* For "fast_sin":

        fmsr    s13, r0
        stmfd   sp!, {r7, lr} 
        fcvtds  d7, s13 
        fldd    d6, L17 
        fmuld   d7, d7, d6
        add     r7, sp, #0
        fstmfdx sp!, {d8}
        fcvtsd  s16, d7
        fmrs    r0, s16 
        bl      L_lroundf$stub
        fmsr    s14, r0 @ int 
        fsitos  s15, s14 
        fsubs   s13, s16, s15 
        flds    s14, L17+8      @ int 
        flds    s15, L17+12     @ int 
        fabss   s12, s13 
        fmacs   s14, s12, s15 
        fmuls   s15, s13, s14 
        fcpys   s13, s15 
        fabss   s14, s15 
        fmscs   s13, s14, s15 
        flds    s14, L17+16     @ int 
        tst     r0, #1
        fmacs   s15, s13, s14 
        fnegsne s15, s15 
        fmrs    r0, s15 
        sub     sp, r7, #12 
        fldmfdx sp!, {d8}
        sub     sp, r7, #0
        ldmfd   sp!, {r7, pc} 

L17:
        .long   1841940611
        .long   1070882608
        .long   1082130432
        .long   -1065353216
        .long   1046898278

* For the original version:

        fmsr    s15, r0
        flds    s14, L4 @ int
        fabss   s13, s15
        flds    s15, L4+4       @ int
        fnmacs  s14, s13, s15
        fmsr    s15, r0
        fmuls   s13, s15, s14
        flds    s15, L4+8       @ int
        flds    s14, L4+12      @ int
        fabss   s12, s13
        fmacs   s14, s12, s15
        fmuls   s15, s13, s14
        fmrs    r0, s15
        bx      lr

        .long   1067645315
        .long   1053786491
        .long   1046898278
        .long   1061578342
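
The .long values in both listings are IEEE-754 bit patterns. A small helper along these lines can decode them; the function names are hypothetical, and the double decoding assumes the low word is stored first, as it appears to be in the L17 table.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret one 32-bit word as a single-precision float. */
static float word_to_float(int32_t word) {
    float f;
    memcpy(&f, &word, sizeof f);
    return f;
}

/* Hypothetical helper: combine two words (low word first) into a double. */
static double words_to_double(uint32_t lo, uint32_t hi) {
    uint64_t bits = ((uint64_t)hi << 32) | lo;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* fast_sin table:
 *   (1841940611, 1070882608) -> about 0.31830989 (1/pi, as a double)
 *   1082130432  ->  4.0f
 *   -1065353216 -> -4.0f
 *   1046898278  ->  0.225f
 * original table:
 *   1067645315  -> about 1.2732395 (4/pi)
 *   1053786491  -> about 0.4052847 (4/pi^2)
 *   1046898278  ->  0.225f
 *   1061578342  ->  0.775f
 */
```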

* Function body for the original:

static float sin_kernel(float x, float a, float b) {
        x = (1.27323954473516268615f - .40528473456935108577f*fabsf(x))*x;
        return x*(0.225f*fabsf(x) + 0.775f);
}
(using fabsf() makes gcc emit the fabss instruction inline rather than a call)

* CFLAGS: -O3 -fomit-frame-pointer -fstrict-aliasing -marm -march=armv6
-mcpu=arm1176jzf-s -mfloat-abi=softfp -mfpu=vfp

Original issue reported on code.google.com by julien.c...@gmail.com on 7 Apr 2009 at 2:32

GoogleCodeExporter commented 9 years ago

Erratum: Function body for the original:

static inline float sin_original(float x) {
        x = (1.27323954473516268615f - .40528473456935108577f*fabsf(x))*x;
        return x*(0.225f*fabsf(x) + 0.775f);
}

(bad copy/paste, sorry)

Original comment by julien.c...@gmail.com on 7 Apr 2009 at 2:34

GoogleCodeExporter commented 9 years ago

The actual fast_sin() routine was written as a reference to test the vector
implementation, vsinf, against. No attempt at optimisation was made (it probably
should have been called something other than fast_sin). Note that fast_sin()
also performs range reduction, which your code omits. It looks to me like abs()
is resulting in fabss instructions, and that round() is calling out into library
code, which is probably the worst performance offender.
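
As a rough illustration of range reduction without a library call, the sketch below wraps the kernel posted earlier in this thread. sin_reduced() is a hypothetical name, not part of the library. The rounding is done with a truncating cast, which assumes x / (2*pi) fits in an int and does not handle NaN or infinities.

```c
#include <math.h>

/* Kernel as posted earlier in the thread; accurate for x in [-pi, pi]. */
static inline float sin_original(float x) {
    x = (1.27323954473516268615f - .40528473456935108577f * fabsf(x)) * x;
    return x * (0.225f * fabsf(x) + 0.775f);
}

/* Hypothetical wrapper: reduce x to [-pi, pi] inline, then call the kernel. */
static inline float sin_reduced(float x) {
    const float inv_two_pi = 0.15915494309189533577f;
    const float two_pi = 6.28318530717958647693f;
    float t = x * inv_two_pi;
    /* Round to nearest by adding +/-0.5 and truncating; no lroundf() call. */
    int n = (int)(t + (t >= 0.0f ? 0.5f : -0.5f));
    return sin_original(x - (float)n * two_pi);
}
```

Whether the cast compiles to a single VFP conversion on a given target is worth checking in the generated assembly.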

You might also want to try -ftree-vectorize in your CFLAGS.

Original comment by damien.m...@gmail.com on 7 Apr 2009 at 3:50

GoogleCodeExporter commented 9 years ago

I tried to use -ftree-vectorize, but it does nothing, since VFP is currently not
supported by the tree vectorizer (only NEON is, AFAIK): SIMD_UNITS_PER_WORD is
not defined on this target.

Original comment by julien.c...@gmail.com on 7 Apr 2009 at 4:03

GoogleCodeExporter commented 9 years ago

My understanding is that the SIMD_UNITS_PER_WORD message is a warning, and
results in a vector size of 1 being chosen. I have seen performance improvements
from turning -ftree-vectorize on, but on larger bodies of code, and I have
spoken with others who swear by it. I can't say I have examined the asm output
closely enough to tell what the difference is.

Original comment by damien.m...@gmail.com on 7 Apr 2009 at 7:28

GoogleCodeExporter commented 9 years ago

What difference is there between a scalar and a vector whose size is 1?
Turning it on/off produces the exact same assembly, in my tests.

Original comment by julien.c...@gmail.com on 7 Apr 2009 at 8:57
