optimisation suggestion

GoogleCodeExporter commented 9 years ago

In the Vector-Matrix multiply code, an 8-cycle pipeline latency isn't being exploited between the last fadds and the fstmias. While waiting for the result of the fadds operation, you can load the next vector and do the rest of the housekeeping. You will need to be in RunFast mode to do this, otherwise the s0-s2 registers will be source-locked. In addition, it's worth throwing a PLD [%0, #32] instruction in there to preload the next cache line.

The main question is: are you prepared to use RunFast mode? If not, you can unroll the loop and load into the s3-s5 registers.

My guess is that the loop would look something like this:

                VFP_VECTOR_LENGTH(3)
                "fldmias  %0, {s0-s2}      \n\t"
                "adds %0, %0, %3           \n\t"
                "subs %1, %1, %4           \n\t"

                "L2000:                    \n\t"

                // First column times matrix.
                "fmuls s24, s8, s0         \n\t"
                "fmacs s24, s12, s1        \n\t"
                "fmacs s24, s16, s2        \n\t"
                "fadds s24, s24, s20       \n\t"

                "fldmias  %0, {s0-s2}      \n\t"
                "adds %0, %0, %3           \n\t"
                "pld [%0, #32]             \n\t"
                "adds %1, %1, %4           \n\t"
                "subs %5, %5, #1           \n\t"

                // Save vector.
                "fstmias  %1, {s24-s27}   \n\t" 

                "bne L2000                 \n\t"

                VFP_VECTOR_LENGTH_ZERO

Original issue reported on code.google.com by damien.m...@gmail.com on 12 Mar 2009 at 7:55
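
To make the scheduling idea above easier to follow, here is a minimal sketch in plain C rather than VFP assembly. It assumes a column-major 4x4 matrix (matching the s8-s11, s12-s15, s16-s19, s20-s23 register layout in the snippet) and 3-component input vectors producing 4-component outputs. The function name `transform_vec3_pipelined` and the `PREFETCH` macro are hypothetical, introduced only for illustration; the macro maps to the GCC/Clang `__builtin_prefetch` builtin where available and to nothing elsewhere. A C compiler is free to reorder these statements, so this shows the intended order of work, not a guaranteed instruction schedule.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefetch wrapper: a hint only, never required for correctness. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p))
#else
#define PREFETCH(p) ((void)(p))
#endif

/* Transform `count` 3-component vectors by a column-major 4x4 matrix `m`,
 * writing 4 floats per vector to `dst`. The next vector is loaded after the
 * arithmetic for the current one has been issued and before its result is
 * stored, mirroring the fadds -> (load, housekeeping) -> fstmias ordering. */
static void transform_vec3_pipelined(const float *m, const float *src,
                                     float *dst, size_t count)
{
    assert(m != NULL && src != NULL && dst != NULL);
    if (count == 0)
        return;

    /* Prologue: load the first vector before entering the loop. */
    float x = src[0], y = src[1], z = src[2];

    for (size_t i = 0; i < count; ++i) {
        /* Issue the arithmetic for the current vector. */
        float r0 = m[0] * x + m[4] * y + m[8]  * z + m[12];
        float r1 = m[1] * x + m[5] * y + m[9]  * z + m[13];
        float r2 = m[2] * x + m[6] * y + m[10] * z + m[14];
        float r3 = m[3] * x + m[7] * y + m[11] * z + m[15];

        /* While those results are notionally in flight, fetch the next vector. */
        if (i + 1 < count) {
            size_t next = 3 * (i + 1);
            if (next + 8 < 3 * count)
                PREFETCH(&src[next + 8]); /* 8 floats = 32 bytes ahead */
            x = src[next + 0];
            y = src[next + 1];
            z = src[next + 2];
        }

        /* Store last, once the result would be available. */
        dst[4 * i + 0] = r0;
        dst[4 * i + 1] = r1;
        dst[4 * i + 2] = r2;
        dst[4 * i + 3] = r3;
    }
}
```

Calling it with a matrix whose columns are (2,0,0,0), (0,3,0,0), (0,0,4,0) and (1,1,1,1) maps the vector (1,2,3) to (3,7,13,1). Whether the reordering pays off on the real hardware still has to be measured.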

GoogleCodeExporter commented 9 years ago

This CPU uses TrustZone, so PLD in user code is equivalent to a NOP.

Original comment by julien.c...@gmail.com on 7 Apr 2009 at 2:46

GoogleCodeExporter commented 9 years ago

Also, RunFast is the default on iPhoneOS, as on QNX.

Original comment by julien.c...@gmail.com on 7 Apr 2009 at 2:46

GoogleCodeExporter commented 9 years ago

I thought the same about PLD instructions (it's what it says in the manual), but others have reported performance increases when using it. It seems odd that the security subsystem would prevent preloading cache lines. This warrants testing.

Are you sure about RunFast on iPhone? What's your source for that?

Original comment by damien.m...@gmail.com on 7 Apr 2009 at 3:36

GoogleCodeExporter commented 9 years ago

My reading of the ARM Architecture Reference Manual ( http://www.arm.com/miscPDFs/14128.pdf ), pages C2-26 and C2-10, is that the cumulative exception bits are there to be read, to see whether an exception has occurred since you last cleared them. My understanding is that clearing them is not required to get into RunFast mode; you just need to clear the enable bits, which are already cleared by default. So, based solely on the material I read, I agree with comment 2 by Julien.

Original comment by ala...@gmail.com on 3 May 2009 at 5:09
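
The distinction between the cumulative bits and the enable bits can be sketched as a check on an FPSCR value. This is a hedged illustration, not code from the library: the helper names are hypothetical, and the bit positions are my understanding of the VFP FPSCR layout (DN at bit 25, FZ at bit 24, enable bits IDE at 15 and IXE/UFE/OFE/DZE/IOE at 12-8, cumulative bits IDC at 7 and IXC/UFC/OFC/DZC/IOC at 4-0). It also assumes, per my reading of the VFP11 documentation, that RunFast means flush-to-zero and default-NaN are both enabled in addition to all exception enables being clear, which is worth verifying on the target.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed FPSCR bit layout (see the lead-in; verify against the manual). */
#define FPSCR_DN              (1u << 25)                /* default NaN mode   */
#define FPSCR_FZ              (1u << 24)                /* flush-to-zero mode */
#define FPSCR_ENABLE_MASK     ((1u << 15) | (0x1Fu << 8)) /* IDE, IXE..IOE    */
#define FPSCR_CUMULATIVE_MASK ((1u << 7) | 0x1Fu)         /* IDC, IXC..IOC    */

/* True if the given FPSCR value describes RunFast mode under the
 * assumptions above. The cumulative bits are deliberately ignored. */
static bool fpscr_is_runfast(uint32_t fpscr)
{
    return (fpscr & FPSCR_DN) != 0
        && (fpscr & FPSCR_FZ) != 0
        && (fpscr & FPSCR_ENABLE_MASK) == 0;
}

/* Return a copy of `fpscr` adjusted for RunFast: set DN and FZ, clear the
 * exception enables, and leave every other field (including the cumulative
 * bits and the vector length/stride) untouched. */
static uint32_t fpscr_make_runfast(uint32_t fpscr)
{
    return (fpscr | FPSCR_DN | FPSCR_FZ) & ~FPSCR_ENABLE_MASK;
}
```

On the device itself the value would come from reading the FPSCR (for example with an `fmrx` instruction in inline assembly), which is the simplest way to settle what mode the OS actually leaves user code in.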

jamesmintram / vfpmathlibrary

optimisation suggestion #2