Open GoogleCodeExporter opened 9 years ago
There are only 2 people on this project at the minute since it's very new. I've
implemented all the 32-bit stuff so far, but I only have so much time to work
on it amongst full-time work many other projects and commitments. If you want
better/faster algorithms and are interested in implementing them then I'd be
happy to give you committer rights.
As the division goes, division will always be a slow operation, there aren't
that many optimizations I can think to make to the algorithm I used (not
without assembler anyway) but that doesn't mean that there isn't a
fundamentally better algorithm out there that could be used.
The inaccuracy must be due to a small error somewhere in the code, the fact
that it's such a small error indicates that it's likely to be a rounding error
or an error in some extreme case where the result overflows.
Original comment by Flatmush@googlemail.com
on 3 Mar 2011 at 3:35
I'd be interested to hear what you're using the library for birdland (if you
can tell me), I might be able to help you achieve what you want better that way
with the limited time I have to work on this project.
Original comment by Flatmush@googlemail.com
on 3 Mar 2011 at 3:45
hi
I need these fixed point functions in the baseband processing of some wireless
communication protocol such as WLAN channel estimation. Timing rquirements are
very strict. Please run the following code and see how many cycles division is
taking. Please which processor you are runing on. Please tell me also what
value of d you get ?
#include <stdio.h>
#include <stdlib.h>
typedef __int32_t fix16_t;
fix16_t fix16_div(fix16_t inArg0, fix16_t inArg1);
fix16_t fix16_mul(fix16_t inArg0, fix16_t inArg1);
volatile int d;
int main(){
int x = 5063265; //77.2593;
int y = -6338360;//-96.7157;
int a;
d = fix16_div(x, y);
printf("%d \n",d);
//printf("%d \n",a);
}
fix16_t fix16_div(fix16_t inArg0, fix16_t inArg1) {
int neg = ((inArg0 < 0) != (inArg1 < 0));
inArg0 = (inArg0 < 0 ? -inArg0 : inArg0);
inArg1 = (inArg1 < 0 ? -inArg1 : inArg1);
while(((inArg0 | inArg1) & 1) == 0) {
inArg0 >>= 1;
inArg1 >>= 1;
}
__uint32_t r_hi = (inArg0 / inArg1);
__uint32_t n_lo = (inArg0 % inArg1);
__uint32_t n_hi = (n_lo >> 16);
n_lo <<= 16;
__uint32_t i, arg;
for(i = 1, arg = inArg1; ((n_lo | arg) & 1) == 0; i <<= 1) {
n_lo = ((n_lo >> 1) | (n_hi << 31));
n_hi = (n_hi >> 1);
arg >>= 1;
}
__uint32_t res = 0;
if(n_hi) {
__uint32_t arg_lo, arg_hi;
for(arg_lo = inArg1; (arg_lo >> 31) == 0; arg_lo <<= 1, i <<= 1);
for(arg_hi = (arg_lo >> 31), arg_lo <<= 1, i <<= 1; arg_hi < n_hi; arg_hi = (arg_hi << 1) | (arg_lo >> 31), arg_lo <<= 1, i <<= 1);
do {
arg_lo = (arg_lo >> 1) | (arg_hi << 31);
arg_hi = (arg_hi >> 1);
i >>= 1;
if(arg_hi < n_hi) {
n_hi -= arg_hi;
if(arg_lo > n_lo)
n_hi--;
n_lo -= arg_lo;
res += i;
} else if((arg_hi == n_hi) && (arg_lo <= n_lo)) {
n_hi -= arg_hi;
n_lo -= arg_lo;
res += i;
}
} while(n_hi);
}
res += (n_lo / inArg1);
#ifndef FIXMATH_NO_ROUNDING
if((n_lo % inArg1) >= (inArg1 >> 1))
res++;
#endif
res += (r_hi << 16);
return (neg ? -res : res);
}
fix16_t fix16_mul(fix16_t inArg0, fix16_t inArg1) {
__int16_t hi[2] = { (inArg0 >> 16), (inArg1 >> 16) };
__uint16_t lo[2] = { (inArg0 & 0xFFFF), (inArg1 & 0xFFFF) };
__int32_t r_hi = hi[0] * hi[1];
__int32_t r_md = (hi[0] * lo[1]) + (hi[1] * lo[0]);
__uint32_t r_lo = lo[0] * lo[1];
#ifndef FIXMATH_NO_ROUNDING
r_lo += 0xFFFF;
#endif
r_md += (r_hi & 0xFFFF) << 16;
r_md += (r_lo >> 16);
return r_md;
}
Original comment by birdla...@gmail.com
on 3 Mar 2011 at 5:04
Running the code I get -49552. This is using gcc 4.3.1. I don't have time
right now to look into the excecution speed, but ill take a look later today
Original comment by joe.scha...@gmail.com
on 3 Mar 2011 at 5:35
Yes I get the same value but the accurate one is -52352. cycle count seems very
high for this code.
let see what you get for cycle count
Original comment by birdla...@gmail.com
on 3 Mar 2011 at 6:29
How are you measuring cycle count
Original comment by joe.scha...@gmail.com
on 3 Mar 2011 at 6:36
Birdland, are you targetting ARM, if so then we could make a much faster
assembler optimized version, especially if it has a clz instruction.
This implementation is very generic and aimed at accuracy, there's a similar
divide function to divide a 64-bit number by a 32-bit number in the newly
submitted int64.h header which may work better, though it is largely the same.
Original comment by Flatmush@googlemail.com
on 3 Mar 2011 at 7:52
Ok so I have never run a profiler before so I am not entirely sure what I am
doing here, but here it goes.
so as a test I took birdlands code and copied it into fix16.c appending an _b
to the function name.
The standard fix16_div is using 64 bit math.
I threw in mul for good measure
I modified the main code to be:
int x = 5063265; //77.2593;
int y = -6338360;//-96.7157;
d = fix16_div_b(x, y);
k = fix16_div(x,y);
t = fix16_mul(x,y);
printf("%d \n",d);
printf("%d \n",k);
printf("%d \n",t);
return 0;
compiled it with gcc 4.2.1 with the -g option. Then used valgrind to look at
the number of instructions per function. Here is what I came up with.
#############################################################################
Results for Core Duo @ 2.0Ghz(32 bit)
valgrind --tool=callgrind --toggle-collect=fix16_div_b ./main
No optimization = 267
O3 = 198
Os = 209
valgrind --tool=callgrind --toggle-collect=fix16_div ./main
No optimization = 2341
O3 = 2334
Os = 2334
valgrind --tool=callgrind --toggle-collect=fix16_mul ./main
No optimization = 56
O3 = 10
Os = 10
#############################################################################
Results for Core 2 Duo @ 2Ghz(64 bit)
valgrind --tool=callgrind --toggle-collect=fix16_div_b ./main
No optimization = 268
O3 = 188
Os = 200
valgrind --tool=callgrind --toggle-collect=fix16_div ./main
No optimization = 23
O3 = 14
Os = 14
valgrind --tool=callgrind --toggle-collect=fix16_mul ./main
No optimization = 18
O3 = 10
Os = 10
#############################################################################
Results for Arm Cortex M0 @ 48 Mhz (32 bit)
(this uses a timer, which is running at the system clock speed) since it is
single threaded
it always runs in the same amount of time.
Also I only compiled with Os, because the code grows too large without it.
also keep in mind that this processor has NO division instruction. So some
magic is happening
which is probably why the 32bit and 64 bit functions are similar
fix16_div_b
Os = 880
fix16_div
Os = 1104
fix16_mul
Os = 145
#############################################################################
Wow. So the fix16_div routine is very slow for 32 bit systems even if the
compiler can handle 64 bit numbers
The ARM doesnt take as large of a hit, but that is because the div_b is slow as
it is, mostly due to the lack of a division routine
Hope all of this makes sense...
Original comment by joe.scha...@gmail.com
on 4 Mar 2011 at 2:26
It's expected that 64-bit division would be slow (though concrete results are
very nice to have) because there is no trivial way to perform division unlike
multiplication.
I think that the algorithm I wrote could do with some work, but I'm not sure
there is going to be a very fast way of doing it.
Original comment by Flatmush@googlemail.com
on 4 Mar 2011 at 9:27
I've found a good way to do fixed point division on arm:
http://me.henri.net/fp-div.html
It'll be very useful for you birdland, I don't want to steal it to put it in
the SDK though so I'm going to have to contact him and check if he wants to
join in or if he's happy for his work to go into our project.
Original comment by Flatmush@googlemail.com
on 4 Mar 2011 at 12:27
Well we have Henri's permission (and encouragement) to use his code as part of
our library, so I'll add it when I get time (unless one of you want to).
Henri:
"By all means please use the code. It took a lot of work and research to figure
out, and at the time, I didn't want it all to go to waste - so I wrote it all
down in that article - but no-one was interested in publishing it at the time.
For this reason I would be very happy to see it part of an open source library.
That being said, I don't do ARM work any more - I don't even have the build
tools installed, so if you can use or adapt the code 'as-is' please let that be
my contribution."
Original comment by Flatmush@googlemail.com
on 4 Mar 2011 at 2:09
What I did with my application now is that I saw the range of my numbers in my
application and saw that 2^12 gives me 3 decimal places precision. So at the
moment just using simple integer division with precision upto 3 decimal places.
input1 = A*2^12;
input2 = B*2^12;
int div(int input1, int input2){
input1 = input1 << 12;
return input1/input2;
}
I need to keep track of my precision in my application.
Now I am stuck again in sqrt :( looking for some faster implementation.
Original comment by birdla...@gmail.com
on 4 Mar 2011 at 2:10
currently using the following sqrt. I have not tested its speed but it should
be varying depending on the input value. Any one knows more faster approach ?
int isqrt32 (int value)
{
int g = 0;
int bshft = 12;
int b = 1<<bshft;
do {
int temp = (g+g+b)<<bshft;
if (value >= temp) {
g += b;
value -= temp;
}
b>>=1;
} while (bshft--);
return g;
}
Original comment by birdla...@gmail.com
on 4 Mar 2011 at 2:57
With sqrt functions the faster approaches tend to be approximations, but this
is meant to be an accurate library (nothing saying we can't use macros and have
both though). Implementing the function in 32-bit or using ARM assembler would
be a good bet, though I'm thinking this sqrt stuff should be a new issue.
Original comment by Flatmush@googlemail.com
on 4 Mar 2011 at 3:10
Birdland are you using an ARM. If so I have an assembly fixed point division
algorithm that I know will run on an ARM thumb architecture and I think should
run on any other ARM architecture.
I will commit it but would like someone with something other than what I have
test it out
fix16_t fix16_div(fix16_t inArg0, fix16_t inArg1)
{
register int32_t quotient;
asm(
"num .req %[inArg0] @ Map Register Equates\n\t"
"den .req %[inArg1]\n\t"
"mod .req r2\n\t"
"cnt .req r3\n\t"
"quo .req r4\n\t"
"notden .req r12\n\t"
"sign .req r11\n\t"
"@ Make sure that we are not dividing by zero\n\t"
"cmp den, #0\n\t"
"bne .skip_div0\n\t"
"@ If we are dividing by zero return all -1\n\t"
"mov num, #1\n\t"
"neg num, num\n\t"
"b .quit\n\t"
".skip_div0:\n\t"
"@ Store the sign in the sign register for later\n\t"
"@ Sign is computed by exoring the numerator and denominator\n\t"
"@ and checking if the result is positive or negative\n\t"
"movs mod, num\n\t"
"eor mod,mod,den\n\t"
"movs sign, mod\n\t"
"@ Negate the denominator if necessary\n\t"
"@ The algorithm requires the denominator to be signed negative\n\t"
"cmp den, #0\n\t"
"bmi .skip_negate_den\n\t"
"neg den,den\n\t"
".skip_negate_den:\n\t"
"@ Negate the numerator if necessary\n\t"
"@ The algoright requires the numerator to be signed positive\n\t"
"cmp num, #0\n\t"
"bpl .skip_negate_num\n\t"
"neg num,num\n\t"
".skip_negate_num:\n\t"
"@ Begin the actual division algorithm\n\t"
"neg mod, den\n\t"
"movs notden, mod\n\t"
"movs mod, #0\n\t"
"add num, num, num\n\t"
".rept 48\n\t"
" adc mod, mod, mod\n\t"
" add mod, den, mod\n\t"
" bcs .+4\n\t"
" add mod, mod, notden\n\t"
" adc quo, quo, quo\n\t"
" add num, num, num\n\t"
".endr\n\t"
"movs num,quo\n\t"
"@ Check the sign to determine if we need to negate the answer\n\t"
"movs quo,sign\n\t"
"cmp quo,#0\n\t"
"bpl .skip_negate_quo\n\t"
"neg num,num\n\t"
".skip_negate_quo:\n\t"
".quit:\n\t"
/* output registers */
: [quotient] "=r" (quotient)
/* input registers */
: [inArg0] "0" (inArg0), [inArg1] "r" (inArg1)
/* clobbered registers */
: "cc", "r2", "r3", "r4", "r11", "r12");
return quotient;
}
Original comment by joe.scha...@gmail.com
on 22 Jun 2011 at 12:20
It would be interesting to see code run on multiple flavors of ARM Cortex-M
family: Cortex-M0 vs Cortex-M3 vs Cortex-M4 vs Cortex-M4F. See instruction
differences at http://en.wikipedia.org/wiki/ARM_Cortex-M#Overview
I added links to this project at
http://en.wikipedia.org/wiki/STM32#Development_tools
http://en.wikipedia.org/wiki/ARM_Cortex-M#Development_tools
Keep up the good work!
http://en.wikipedia.org/wiki/User:Sbmeirow
Original comment by sbmeirow
on 10 Dec 2011 at 6:19
Comparing cores does not make sense ...
Moreover, M3, M4 and M4F are exactly identical excepting DSP (+FPU) on M4(F)
which is not used by libfixmath since we're working on integer operations ...
you will measure exactly the same timings.
Nevertheless, I agree it could be interesting to compare toolchains.
Best regards,
Vincent.
Original comment by vincent....@gmail.com
on 10 Dec 2011 at 7:21
The Cortex-M0 / M0+ / M1 only has 32-bit multiply instructions with a
lower-32-bit result (32bit x 32bit = 32bit),
where as the Cortex-M3 / M4 / M7 includes additional 32-bit multiply
instructions with 64-bit results (32bit x 32bit = 64bit).
The Cortex-M4 / M7 also include DSP instructions for (16bit x 16bit) and (32bit
x 16bit) multiplication.
Original comment by sbmeirow
on 19 Feb 2015 at 9:38
Original issue reported on code.google.com by
birdla...@gmail.com
on 3 Mar 2011 at 3:23