Open cjdelisle opened 5 years ago
A couple of comments about posit_add_exact:
static int comparator(const void *pa, const void *pb) {
    /* cast away const: the whole point is to mutate the elements in place */
    positN_t *a = (positN_t *)pa;
    positN_t *b = (positN_t *)pb;
    if (*a == 0 && *b == 0) { return 0; }
    if (*a == 0) { return -1; }
    if (*b == 0) { return 1; }
    positNx2_t res = posit_add_exact(*a, *b);
    *a = res.x; *b = res.y;
    return 1;
}
...
qsort(array, length, sizeof(positN_t), comparator);
After the qsort, the sum of the array is unchanged, but the array has been reduced to the smallest number of non-zero posits that still sum to the same value, ordered from nearest to zero to farthest from zero.
Is there a particular underlying dynamic that forces mul_promote to record its result in a posit with a different es value? The quires for posits with es and es+1 are not compatible, in the sense that their radix points differ.
In the C++ code, we keep the posit config of the add and multiply for quire accumulation the same as the posit arguments, but simply carry all the extra adder/multiplier bits.
The first reason is that I expect hardware to settle on posit<8,0>, posit<16,1>, posit<32,2>, and posit<64,3>, so converting a posit<8,0> to a posit<16,0> is not very useful because no hardware instruction will support it. With this rule, you can do posit<32,2> multiplication and posit<64,3> add_exact to implement a dot product with stateless hardware and no software manipulation of posits.
That is indeed a very strong argument.
Research questions to be answered:
1. Are all the samples of posit<nbits,es> contained inside posit<nbits*2,es+1>?
2. Are all the samples of unrounded add and mul in <nbits,es> contained inside <nbits*2,es+1>?
Good questions. I always consider that multiplication doubles the required bits. My reasoning about why this should work is:
If the regime size is larger than 2 bits, it should be simple to show that the resulting regime size is smaller than the sum of the regime sizes (is it half the sum?), so regimes larger than 2 bits should leave a larger number of bits available when multiplying.
For addition we should not care because add_exact() outputs 2 posits of the same size rather than doing a promotion.
disclaimer: I'm a programmer, not a mathematician; if this reasoning looks hacky, it's because it is...
Added arbitrary posit conversions to get a feel for the promotion. I was worried about the geometric region, but all is good.
With respect to add_exact/sub_exact: the regions between [0, minpos] and [maxpos, NaR] do not have a representation for r.
Posit TwoSum validation for posit<8,1>
geometric rounding region
a : 00000001 : +0.000244140625
b : 00000001 : +0.000244140625
s : 00000010 : +0.0009765625
aApprox = s - a : 00000010 : +0.0009765625
bApprox = s - aApprox : 00000000 : +0
aDiff = a - aApprox : 11111110 : -0.0009765625
bDiff = b - bApprox : 00000001 : +0.000244140625
r = aDiff + bDiff : 11111110 : -0.0009765625
s + r : 00000000 : +0
a + b : 00000010 : +0.0009765625
FAIL
a : 00000001 : +0.000244140625
b : 00000010 : +0.0009765625
s : 00000010 : +0.0009765625
aApprox = s - a : 00000000 : +0
bApprox = s - aApprox : 00000010 : +0.0009765625
aDiff = a - aApprox : 00000001 : +0.000244140625
bDiff = b - bApprox : 00000000 : +0
r = aDiff + bDiff : 00000001 : +0.000244140625
s + r : 00000010 : +0.0009765625
a + b : 00000010 : +0.0009765625
PASS
This implies that the operators add_exact/sub_exact need to special-case this situation.
@cjdelisle Is that possible for your code?
Can you push that in a branch so I can see what I'm looking at in the code? I'm trying to work out just where the problems are...
I am experimenting in the https://github.com/stillwater-sc/hpr-blas repo, as we are doing a lot of math experiments for application requirements, so I already had Knuth's twoSum algorithm implemented there.
The test is numerical_two_sum.
When you have a posit with es > 0, you end up having three distinct 'regions' of discretization:
When you are adding values that are encoded purely with regime bits (no exponent or fraction bits), i.e. near minpos/maxpos, you don't have enough bits to express the difference between the exact and the rounded result: that difference would be expressed in exponent bits, which are lopped off at that extreme operating point of the posit arithmetic.
Ahh I see, but I think it should be ok to declare that maxpos and minpos remain the maximum and minimum values, even when using add_exact(). I'm still working on trying to replicate your results with a little test...
Thinking further, it strikes me that in no case should add_exact fail: if it can do nothing else, it can just return its arguments, right?
I wrote a bruteforce implementation of posit8_add_exact() as a reference:
static posit8x2_t posit8_add_exact_bruteforce(posit8_t a, posit8_t b) {
    // searching for [x,y] where x+y = a+b and y is the smallest
    // obvious first step is swap a,b so that b is always <= a
    if (posit8_cmp(a, b) < 0) {
        posit8_t _a = a;
        a = b;
        b = _a;
    }
    double realsum = posit8_tod(a) + posit8_tod(b);
    posit8x2_t best;
    best.x = a;
    best.y = b;
    // nan special case
    if (isnan(realsum)) {
        // return a NAR and a ZERO so that someone running add_exact
        // as a mutative sorting comparator and discarding zeros
        // will end with 1 NAR only.
        best.x = NAR8;
        best.y = ZERO8;
        return best;
    }
    // search for a smaller value of y which satisfies x+y == a+b
    for (int i = 0; i <= 0xffff; i++) {
        posit8_t y = posit8_reinterpret(i >> 8);
        posit8_t x = posit8_reinterpret(i & 0xff);
        double sum = posit8_tod(x) + posit8_tod(y);
        if (sum != realsum) { continue; }
        if (posit8_cmp(y, best.y) < 0) {
            best.x = x;
            best.y = y;
        }
    }
    return best;
}
I'm not exactly sure how to improve it, but I think it's pretty obvious that it should never fail to return two values which sum to the same total as the inputs...
I fixed a few issues with this function and improved the performance a little bit. I think this is correct, though I'm really never that sure...
static posit8_t posit8_abs(posit8_t x) {
    if (x.v & 0x80) { return posit8_sub(ZERO8, x); }
    return x;
}
static posit8x2_t posit8x2(posit8_t x, posit8_t y) {
    posit8x2_t out;
    out.x = x;
    out.y = y;
    return out;
}
static posit8x2_t posit8_add_exact_bruteforce(posit8_t a, posit8_t b) {
    // searching for [x,y] where x+y = a+b and y is the smallest
    // obvious first step is swap a,b so that b is always <= a
    if (posit8_cmp(posit8_abs(a), posit8_abs(b)) < 0) {
        //printf("abs a < b %f %f\n", posit8_tod(a), posit8_tod(b));
        return posit8_add_exact_bruteforce(b, a);
    }
    double realsum = posit8_tod(a) + posit8_tod(b);
    // nan special case
    if (isnan(realsum)) {
        // return a NAR and a ZERO so that someone running add_exact
        // as a mutative sorting comparator and discarding zeros
        // will end with 1 NAR only.
        return posit8x2(NAR8, ZERO8);
    }
    // search for a smaller value of y which satisfies x+y == a+b
    // start scanning at 0 and work up switching between positive and negative
    // to find the value of smallest magnitude which can satisfy x+y == a+b
    uint8_t end = posit8_bits(b) & 0x7f;
    for (int i = 0; i < end; i++) {
        posit8_t py = posit8_reinterpret(i);
        posit8_t ny = posit8_sub(ZERO8, py);
        double dpy = posit8_tod(py);
        double dny = posit8_tod(ny);
        for (int j = 0; j < (1 << 8); j++) {
            posit8_t x = posit8_reinterpret(j);
            double dx = posit8_tod(x);
            if ((dx + dpy) == realsum) { return posit8x2(x, py); }
            if ((dx + dny) == realsum) { return posit8x2(x, ny); }
        }
    }
    printf("No better solution was found for %f %f\n", posit8_tod(a), posit8_tod(b));
    return posit8x2(a, b);
}
For "smaller" we really want smaller magnitude because the point is that you can use this as a mutative sorting comparator and it will collapse a list of numbers down to the shortest set which represents the same sum.
So for adding minpos + minpos, for posit configurations that have exponent bits, the sum is minpos, and you would not have a value that is smaller than minpos. In that situation, which I can recognize mathematically, if I return minpos and minpos, will your mutative sorting still work?
Also, I am not familiar with mutative sorting; can you provide a link you think is good to get educated?
If I return minpos and minpos, will your mutative sorting still work? <-- yes
"Mutative sorting comparator" is just a term I invented now to describe a form of addition which uses a sorting-like approach. In general this is a big no-no, but what you can potentially do (depending on the sorting function) is actually change the inputs inside the comparator, e.g.
static int compare(const void *a, const void *b) {
    posit8x2_t out = posit8_add_exact(((posit8_t *)a)[0], ((posit8_t *)b)[0]);
    ((posit8_t *)a)[0] = out.x;
    ((posit8_t *)b)[0] = out.y;
    // we will never return -1 because we already effectively did the swap
    // (compare the raw bits: posit8_t is a struct, so == alone won't compile)
    return (out.x.v == out.y.v) ? 0 : 1;
}
...later on
qsort(list, list_len, sizeof(posit8_t), compare);
Again, it is generally a Bad Idea to rely on the quicksort implementation not to freak out when we modify its inputs out from under it, but AFAIK in typical single-threaded sorting algorithms this will actually work, and if you want something more robust you can copy a well-known sorting algorithm and use it to build an efficient adder. The result after "sorting" an array of posits should be a bunch of zeros followed by the shortest run of posits which still sum to the same value.
An additional feature of this pattern is if you have a CUDA sort function, you can pretty quickly build a CUDA adder from it...
Would you be able to give me a sorting algorithm that you want to work with this add_exact, so that I can use it as a target for the development and testing of these two functions: add_exact and sub_exact?
I should be able to put together a basic test, and I expect that (ab)using libc qsort will work on our systems....
WDYT of this as a way of implementing it:
static posit8x2_t posit8_add_exact1(posit8_t a, posit8_t b) {
    // if either is nar, return (nar,0)
    if (posit8_isnar(a) | posit8_isnar(b)) { return posit8x2(NAR8, ZERO8); }
    // b should always be less than or equal to a, in magnitude
    if (posit8_cmp(posit8_abs(a), posit8_abs(b)) < 0) {
        posit8_t _a = a;
        a = b;
        b = _a;
    }
    // if b is zero then return a,0
    if (!posit8_cmp(b, ZERO8)) { return posit8x2(a, ZERO8); }
    double dsum = posit8_tod(a) + posit8_tod(b);
    posit8_t sum = posit8_fromd(dsum);
    double rdsum = posit8_tod(sum);
    double drem = dsum - rdsum;
    posit8_t rem = posit8_fromd(drem);
    double rdrem = posit8_tod(rem);
    double remrem = drem - rdrem;
    // the remainder itself must round-trip exactly, or (sum, rem) is not exact
    if (remrem != 0) {
        printf("inexact\n");
        abort();
    }
    return posit8x2(sum, rem);
}
With that implementation of add_exact I get a nicer add time:
[100%] Linking CXX executable c_api_exact_test
[100%] Built target c_api_exact_test
750.265625 == 750.265625, 1024 posits compressed to 15 in 10 cycles 485 micros
515.859375 == 515.859375, 1024 posits compressed to 13 in 10 cycles 547 micros
108.203125 == 108.203125, 1024 posits compressed to 4 in 9 cycles 520 micros
928.515625 == 928.515625, 1024 posits compressed to 19 in 11 cycles 518 micros
310.234375 == 310.234375, 1024 posits compressed to 9 in 9 cycles 490 micros
353.000000 == 353.000000, 1024 posits compressed to 7 in 10 cycles 498 micros
444.843750 == 444.843750, 1024 posits compressed to 11 in 10 cycles 506 micros
753.218750 == 753.218750, 1024 posits compressed to 15 in 9 cycles 496 micros
445.218750 == 445.218750, 1024 posits compressed to 11 in 10 cycles 467 micros
370.796875 == 370.796875, 1024 posits compressed to 8 in 9 cycles 469 micros
610.593750 == 610.593750, 1024 posits compressed to 13 in 10 cycles 487 micros
-33.093750 == -33.093750, 1024 posits compressed to 5 in 11 cycles 507 micros
485.187500 == 485.187500, 1024 posits compressed to 11 in 11 cycles 488 micros
674.203125 == 674.203125, 1024 posits compressed to 13 in 10 cycles 569 micros
785.718750 == 785.718750, 1024 posits compressed to 17 in 10 cycles 542 micros
518.453125 == 518.453125, 1024 posits compressed to 11 in 10 cycles 539 micros
500.578125 == 500.578125, 1024 posits compressed to 12 in 10 cycles 528 micros
837.859375 == 837.859375, 1024 posits compressed to 17 in 9 cycles 564 micros
154.781250 == 154.781250, 1024 posits compressed to 5 in 10 cycles 865 micros
256.687500 == 256.687500, 1024 posits compressed to 5 in 11 cycles 783 micros
392.875000 == 392.875000, 1024 posits compressed to 9 in 10 cycles 849 micros
390.078125 == 390.078125, 1024 posits compressed to 9 in 10 cycles 862 micros
335.250000 == 335.250000, 1024 posits compressed to 7 in 9 cycles 743 micros
351.984375 == 351.984375, 1024 posits compressed to 7 in 10 cycles 703 micros
587.328125 == 587.328125, 1024 posits compressed to 12 in 10 cycles 726 micros
363.421875 == 363.421875, 1024 posits compressed to 9 in 10 cycles 751 micros
891.812500 == 891.812500, 1024 posits compressed to 17 in 10 cycles 754 micros
570.468750 == 570.468750, 1024 posits compressed to 13 in 9 cycles 719 micros
469.593750 == 469.593750, 1024 posits compressed to 11 in 9 cycles 728 micros
488.250000 == 488.250000, 1024 posits compressed to 11 in 9 cycles 796 micros
719.984375 == 719.984375, 1024 posits compressed to 15 in 10 cycles 787 micros
-155.312500 == -155.312500, 1024 posits compressed to 5 in 9 cycles 783 micros
Average time = 627
I've implemented a more generalized add_exact() function; here it converts to posit16_t rather than to double in order to do the addition and subtraction. The idea is that if we can exhaustively verify posit8 and posit16, then we can hope to believe that posit32 and posit64 are OK.
I also fixed a bug in add_exact_bruteforce() caused by my seemingly chronic incorrect assumption that I can flip the sign just by flipping a bit...
In order to implement an API which can be fulfilled by a stateless FPU yet is able to efficiently support quire operations, I need the following operations:
- posit[2] posit_add_exact(posit a, posit b): a and b of the same parameters; returns max(a,b), min(a,b)
- posit[2] posit_sub_exact(posit a, posit b): same as add_exact with b negated
- posit<nbits*2,es+1> posit_mul_promote(posit a, posit b): a and b of the same parameters
- posit<nbits*2,es+1> posit_div_promote(posit a, posit b)
- posit posit_frexp(posit a, int* exp_out)
- posit posit_ldexp(posit a, int exp)