milankl / SoftPosit.jl

A posit arithmetic emulator.
MIT License

conversion Float32 <-> Posit16 #55

Closed milankl closed 2 years ago

milankl commented 2 years ago

https://twitter.com/ProjectPhysX/status/1444007905058070531?s=20 suggests a new conversion between Float32 and Posit16

ProjectPhysX commented 2 years ago

Hi all,

I'm researching posits as a potential storage format for the DDFs in the lattice Boltzmann method. For that I needed much faster conversion algorithms than were previously available, especially ones suitable for GPUs. I wrote these conversion algorithms to be branchless, using only basic bit manipulation. They are much shorter than the existing SoftPosit conversion, and also much faster: tests with them integrated into the LBM showed a 22x performance gain compared to the existing SoftPosit implementation.

Below is the validation plot, showing digits = -log10(fabs(log10(x_converted_back_and_forth/x))), proving the conversion is bitwise identical and handles rounding/overflow correctly. Note that I have ditched the infinity and NaR definitions to spare some operations.

The purpose of these algorithms really is to be as fast as possible for applications where lots of conversions are needed, for example compressing numbers for storage in memory.

Regards, Moritz Lehmann

[figure: "Posit" validation plot]

typedef unsigned int uint;
typedef unsigned short ushort;
float as_float(const uint x) {
    return *(float*)&x;
}
uint as_uint(const float x) {
    return *(uint*)&x;
}
int as_int(const float x) {
    return *(int*)&x;
}

ushort float_to_p160(const float x) {
    const uint b = as_uint(x);
    const int e = ((b&0x7F800000)>>23)-127; // exponent-bias
    int m = (b&0x007FFFFF)>>9; // mantissa
    const int v = abs(e); // shift
    const int r = (e<0 ? 0x0002 : 0xFFFE)<<(13-v); // generate regime bits
    m = ((m>>(v-(e<0)))+1+(e<-13)*0x2)>>1; // rounding: add 1 after truncated position; in case of lowest numbers, saturate
    return (b&0x80000000)>>16 | (e>-16)*((r+m)&0x7FFF) | (e>13)*0x7FFF; // sign | regime+mantissa ("+" handles rounding overflow) | saturate
}
float p160_to_float(const ushort x) {
    const uint sr = (x>>14)&1; // sign of regime
    ushort t = x<<2; // remove sign and first regime bit
    t = sr ? ~t : t; // positive regime r>=0 : negative regime r<0
    const int r = 142-(as_int((float)t)>>23); // evil log2 bit hack to count leading zeros for regime
    const uint m = (x<<(r+10))&0x007FFFFF; // extract mantissa and bit-shift it in place
    const int rs = sr ? r : -r-1; // negative regime r<0 : positive regime r>=0
    return as_float((x&0x8000)<<16 | (r!=142)*((rs+127)<<23 | m)); // sign | regime | mantissa; r==142 flags t==0, i.e. zero input
}

ushort float_to_p161(const float x) {
    const uint b = as_uint(x);
    const int e = ((b&0x7F800000)>>23)-127; // exponent-bias
    int m = (b&0x007FFFFF)>>10; // mantissa
    const int ae = abs(e);
    const int v = ae>>1; // shift, ">>1" is the same as "/2"
    const int e2 = ae&1; // "&1" is the same as "%2"
    const int r = ((e<0 ? 0x0002 : 0xFFFE<<e2)+e2)<<(13-v-e2); // generate regime bits, merge regime+exponent and shift in place
    m = ((m>>(v-(e<0)*(1-e2)))+(e>-28)+(e<-26)*0x3)>>1; // rounding: add 1 after truncated position; in case of lowest numbers, saturate
    return (b&0x80000000)>>16 | (e>-31)*((r+m)&0x7FFF) | (e>26)*0x7FFF; // sign | regime+exponent+mantissa ("+" handles rounding overflow) | saturate
}
float p161_to_float(const ushort x) {
    const uint sr = (x>>14)&1; // sign of regime
    ushort t = x<<2; // remove sign and first regime bit
    t = sr ? ~t : t; // positive regime r>=0 : negative regime r<0
    const int r = 142-(as_int((float)t)>>23); // evil log2 bit hack to count leading zeros for regime
    const uint e = (x>>(12-r))&1; // extract exponent bit and shift it in place
    const uint m = (x<<(r+11))&0x007FFFFF; // extract mantissa and bit-shift it in place
    const int rs = (sr ? r : -r-1)<<1; // negative regime r<0 : positive regime r>=0, "<<1" is the same as "*2"
    return as_float((x&0x8000)<<16 | (r!=142)*((rs+e+127)<<23 | m)); // sign | regime+exponent | mantissa; r==142 flags t==0, i.e. zero input
}

EDIT: Flush to zero now works correctly. EDIT 2: simplified evil log2 bit hack part
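A side note on the `as_float`/`as_uint` helpers above: pointer casts like `*(float*)&x` technically violate C's strict-aliasing rule. A memcpy-based sketch (equivalent in behavior; compilers optimize the copy away to a single register move) avoids that:

```c
#include <stdint.h>
#include <string.h>

// Bit-cast between float and uint32_t without aliasing violations;
// the memcpy is compiled away on all mainstream compilers.
static inline float as_float_safe(uint32_t x) {
    float f; memcpy(&f, &x, sizeof f); return f;
}
static inline uint32_t as_uint_safe(float x) {
    uint32_t u; memcpy(&u, &x, sizeof u); return u;
}
```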

milankl commented 2 years ago

Instead of `int r = 158-(as_int((float)((uint)t<<16))>>23);` I would write `r = leading_zeros(t)` in Julia. Is r the actual number of regime bits or does it exclude the first regime bit that you removed?

ProjectPhysX commented 2 years ago

I just simplified that part of the algorithm again (I made t a ushort instead of an int to eliminate the `(uint)t<<16`).

`r = leading_zeros(t)` is equivalent to `r = 158-(as_int((float)t)>>23);` when t is a 32-bit integer. In Julia you can do `r = leading_zeros(t)-16` instead of `r = 142-(as_int((float)t)>>23);`.

r is the number of bits in the regime minus 1, so it excludes the regime termination bit.
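The equivalence of the two leading-zero counts can be checked directly; a sketch (assuming t fits in 16 bits and t > 0, since t == 0 makes the float trick return 142 instead of 16):

```c
#include <stdint.h>
#include <string.h>

static uint32_t as_uint_bits(float x) { uint32_t u; memcpy(&u, &x, sizeof u); return u; }

// Reference: count leading zeros of a 16-bit value by shifting.
static int clz16(uint16_t t) {
    int n = 16;
    while (t) { t >>= 1; n--; }
    return n;
}

// "Evil log2 bit hack": for t > 0, the exponent field of (float)t is
// 127 + floor(log2(t)), so 142 - exponent == 15 - floor(log2(t)) == clz16(t).
// (Exact because any 16-bit integer fits in float's 24-bit mantissa.)
static int clz16_float_hack(uint16_t t) {
    return 142 - (int)(as_uint_bits((float)t) >> 23);
}
```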

milankl commented 2 years ago

Quick check: does your version work for negative numbers too? You don't seem to implement the two's complement, do you? And the rounding is just round-to-nearest with ties away from zero, right? Because I believe the posit standard implements ties-to-even, as for floats.

milankl commented 2 years ago

Okay, that's my take on this conversion. I don't need any branches (neither if nor ?), although I think it actually overflows where it shouldn't, and at the moment underflows occur at minpos/4 and not at minpos/2 (but I think SoftPosit never underflows in conversion???)

https://github.com/milankl/SoftPosit.jl/blob/d51c9f9a221d5d0061a46f0fde06fa5343104de8/src/conversionFloatToPosit.jl#L36-L68

It's currently written more for readability than for performance. On an i5 Ice Lake I get the following timings (Julia v1.6.3):

julia> using SoftPosit, BenchmarkTools
julia> A = rand(Float32,1000000)
julia> @btime Posit16_new.($A)     # new conversion in pure Julia
  7.020 ms (2 allocations: 1.91 MiB)
julia> @btime Posit16.($A)         # SoftPosit's C conversion
  89.033 ms (2 allocations: 1.91 MiB)

So roughly a 12-13x performance gain. However, if I disable the final two's complement (so that only positive floats are correctly converted), I get to 100x speedups:

julia> @btime Posit16_new.($A);
  866.962 μs (5 allocations: 1.91 MiB)

So there's clearly room for improvement. Will try and look at that next week. @ProjectPhysX what are your C timings?

ProjectPhysX commented 2 years ago

My conversion works for negative numbers as well. I handle only positive numbers in the conversion and then copy over the sign bit. Positive and negative numbers are completely symmetric this way.

For rounding I do round-to-nearest-even, i.e. add 1 to the leftmost bit of what is truncated from the mantissa, for example to 01010|10 I add 00000|10 to get 01011|00 and then I truncate to 01011. This is equivalent to correct integer rounding if you would treat the mantissa like an integer, except for edge cases where it is a tie. For example this rounds 0.4999 to 0, 0.5001 to 1 and 0.5000 to 1 as well. I think this is different to the floating-point and also Posit standard, but it is equally accurate and makes the implementation much simpler.

As for timings, for P16_1 I get 54ns per back-and-forth conversion for SoftPosit and 9.5ns for my implementation, so speedup of 5.8x. For P16_0 I don't have a reference, but back-and-forth conversion takes 8.5ns. I have an 8700K at ~4.4GHz. On the GPU, with the conversion embedded in a lattice Boltzmann algorithm, I get ~20x speedup on my Nvidia Titan Xp. Here having branchless code makes a much larger difference as all threads within a workgroup need to execute both branches if only a single thread deviates.

milankl commented 2 years ago

> My conversion works for negative numbers as well. I handle only positive numbers in the conversion and then copy over the sign bit. Positive and negative numbers are completely symmetric this way.

This is true for floats, but not for posits, which use two's complement for negative numbers. For example, maxpos is 0111...1 but -maxpos is 100...01. It seems that you are only interested in conversions, in which case you could change the definition of posits to a sign-magnitude representation. However, what do you do then with NaR? 100...0 would then be -0, and you'd need to steal another bit pattern for NaR.
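For reference, negation under the standard two's-complement posit encoding is just integer negation of the bit pattern; a minimal sketch for 16-bit posits:

```c
#include <stdint.h>

// Negate a posit by taking the two's complement of its bits.
// NaR (0x8000) and zero (0x0000) are their own negations.
static uint16_t posit16_neg(uint16_t p) {
    return (uint16_t)(~p + 1u); // equivalently (uint16_t)(0 - p)
}
```

With this encoding, maxpos 0x7FFF negates to 0x8001, matching the -maxpos pattern above.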

> For rounding I do round-to-nearest-even, i.e. add 1 to the leftmost bit of what is truncated from the mantissa, for example to 01010|10 I add 00000|10 to get 01011|00 and then I truncate to 01011. This is equivalent to correct integer rounding if you would treat the mantissa like an integer, except for edge cases where it is a tie. For example this rounds 0.4999 to 0, 0.5001 to 1 and 0.5000 to 1 as well. I think this is different to the floating-point and also Posit standard, but it is equally accurate and makes the implementation much simpler.

Note that with your method a (positive) tie like 00|10 is always rounded up to 01|00; for integers this corresponds to 0.5 to 1, 1.5 to 2, 2.5 to 3, etc. For floats this means all ties are rounded away from zero (sign-magnitude representation); for posits (two's complement), all ties would be rounded up, since their ordering is reversed for negative numbers due to the two's complement. To remove this bias, ties-to-even is part of the float and also of the posit standard (https://posithub.org/docs/posit_standard.pdf, 4.1). It sounds a bit irrelevant, but the chances of hitting a tie with float or posit arithmetic are surprisingly large due to their piecewise-linear distribution of representable numbers.
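The two schemes only differ on exact ties; a sketch of both roundings on an integer mantissa (shift = number of bits truncated):

```c
#include <stdint.h>

// Round m >> shift to nearest, ties away from zero
// (the add-half-then-truncate scheme described above).
static uint32_t round_tie_up(uint32_t m, int shift) {
    return (m + (1u << (shift - 1))) >> shift;
}

// Round m >> shift to nearest, ties to even (IEEE 754 / posit standard).
static uint32_t round_tie_even(uint32_t m, int shift) {
    uint32_t half = 1u << (shift - 1);
    uint32_t r = (m + half) >> shift;
    if ((m & ((1u << shift) - 1)) == half) r &= ~1u; // exact tie: clear the LSB
    return r;
}
```

With shift = 1, m = 1 (i.e. 0.5) rounds to 1 under ties-up but to 0 under ties-to-even, while m = 3 (1.5) rounds to 2 under both; the 01010|10 example above yields 01011 under ties-up but 01010 under ties-to-even.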

> As for timings, for P16_1 I get 54ns per back-and-forth conversion for SoftPosit and 9.5ns for my implementation, so speedup of 5.8x. For P16_0 I don't have a reference, but back-and-forth conversion takes 8.5ns. I have an 8700K at ~4.4GHz. On the GPU, with the conversion embedded in a lattice Boltzmann algorithm, I get ~20x speedup on my Nvidia Titan Xp. Here having branchless code makes a much larger difference as all threads within a workgroup need to execute both branches if only a single thread deviates.

Does that mean you time a single back-and-forth conversion, or did you apply this to a whole array? It looks like with my implementation I could get below 1ns one way, but as always comparisons are difficult as I'm just using a 2GHz i5...

milankl commented 2 years ago

Okay, I got it down to

julia> A = randn(Float32,1000000);
julia> @btime Posit16_new.($A);
  850.299 μs (5 allocations: 1.91 MiB)

meaning 0.85ns per conversion (one way only at the moment), about 110x faster than SoftPosit. There was another shortcut in the last two's complement. Let me know whether you can get any faster.

ProjectPhysX commented 2 years ago

Thanks for all your thoughts! Yes, I only want the sign-magnitude representation for my purposes, so I ditched NaR. Whether ties round up or down is also not relevant for my purposes; I use posits only as a compressed storage format, not for arithmetic. I measured timings over about 30M back-and-forth conversions and divided by 30M to get the time per conversion.

milankl commented 2 years ago
julia> using SoftPosit, BenchmarkTools

julia> function f(A)
           @inbounds for i in eachindex(A)
               A[i] = Float32_new(Posit16_new(A[i]))
           end
       end
f (generic function with 1 method)

julia> A = randn(Float32,1000000);
julia> @btime f($A);
  1.895 ms (0 allocations: 0 bytes)

Got the back-and-forth conversion (without memory allocation this time) down to <2ns on my MacBook Air (i5 10th gen, Ice Lake, 1.1GHz); I get similar results (2.8ns) on an older i7 (7th gen, Kaby Lake, 3.6GHz). The code is

https://github.com/milankl/SoftPosit.jl/blob/9a27ce4259a10b92340a4755337edea03a17333f/src/conversionPositToFloat.jl#L20-L45

Feel free to use this version or investigate further where the speed differences are!

milankl commented 2 years ago

I've just made the new conversion the default for SoftPosit v0.4 with #61. It still includes slight deviations from the posit (draft) standard in the special cases of the rounding mode, which are now described in the README.md.

milankl commented 2 years ago

The no overflow rounding mode

https://github.com/milankl/SoftPosit.jl/blob/817f97743a6a2a7380a70d0721c4d9c9b69909ae/src/conversionFloatToPosit.jl#L38

made things a bit more expensive, as I had to prevent rounding to NaR for [maxpos(Posit16), floatmax(Float32)], but it's still below 3ns for the back-and-forth conversion.
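The no-overflow idea can be sketched as a saturation step (a hypothetical helper, not the actual SoftPosit.jl code): any Float32 magnitude at or above maxpos(Posit16) = 2^28 (es = 1) maps to the maxpos bit pattern 0x7FFF rather than rounding up to NaR 0x8000.

```c
#include <stdint.h>
#include <math.h>

// Hypothetical saturation step for Posit16 (es = 1): clamp the rounded
// magnitude to maxpos (0x7FFF) whenever |x| >= maxpos = 2^28, so large
// finite floats never round over into NaR (0x8000).
uint16_t clamp_posit16(float x, uint16_t rounded_bits) {
    if (fabsf(x) >= ldexpf(1.0f, 28)) return 0x7FFF;
    return rounded_bits;
}
```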

milankl commented 2 years ago

Closed with #61