m-ou-se / blog

My blog. https://blog.m-ou.se/

floats/ #8

Open utterances-bot opened 2 years ago

utterances-bot commented 2 years ago

Converting Integers to Floats Using Hyperfocus - Mara's Blog

A few years ago, due to some random chain of events, I ended up implementing a conversion from 128-bit integers to 64-bit floats. This would've been a complete waste of time, except that my final version is faster than the builtin conversion of every compiler I tested. In this blog post, I'll explain what happened, how floats work, how this conversion works, and how it got a bit out of hand.

https://blog.m-ou.se/floats/

Dushistov commented 2 years ago

A concrete example: the value 9.75 in binary looks like 1001.11, so it is represented as a f64 with an exponent of 1026,

The exponent for 9.75 is just 4 (base 10), or 100 (base 2), so where does 1026 come from?

m-ou-se commented 2 years ago

The exponent for 9.75 is just 4 (base 10), or 100 (base 2), so where does 1026 come from?

@Dushistov

If you check the table with exponents:

   :   \
1026    1mmm.mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1025     1mm.mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1024      1m.mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1023       1.mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1022       0.1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1021       0.01mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1020       0.001mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
   :            \

you'll see the row with 1026 matches the binary representation of 9.75:

1026    1mmm.mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
        1001.11000… (=0.75)

Another way of saying it, is that 9.75 is 1001.11 = 1.00111 × 2³, and 3 + 1023 = 1026.
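A quick way to check this (a sketch, using Rust's `f64::to_bits` to inspect the raw IEEE-754 bits): extract the 11-bit exponent field of 9.75 and confirm it is 1026.

```rust
fn main() {
    // The exponent field occupies bits 52..63 of an f64.
    let bits = (9.75f64).to_bits();
    let exponent = (bits >> 52) & 0x7FF; // drop the mantissa, mask off the sign bit
    assert_eq!(exponent, 1026); // 3 (the real exponent) + 1023 (the bias)
    println!("exponent field = {exponent}");
}
```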

SUPERCILEX commented 2 years ago

Can these be turned back into compiler built-ins? Or is there a reason not to do that?

lilly-lizard commented 2 years ago

what a gorgeous project! I can see why you put so much time into it, lots of unholy fun here.

I found that your final code is missing a bracket for the mantissa calculation and fails my tests as a result :'(

let m = a + (b - (b >> 63 & !a)) >> 63;

should be:

let m = a + ((b - (b >> 63 & !a)) >> 63);

because `+` has higher precedence than `>>`

(although I personally prefer the following:)

let m = ((a) + (((b) - (((b) >> (63)) & (!a))) >> (63)));
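For anyone who wants to see the precedence difference concretely, here is a minimal sketch with small stand-in values rather than the real mantissa math:

```rust
fn main() {
    let (a, b): (u64, u64) = (1, 2);
    // In Rust, `+` binds tighter than `>>`, so `a + b >> 1` parses as `(a + b) >> 1`.
    assert_eq!(a + b >> 1, (a + b) >> 1); // 3 >> 1 == 1
    // The intended grouping needs explicit parentheses:
    assert_eq!(a + (b >> 1), 2); // 1 + (2 >> 1) == 2
    println!("ok");
}
```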

m-ou-se commented 2 years ago

@lilly-lizard Oh oops. Those disappeared while editing this blog post somehow. Fixed. :)

m-ou-se commented 2 years ago

Can these be turned back into compiler built-ins? Or is there a reason not to do that?

@SUPERCILEX Yeah, I opened a PR for the compiler_builtins crate two years ago, but got distracted and never finished it. It should probably also end up in Clang's compiler-rt and GCC's libgcc, though, for it to be useful on all platforms.

SUPERCILEX commented 2 years ago

Haha, makes sense!

scips commented 2 years ago

I don't know how I ended up reading your blog on a Sunday morning, but reading you gave me a great refreshing kick on how one person can share something that will speed up the work of millions. Thanks for sharing! And I love the small notes of humor all along the journey. I was almost feeling what you felt with the "Boom" 💥, the "Uh", the "wait"... and I was thinking about how you wrote your post while revisiting your journey. Sharing is caring! Have a nice day!

DavidHarper commented 2 years ago

I'm reminded of some code that I wrote more than 30 years ago when I needed to convert a tape full of binary data from IBM VM/370 native 64-bit float to IEEE-754 64-bit float, without loss of any accuracy. The VM/370 used a base-16 exponent, and therefore (unlike IEEE-754) no implied leading 1. I could have written a FORTRAN program to dump the data in decimal format on the IBM and read it back on the target machine, but that seemed way too easy, and I was young and cocky, so I read the IBM assembler manual and got to work. I still have that code. Here's a gist of it, but good luck finding a machine that will run it: https://gist.github.com/DavidHarper/ba20799c6b72d6131f263316d53723fb

steven-wheeler commented 2 years ago

This reminds me of dealing with the 32-bit float format about 25 years ago, when I was trying to implement as much floating-point as I could fit into a BASIC interpreter running on a PIC16C74 processor.

lowasser commented 2 years ago

Enjoyed this a lot! Once upon a time I did something similar for Java's BigInteger.doubleValue(): https://bugs.openjdk.java.net/browse/JDK-7131192. But you managed to make yours branch-free, and much smoother and more compact! Very nice.

inicula commented 1 month ago

Hello! Nice write-up!

I was playing around a bit with your optimizations, and tried using predication instead of ifs. Have you looked into doing this by any chance?

I did a (quite primitive) benchmark and I can't really reproduce the result in the original benchmark suite, but I'm curious if my change is actually an improvement (at least for x64). For the random inputs that I've tried, the 0.6-0.7ns speed gain is consistent.

Let me know what you think. :)

Note: my change is essentially:

diff --git a/f.rs b/f.rs
index c69d463..010702c 100644
--- a/f.rs
+++ b/f.rs
@@ -4,6 +4,6 @@ pub fn u128_to_f64(x: u128) -> f64 {
     let a = (y >> 75) as u64;
     let b = (y >> 11 | y & 0xFFFFFFFF) as u64;
     let m = a + ((b - (b >> 63 & !a)) >> 63);
-    let e = if x == 0 { 0 } else { 1149 - n as u64 };
+    let e = (x != 0) as u64 * (1149 - n as u64);
     f64::from_bits((e << 52) + m)
 }
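The two forms of the exponent computation should agree for every input, since `1149 - n` is multiplied by 0 exactly when the branch would have returned 0. A small equivalence check (a sketch only, not a benchmark — it says nothing about which form is faster):

```rust
fn main() {
    for &x in &[0u128, 1, 9, 1 << 63, u128::MAX] {
        let n = x.leading_zeros() as u64; // 128 for x == 0
        let e_branch: u64 = if x == 0 { 0 } else { 1149 - n };
        let e_pred: u64 = (x != 0) as u64 * (1149 - n);
        assert_eq!(e_branch, e_pred);
    }
    println!("ok");
}
```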