jk-jeon / dragonbox

Reference implementation of Dragonbox in C++
Apache License 2.0

Looking for advice - how to convert float to double, keeping to_string(double) same as to_string(float) #28

Closed scherepanov closed 2 years ago

scherepanov commented 2 years ago

Hi, can you give some advice on how to convert a float (4 bytes) to a double (8 bytes)?

I want to convert float -> double in such a way that to_string(double) produces the same string as the original to_string(float).

Any advice would be really appreciated.

My current idea is to use dragonbox to convert the float to chars, and then use some fast chars->double converter with a round-trip guarantee.

It would be very nice if you told me that my idea is stupid and described a way that is much better.

Thanks in advance!

jk-jeon commented 2 years ago

Firstly, why do you need that? Because it sounds like you are likely trying to do something unnecessary or even wrong. Is it something absolutely necessary, or just to appeal to some vague aesthetics?

My current idea is to use dragonbox to convert the float to chars, and then use some fast chars->double converter with a round-trip guarantee.

Sounds like a reasonable approach, but I don't know if it is guaranteed to work. It is likely that if you apply dragonbox to the resulting double then you will get the same string, but I'm not totally sure. If that is the case, a rigorous proof of it is probably not so simple. Or there might be some counterexamples that you need to handle specially.

By the way, if your goal is to match the results of std::to_string, then the situation is very different. Note that by default std::to_string does NOT have a roundtrip guarantee, because its result is required to be the same as that of std::sprintf(buf, "%f", x); see the reference for std::to_string. If you look at the specification of std::sprintf, with %f the result is always in fixed-point form with exactly 6 digits after the decimal point (the default precision). This is radically different from what dragonbox spits out. Another caveat is that std::to_string is locale-dependent. In fact, there is a proposal which tries to make std::to_string locale-independent and also do what dragonbox does.
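To make the difference concrete, here is a minimal sketch of what "%f" formatting does to a float. It calls std::snprintf directly, since that is the format std::to_string is specified in terms of; the helper name formatFixed is made up for this illustration, and the default "C" locale is assumed.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats a float the way "%f" does, i.e. in
// fixed-point notation with the default precision of 6 decimal digits.
std::string formatFixed(float f) {
    char buf[64];  // large enough for any finite float printed with "%f"
    std::snprintf(buf, sizeof buf, "%f", static_cast<double>(f));
    return std::string(buf);
}
```

With this helper, formatFixed(3904.2532f) yields "3904.253174" and formatFixed(1.5e-7f) yields "0.000000". A shortest round-trip formatter such as Dragonbox should instead produce 3.9042532E3 and 1.5E-7 for the same two inputs.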

scherepanov commented 2 years ago

@jk-jeon I really appreciate your comment. Sorry I did not notice it in a timely manner.

I am loading data into a database, and the database internally has only double. The current load uses Java toString on the client, converting the source Java float into a string. On the database side, some internal parser converts it to double.

Next I wanted to make it faster, so I switched to a binary format to transfer data to the database, for which I wrote my own C++ parser. The float arrives in binary format, and I naively did a static_cast from float to double. It worked perfectly and much faster... I was happy until users barked back. They used to compare in SQL with "price = 10.1", and it stopped working; all their SQL stopped working.

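I had to resort to a horrible workaround:

    #include <cstdio>
    #include <cstdlib>

    double floatToDouble(float f) {
        char sa[64];
        // Print with "%f": fixed-point, 6 digits after the decimal point
        std::sprintf(sa, "%f", f);
        char* end;
        // Parse the decimal string back as a double
        return std::strtod(sa, &end);
    }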

Users are happy, but now the performance of my shiny binary parser lags behind the built-in CSV parser... which makes switching to the binary format completely pointless.
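One thing worth noting about the "%f" workaround above: it only matches the shortest decimal form when the value needs no more than 6 digits after the point. The sketch below repeats the same idea with std::snprintf so the behavior can be checked on a few inputs; the function name is made up for this illustration, and the default "C" locale is assumed.

```cpp
#include <cstdio>
#include <cstdlib>

// Same idea as the workaround: print with "%f", then parse as double.
// snprintf is used instead of sprintf to bound the write.
double floatToDoubleViaPercentF(float f) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%f", static_cast<double>(f));
    return std::strtod(buf, nullptr);
}
```

For a typical price, floatToDoubleViaPercentF(10.1f) == 10.1 holds. But floatToDoubleViaPercentF(3904.2532f) gives 3904.253174 rather than 3904.2532, and floatToDoubleViaPercentF(1.5e-7f) collapses to 0.0, so the approach only fits data with few significant digits.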

My goal is to convert float to double the same way as it would be through Java toString.

Your advice would be very appreciated.

Another case is when I have a fixed decimal, i.e. a long int and the number of decimal digits after the point. Something like 101 and 1 would be 10.1. I implemented the conversion to double in a similar way, except that I form the string by extracting the whole and fraction parts and formatting them into a string. Performance is horrible.

I am not looking for a round-trip guarantee, just to match the result of Java toString and the subsequent conversion from string to double.

My question is why I need to convert to chars at all. Both dragonbox and fixed decimal allow me to produce the whole and fraction parts from the source value. Can you recommend how I can convert whole and fraction into a double, without an intermediate conversion to string? My understanding is that going through a string adds overhead, because it takes resources to convert to a string and then to parse the string back into whole and fraction. Avoiding chars should save quite a lot of resources.

Can you recommend some fast converter (whole, fraction) -> double? Or do I have to go through chars?

Your advice is really appreciated.

jk-jeon commented 2 years ago

They used to compare in SQL with "price = 10.1", and it stopped working; all their SQL stopped working.

Ooohh. That's creepy!

Anyway, I see. So the goal is to match the spec of Java's toString, not that of std::to_string, and if I understood you correctly, you don't have the access to (or don't want to change) the database side's parser, right?

For some reason googling "Java Float toString" doesn't give me the precise spec of it, do you know of a reference?

Can you recommend some fast converter (whole, fraction) -> double? Or do I have to go through chars?

This is not a trivial question if you want to be absolutely correct. To make it precise, is it correct that fraction is given as an array of decimal digits of indefinite length?

scherepanov commented 2 years ago

Thank you for answering, it is very much appreciated!

My use case is not that complicated. It is stock prices, and typically there are a few digits before the point and two after, like 345.67. It is represented as a float, or as a fixed decimal. The fixed decimal is an 8-byte int, with 8 digits after the point. To get the whole part you do X / 100000000, and for the fraction you do X % 100000000. Nothing fancy like an array of digits of indefinite length. Both Java toString and C++ std::sprintf produce the same result for me.

I will have whole and fraction for both cases - from float (using dragonbox) and from fixed decimal (by divide and mod).

Question - is there any way to convert whole and fraction to double, without converting to chars?

Whole is max 5 digits, fraction is always 2, sometimes 5.

scherepanov commented 2 years ago

Dragonbox rocks!!! It is so fast that I thought I had an error in my code :-) I am using Dragonbox on the FLOAT -> DOUBLE path.

I tried to convert a decimal into chars using the Dragonbox library, but this method is not exposed as public API and I was not able to overcome C++. It appears to be a trivial task, and my own very naive decimal -> chars implementation is very fast (I believe yours is faster!).

Now I am looking into fast_float to convert chars into double. Still, the question remains: can I skip converting into chars and go decimal -> double directly, without intermediate chars?

scherepanov commented 2 years ago

Thanks for advice, everything works fine for me!

jk-jeon commented 2 years ago

The fixed decimal is an 8-byte int, with 8 digits after the point.

So it is representable by a 64-bit int without any loss, by adding the fraction to the whole part times 100000000. Probably, you can just compute static_cast<double>(N) * 1e-8, where N is the 64-bit int you get from the previous step, to convert it into double. But I have to think about whether the result is always guaranteed to be correctly rounded.
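One way to sidestep the rounding question, sketched here under stated assumptions, is to divide by 1e8 instead of multiplying by 1e-8. The constant 1e8 is exactly representable as a double, and IEEE 754 division is correctly rounded, so as long as |N| < 2^53 (prices below roughly 90 million with 8 decimals) the quotient is the double nearest to the decimal value, which is what a correct string parser would produce. Multiplying by 1e-8 rounds twice (once for the constant, once for the product), so it carries no such guarantee. The helper name is hypothetical, and the sketch assumes double arithmetic is evaluated in plain IEEE binary64 (no extended-precision intermediates).

```cpp
#include <cstdint>

// Hypothetical helper: n is the price scaled by 10^8 (8 digits after the point).
// Assumes |n| < 2^53, so the conversion of n to double is exact.
double fixedDecimalToDouble(std::int64_t n) {
    // Both operands are exact doubles, so this single division is the only
    // rounding step and yields the correctly rounded result.
    return static_cast<double>(n) / 1e8;
}
```

For example, fixedDecimalToDouble(1010000000) compares equal to 10.1, and fixedDecimalToDouble(390425320000) compares equal to 3904.2532.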

Both Java toString and C++ std::sprintf produce the same result for me.

Here is the precise step for Java: https://docs.oracle.com/javase/8/docs/api/java/lang/Float.html#toString-float-

It is quite radically different from std::to_string: the former does something similar to Dragonbox, while the latter does direct rounding into a decimal number with a fixed precision.

Whole is max 5 digits, fraction is always 2, sometimes 5.

Which contradicts your previous description ("The fixed decimal is an 8-byte int, with 8 digits after the point")?

Also, keep in mind that I don't know if applying Dragonbox to a float and then converting the result back to double will always do the right thing. I haven't thought about it thoroughly yet; I guess it might produce a wrong result for very small inputs, for example.

scherepanov commented 2 years ago

First, big thanks for thinking about my use case.

I have a very small data range. The source data representation is an 8-byte int with 8 digits after the decimal point. But 99.99999% of stock prices have two digits after the decimal point, and one, two or three (sometimes four) digits before it.

So a price has at most 6 significant digits, and the remaining 6 decimal places are padded with zeros (000000) to fit the fixed decimal format.

That small data range makes pretty much any algorithm work correctly.

Yes, I did tests with more digits, and with a lot of digits there are problems. Very fortunately, I do not have this case in real-life data.

Converting chars -> floats and back with a round-trip guarantee unexpectedly turns out to be an extremely complicated problem. I have a faint memory of studying the subject at university, and did not expect it to be so difficult. Most people do not even know this problem exists.

Thanks for help again!

jk-jeon commented 2 years ago

I agree that handling floating-point numbers in general, especially when things need to be 100% correct, is indeed a big headache, and is often just practically impossible. Given the limited amount of information I still don't know what exactly you wish to achieve, but I'm glad things work now. Have a nice one!

scherepanov commented 2 years ago

Here is more background information on the case. Users generate 4-byte floats. The database supports only the 8-byte double data type. The production data load is done by converting the 4-byte float to a string in CSV format on the client, using Java toString. The CSV file is sent to the database, where it is parsed by the internal built-in CSV parser. The internal parser converts the string to double and places it in a table column (loads the data).

Example: the table column spotref is loaded with the value 3904.2532. It is actually questionable what is really stored in the database, but all client tools show the string representation as 3904.2532. Typically client tools get data from the database in binary format and convert it to a string representation for display on screen. The client tool is Java-based; I assume internally it calls Java toString on the double and shows the resulting string on the display.

Alternatively, I can request conversion to string on the database server: select to_char(spotref) .... In this case the database is C++, and the value is converted using C++ to_string. The result is the same: I see 3904.2532 on the client.

The load process is inefficient, as it requires converting float -> string in the Java client, and string -> double in the internal built-in CSV parser on the database.

My exercise in loader performance improvement is to switch to a binary format, and send the original unchanged 4-byte float binary value to the database. My custom C++ parser on the database side picks up the float, does a naive static_cast<double>(float), and places the resulting double into the table column.

As expected, users started seeing garbage digits at the end of the value: 3904.253173828125 instead of 3904.2532. Most importantly, filtering in the WHERE clause stopped working. Before, users could write WHERE spotref = 3904.2532 in SQL and it matched all rows as expected. With the binary load and the naive float -> double conversion with static_cast, it does not work: WHERE spotref = 3904.2532 returns zero rows. That is a show stopper for users.
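The mismatch described above is easy to reproduce in isolation. Widening a float with static_cast keeps the float's exact binary value, which is in general not the double nearest to the decimal number the user typed. A minimal sketch (the function name widen is made up for this illustration):

```cpp
// Widening preserves the float's exact binary value; it does not re-round
// to the double that is nearest to the original decimal literal.
double widen(float f) {
    return static_cast<double>(f);
}
```

Here widen(3904.2532f) is exactly 3904.253173828125, so the comparison widen(3904.2532f) == 3904.2532 is false, and likewise widen(10.1f) == 10.1 is false. Only values that are exactly representable in binary, such as 0.5, survive the widening unchanged.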

I will do one more post with the code I wrote using dragonbox and fast_float. I am extracting the decimal from the float using dragonbox, and wrote a new procedure in fast_float that accepts a decimal and generates a double.

Despite my excitement last night, my code is not working correctly, so I will be rewriting it today. I will post here when done.

Hopefully this gives you complete insight into the use case.

scherepanov commented 2 years ago

Here is code I am using to convert float -> double:

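    #include "dragonbox/dragonbox_to_chars.h"
    #include "fast_float/fast_float.h"

    double floatToDouble(float f) {
        constexpr int buffer_length = 1 + // for '\0'
            jkj::dragonbox::max_output_string_length<jkj::dragonbox::ieee754_binary32>;
        char buffer[buffer_length];
        // Shortest round-trip decimal string for the float
        char* end_ptr = jkj::dragonbox::to_chars(f, buffer);
        double d;
        // Parse the same decimal string as a double
        fast_float::from_chars(buffer, end_ptr, d);
        return d;
    }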

Works fast and perfectly well! Thanks for nice and fast library!

With the above code, from the users' perspective, a load from the binary format produces the same data in the database as a load from strings. The WHERE clause in SQL also continues to function properly.
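For readers who want to try the same round trip without third-party dependencies, here is a sketch that swaps in the C++17 standard library: std::to_chars with no format argument also produces a shortest round-trip representation, and std::from_chars parses it back as a double. This is not the Dragonbox/fast_float code above, and it assumes a standard library that implements the floating-point overloads in <charconv> (for example a recent GCC or MSVC). The function name is made up for this illustration.

```cpp
#include <charconv>
#include <system_error>

// Standard-library sketch of the float -> shortest decimal -> double round trip.
double floatToDoubleStd(float f) {
    char buf[64];
    // Shortest decimal string that round-trips to the same float.
    auto [end, ec] = std::to_chars(buf, buf + sizeof buf, f);
    if (ec != std::errc{}) {
        return static_cast<double>(f);  // fall back to plain widening
    }
    double d = 0.0;
    // Parse that decimal string as a double instead of a float.
    auto parsed = std::from_chars(buf, end, d);
    if (parsed.ec != std::errc{}) {
        return static_cast<double>(f);
    }
    return d;
}
```

With this sketch, floatToDoubleStd(3904.2532f) == 3904.2532 and floatToDoubleStd(10.1f) == 10.1 both hold, which is the behavior the WHERE clause relies on.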

fast_float first converts chars to a decimal (exponent, mantissa, negative) and proceeds from there. I tried to use Dragonbox to_decimal, and wrote my own procedure in fast_float that skips chars parsing and uses (exponent, significand, is_negative) from Dragonbox. My naive approach segfaulted, as apparently fast_float collects more info while parsing chars and uses it later. I asked for help in fast_float. If I get it, that would be nice; if not, the current solution is already very fast.

Hopefully you now have the full picture of my use case and how I am trying to solve it.
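For the common case, skipping chars entirely may be simpler than patching a parser. The sketch below uses the classic exact "fast path": when the significand fits in 53 bits and the power of ten is at most 10^22, both are exactly representable as doubles, so a single multiplication or division gives the correctly rounded result. The helper is hypothetical and is not part of Dragonbox or fast_float. It assumes the decimal means significand × 10^exponent, that arithmetic is plain IEEE binary64, and that the caller falls back to the chars route when it returns no value.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper: converts significand * 10^exponent to double when the
// exact fast path applies, and returns std::nullopt otherwise.
std::optional<double> decimalToDouble(std::uint64_t significand, int exponent,
                                      bool is_negative) {
    // Powers of ten up to 10^22 are exactly representable as doubles.
    static constexpr double kPowersOfTen[] = {
        1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,  1e7,
        1e8,  1e9,  1e10, 1e11, 1e12, 1e13, 1e14, 1e15,
        1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22};
    constexpr std::uint64_t kMaxExact = std::uint64_t{1} << 53;

    if (significand > kMaxExact || exponent < -22 || exponent > 22) {
        return std::nullopt;  // outside the fast path; caller must fall back
    }
    double value = static_cast<double>(significand);  // exact conversion
    // One correctly rounded operation on two exact operands.
    value = exponent < 0 ? value / kPowersOfTen[-exponent]
                         : value * kPowersOfTen[exponent];
    return is_negative ? -value : value;
}
```

For example, decimalToDouble(39042532, -4, false) yields a value equal to 3904.2532. The shortest decimal of a float has at most 9 significant digits, so for this use case only the exponent range check can ever trigger the fallback.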

Thanks for your time!