jk-jeon / dragonbox

Reference implementation of Dragonbox in C++
Apache License 2.0
588 stars 37 forks source link

JSON compliant to_chars? #65

Closed stephenberry closed 1 month ago

stephenberry commented 1 month ago

I'd like to use dragonbox for serializing floats in Glaze. What I'm particularly looking for is a JSON compliant to_chars implementation that has the shortest string result. It seems like dragonbox's to_chars will print out ...E0 when it is not necessary.

The other requirement for JSON compliance is that infinity and nan should be printed as "null".

Could options to support these features be added to dragonbox or should I fork the code and edit it myself?

I'd also like the to_chars to be header only, but I don't mind doing this conversion myself and keeping it up to date with dragonbox.

Thanks for all the hard work you've put into this great library!

stephenberry commented 1 month ago

Also, a number like 10.1 right now prints as 1.01E1, which adds 2 additional characters to the output, and it is harder for users to read.

I'm curious if you know what the performance impact of supporting this cleaner output would be?

jk-jeon commented 1 month ago

Hi, thanks a lot for your interest!

Since this is a duplicate of https://github.com/jk-jeon/dragonbox/issues/35 I'm closing this, but please feel free to discuss further in this thread (or in https://github.com/jk-jeon/dragonbox/issues/35) if you want.

To answer your specific questions:

It seems like dragonbox's to_chars will print out ...E0 when it is not necessary. Also, a number like 10.1 right now prints as 1.01E1

You're right, please read my answer in https://github.com/jk-jeon/dragonbox/issues/35.

Could options to support these features be added to dragonbox or should I fork the code and edit it myself?

Maybe in the future, but not planned at this moment. PR is welcome though!

Ideally, a hypothetical to_chars should do what std::to_chars do (except for the bound-checking API which I dislike) , i.e., to print a string with the least number of characters. But there is a nontrivial issue here: there are cases when fixed-point notation yields smaller number of characters, but the number of digits that will appear in the integer part of the fixed-point output exceeds the number of digits in the decimal significand that Dragonbox finds out. The simplest solution is to fill out missing digits with 0, which retains the roundtrip gurantee and the shortness guarantee, but it breaks the correct rounding guarantee. So for that case we must do something else, and I did not yet really thought carefully about what could be a possible solution. Please see https://github.com/fmtlib/fmt/issues/3649 for more info.

I'm curious if you know what the performance impact of supporting this cleaner output would be?

I'm not sure. I will not be surprised either way. I think I saw supporting evidences in both ways.

I'd also like the to_chars to be header only,

To be absolutely honest, I really want to move away from header-only library 😃 constexpr is the only reason why I keep to_decimal header-only.

stephenberry commented 1 month ago

Thanks for your helpful reply. I will probably develop a to_chars implementation in Glaze using dragonbox::to_decimal under the hood. This can be an incubator for a potential pull request in the future. I just wanted to make sure I wasn't duplicating efforts.

I appreciate the prompt reply!

jk-jeon commented 1 month ago

Thank you so much!

stephenberry commented 1 month ago

Using modified code from yyjson and Glaze I was able to incorporate dragonbox and serialize in the desired JSON format. Here's the current implementation: https://github.com/stephenberry/glaze/blob/main/include/glaze/util/dtoa.hpp

I'm sure the performance can be improved, but it is quite efficient in its current form. Thanks for your library, it made this very straightforward!

jk-jeon commented 1 month ago

Here's the current implementation:

Didn't check all the details but looks pretty good to me. Just have a couple of comments:

https://github.com/stephenberry/glaze/blob/659bd8d2c5880d758162329c7d9e540f75f26c2c/include/glaze/util/dtoa.hpp#L125

As far as I can see this function does not remove trailing zeros, but is named as _trim. I initially supposed it's named as such because it removes them. I guess maybe it's better to call it _no_trim to contrast it to the one above it, unless there are other reasons to call it as _trim?

https://github.com/stephenberry/glaze/blob/659bd8d2c5880d758162329c7d9e540f75f26c2c/include/glaze/util/dtoa.hpp#L247

For IEEE-754 binary32, the exponent is at most of two decimal digits so you don't need this branching.

Your implementation does have the correct rounding issue I mentioned, but of course it is not necessarily a real issue depending on your goal. IIRC the JSON spec doesn't really mandate correct rounding, just roundtrip guarantee, but not sure if I'm remembering correctly. In any case I think it's good to be aware of this.

Thanks for your library, it made this very straightforward!

Really glad my work is being integrated into such a great library!

Also I love those small branchless tricks. Reminds me that I'm not really good at micro-optimizations lol

stephenberry commented 1 month ago

As far as I can see this function does not remove trailing zeros, but is named as _trim.

Yeah, I had at first intended to trim and then realized it was simpler to use dragonbox's trimming. I updated the name.

For IEEE-754 binary32, the exponent is at most of two decimal digits...

Thanks for this observation.

IIRC the JSON spec doesn't really mandate correct rounding...

Round-tripping and shortness are the primary goals. So, I also think this is okay.

Also, I had to make a fix for denormalized float writing as JSON requires a 0 before the exponent. E.G. -1.E-47 must be written as "-1.0E-47"

jk-jeon commented 1 month ago

IIRC the JSON spec doesn't really mandate correct rounding, just roundtrip guarantee, but not sure if I'm remembering correctly.

So I looked it up, and it seems the spec is very blurry in this respect, and basically allows the implementations to do whatever they want. So.... whatever.