ulfjack / ryu

Converts floating point numbers to decimal strings
Apache License 2.0

How do we get this into Boost? #111

Open vinniefalco opened 5 years ago

vinniefalco commented 5 years ago

We were discussing how to get this into Boost and an obvious solution came up: ask @StephanTLavavej! We could use something header-only, with the option of separate compilation. I was working on a port but stopped, because the changes needed to support this use case would be extensive, and this project looks like it is still in active development and might see considerably more commits that I would then have to port to my C++-mangled header-only solution.

The routines I ultimately need are:

// this is user-facing
struct number
{
    unsigned long long mantissa;
    short exponent;
    bool sign;
};

number double_to_number(double);
char* number_to_chars(char* buf, number);

And of course an implementation of C++17's to_chars (so that users of older C++ versions have access to it).
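
A hypothetical usage sketch, assuming the decomposition is Ryu's shortest decimal form (so that mantissa × 10^exponent reconstructs the value):

number n = double_to_number(1.5);
// under that assumption: n.mantissa == 15, n.exponent == -1, n.sign == false

char buf[32];
// presumably writes the digits into buf and returns a pointer past the last character written
char* end = number_to_chars(buf, n);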

There are a number of Boost libraries that would benefit from this code such as lexical_cast and a few others.

ulfjack commented 5 years ago

I would like to refactor the code to provide a more consistent set of interfaces, including a low-level interface like what you describe. It would also be nice to have smaller lookup tables and to reduce the overlap between the existing tables, but I'm not sure how much of that will be possible. Finally, I want to add string2float and string2double parsers based on similar principles - I have a design in my head but haven't had the time to implement it.

Personally, I think header-only libraries are a mistake. I understand why they're attractive, but they seem undesirable from a build performance perspective.

That said, if there's a way I can make it easier for you and @StephanTLavavej to construct header-only libraries in a way that allows you to easily pull updates from here, that works for me. It's more important to make it easy for you to use the code than to insist on such principles (but I had to say it anyway :-p).

vinniefalco commented 5 years ago

Personally, I think header-only libraries are a mistake

You have to read my weasel-words carefully: "...with the option of separate compilation"

On paper header-only is bad, I agree, but when it comes to obtaining the most valuable currency in the universe (i.e. GitHub stars), a header-only library presents a lower barrier to entry for newbies, especially if they are not experts at configuring their build system.

For seasoned users or when the newbies manage to create a large program which starts to take too long to compile, they can define a macro such as BOOST_ASIO_SEPARATE_COMPILATION or BOOST_BEAST_SEPARATE_COMPILATION, and then inside one of their .cpp files include a special header which causes function definitions to be compiled into a separate translation unit. This is the approach used in Beast, and we got a nice improvement in build performance, without losing the initial appeal of header-only. And also without tying it to messy build scripts. Example:

Public-facing header file declares the necessary complete types: https://github.com/vinniefalco/BeastLounge/blob/6a8ccbbb1967d172bbb234a82abb06f953db721c/include/boost/beast/_experimental/json/number.hpp

Functions use the macro BOOST_BEAST_DECL as a qualifier, this macro will substitute inline or nothing, depending on the build mode: https://github.com/vinniefalco/BeastLounge/blob/6a8ccbbb1967d172bbb234a82abb06f953db721c/include/boost/beast/_experimental/json/number.hpp#L145

Then at the bottom of the file we conditionally include the function definitions: https://github.com/vinniefalco/BeastLounge/blob/6a8ccbbb1967d172bbb234a82abb06f953db721c/include/boost/beast/_experimental/json/number.hpp#L222

Users who want the separate compilation simply define BOOST_BEAST_SEPARATE_COMPILATION and then include the file <boost/beast/src_extra.hpp> in any one of their existing or new .cpp files: https://github.com/vinniefalco/BeastLounge/blob/develop/include/boost/beast/src_extra.hpp#L33 https://github.com/vinniefalco/BeastLounge/blob/develop/test/lib_beast_extra.cpp#L12
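
A minimal sketch of that pattern (file and macro names here are hypothetical, not the actual Beast spellings):

// config.hpp
#ifdef EXAMPLE_SEPARATE_COMPILATION
# define EXAMPLE_DECL            // definitions live in one separate TU
#else
# define EXAMPLE_DECL inline     // header-only: definitions in every TU
#endif

// number.hpp
EXAMPLE_DECL char* format_number(char* dest, double value) noexcept;

#ifndef EXAMPLE_SEPARATE_COMPILATION
# include "impl/number.ipp"      // pulled in automatically in header-only mode
#endif

// impl/number.ipp
EXAMPLE_DECL char* format_number(char* dest, double value) noexcept
{
    // ...the Ryu-based implementation would go here...
    return dest;
}

// src_extra.hpp -- included in exactly one .cpp when
// EXAMPLE_SEPARATE_COMPILATION is defined
#include "impl/number.ipp"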

It really is all about choice: this brings the best of both worlds while sacrificing nothing (well, a little bit of extra effort to organize the headers in a particular way).

StephanTLavavej commented 5 years ago

The necessary changes for to_chars are significant - I've been working on floating-point to_chars for almost a year (shortly after adding from_chars on May 23, 2018). You can see my work-in-progress at https://github.com/StephanTLavavej/ryu/tree/msvc-latest . Warning: this branch is volatile, being frequently rebased on master and force-pushed to rewrite history. Adapting upstream Ryu to the charconv interface has required a relatively small amount of work for bounds-checking and return types, but a large amount of work for fixed notation. There's also a bit of logic needed for general and "plain" format-switching. Completely unrelated code is needed for shortest and precision hexfloats. Then another batch of work is needed for decimal precision (followed by decimal precision general, which I am just about to start).

My derived implementation is header-only, despite the fact that we ship separately compiled DLLs/LIBs, partly because adding separately compiled code is surprisingly onerous for us (due to binary compatibility concerns; we can do it as we did for Boost.Math and Filesystem, but it's a big deal), but mostly so that the large static tables are an opt-in cost. The compiler throughput implications of the roughly half-megabyte of source code and static tables are significant, but my feeling is that this can be easily avoided by dragging <charconv> into a single translation unit which wraps to_chars() for higher-level consumption. For lexical_cast the concern would be greater (but on the other hand, it currently drags in iostreams).
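
For illustration, a minimal sketch of such a wrapper translation unit (names hypothetical):

// double_to_string.h -- the only declaration other TUs need to see
#pragma once
#include <string>

std::string double_to_string(double value);

// double_to_string.cpp -- the single TU that pays the <charconv> cost
#include "double_to_string.h"
#include <charconv>

std::string double_to_string(double value)
{
    char buf[64];
    auto result = std::to_chars(buf, buf + sizeof buf, value);
    return std::string(buf, result.ptr);
}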

Currently, the code that I've directly derived from Ryu (shipping in <xcharconv_ryu.h>) is Boost-licensed as it must be, but the rest of the charconv machinery (shipping in <charconv> and <xcharconv.h>, including from_chars() derived from our CRT) is not Boost-licensed (instead it is covered by the usual VS EULA). We are interested in open-sourcing all of the charconv implementation, potentially in a more standalone form that could be directly consumed by Boost or other Standard Library implementations with minimal changes, after I manage to complete the code.

Note that my implementation has one potentially significant limitation - it handles only 32-bit float and 64-bit double, as MSVC doesn't have a wider long double. That is, I haven't touched Ulf's https://github.com/ulfjack/ryu/blob/master/ryu/generic_128.c at all (nor is from_chars prepared to deal with larger types, although it is generic enough for its constants to be re-tuned).

That said, if there's a way I can make it easier for you and @StephanTLavavej to construct header-only libraries in a way that allows you to easily pull updates from here, that works for me.

Due to how I'm rebasing massively invasive changes on top of your master (converting each identifier to __ugly is mechanical and the least of the worries, actually), my main concern at the moment is with any large-scale code rearrangement that I need to keep up with. In principle I can digest anything, and I really want to avoid permanent divergence from upstream (hence my contributions of almost everything non-C++, non-MSVC, non-STL specific).

I guess I would start by asking: Vinnie, could we simplify this problem by simply layering Boost on top of std::to_chars()? If you need something outside of what to_chars() provides, that argues for changing the Standard specification. Downlevel support (older toolsets, or pre-C++17 modes) could be addressed by a standalone charconv implementation, or minor vendor extensions (e.g. we currently block charconv in C++14 mode, but we could just as easily permit it).

There is also the possibly-C++20 fmt library coming, which will also be layered on top of charconv (somewhat invasively, AFAICT). Is there a reason for Boost to duplicate that work?

vinniefalco commented 5 years ago

I do not understand what is being asked. Are you suggesting that Boost should build an implementation on top of std::to_chars which requires C++17? Or that pre-C++17 modes would only work with versions of MSVC which don't "block charconv in C++14 mode"? What happens with the older versions of libc++ and libstdc++?

The purpose of duplicating the work in Boost is to make it universally available (all compilers and stdlibs that Boost targets).

StephanTLavavej commented 5 years ago

Are you suggesting that Boost should build an implementation on top of std::to_chars which requires C++17?

Yeah. With to_chars possibly being provided as a "bolt-on" standalone implementation.

I'd like to know how many users are stuck on old toolsets or old Standard modes and can't easily upgrade, yet have access to the latest Boost. In MSVC we've done a lot of work to address the former (maintaining bincompat in the 2015/2017/2019 release series and updating far more frequently, so users are continuously on the latest release).

vinniefalco commented 5 years ago

I'd like to know how many users are stuck on old toolsets or old Standard modes and can't easily upgrade, yet have access to the latest Boost.

At least one user: me. Beast only requires C++11. I need all the GitHub stars I can get, and requiring C++17 is for the most part a death sentence with respect to obtaining a large user base, unless your library is so incredibly useful that people will jump through any hoop in order to use it (I'm thinking TensorFlow here). Most libraries are not like that, so we have to support C++11, which is still quite popular according to this developer survey: https://www.jetbrains.com/research/devecosystem-2018/cpp/

and this developer survey: https://isocpp.org/files/papers/CppDevSurvey-2018-02-summary.pdf

I will probably fork Beast to require a newer C++ some time in 2021 when the toolchains have settled down. 10 years is enough.

robhz786 commented 5 years ago

Hi all,

I have the same need as Vinnie; I probably just need it more. (I'm working on a formatting library that I hope to propose to Boost this year [1].) Just a few days ago I started to study how that could be achieved, and then I saw this conversation.

Afaics, a header-only version of ryu would not be feasible in C.

The main reason is that it would be necessary to define all functions and all global variables as static, implying a copy in each translation unit during the build. That would be unacceptable given the size of those functions and especially of the lookup tables Ryu has. In C++ this is easily solved by defining those tables as static arrays inside inline functions; the linker then ensures that there is only one definition of each of them in the program. In C, however, inline functions don't work that way. [2]
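
A quick sketch of that C++ idiom (table contents illustrative only):

#include <cstdint>

// One definition program-wide, regardless of how many TUs include this header.
inline const std::uint64_t* pow5_table() noexcept
{
    static const std::uint64_t table[] = { 1u, 5u, 25u, 125u, 625u };
    return table;
}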

The other issue is that a header-only C library would expose all of its names in the global namespace.

So what I was considering is to enable the current code to work in two modes: as a compiled C library and as a header-only C++ library. This implies adding some more #ifdef __cplusplus blocks and creating some C++ headers like ryu_header_only.hpp.

I think the C++ header-only API should:

This seems possible. If no one is interested in doing that, I will give it a try.

Btw, I have a version of the double_to_number function proposed by Vinnie here: https://github.com/robhz786/mini-cpp-ryu


[1] https://github.com/robhz786/stringify
[2] https://blogs.oracle.com/d/inline-functions-in-c and https://en.wikipedia.org/wiki/Inline_function#C99

vinniefalco commented 5 years ago

Some thoughts on the above. I think, for the C++ version to be palatable, it needs to support "separate compilation" as I described in my earlier reply (https://github.com/ulfjack/ryu/issues/111#issuecomment-477373694). This way the user has a choice between the convenience of header-only and the performance of separate compilation.

The main challenge of a C++ port will be to keep the C and C++ versions in sync, as it is clear that there is considerable active development going on and no promise of code stability. For the tables, one way is to make a header file that contains only the table values, which can then be included inside a function:

inline auto get_fixed_tables() noexcept
{
    // "fixed_tables_variable.hpp" would define a function-local static
    // array named POW10_SPLIT containing only the table values.
    #include "fixed_tables_variable.hpp"

    return POW10_SPLIT;
}

With respect to DOUBLE_MANTISSA_BITS, as long as it is properly named (e.g. BOOST_RYU_DOUBLE_MANTISSA_BITS), it really isn't a big deal to leave it as a macro, because the value is only an implementation detail and not a public interface. It has the same performance as a constexpr value, the name has no conflicts, and if the upstream repository adopts the name change then we have the benefit of a smaller amount of forked header material.
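
To make the comparison concrete (the prefixed name is expository, as noted; 52 is the IEEE-754 double mantissa width):

// prefixed macro form:
#define BOOST_RYU_DOUBLE_MANTISSA_BITS 52

// constexpr alternative with the same performance:
constexpr int DOUBLE_MANTISSA_BITS = 52;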

robhz786 commented 5 years ago

Perhaps the non-header-only C++ version could be just a thin wrapper over the C API. For example, ryu.h could look like:

#ifdef __cplusplus

extern "C" int ryu_d2s_buffered_n(double f, char* result);
extern "C" void ryu_d2s_buffered(double f, char* result);
extern "C" char* ryu_d2s(double f);
/* etc ...*/     

RYU_NAMESPACE_BEGIN

inline int d2s_buffered_n(double f, char* result) {
    return ::ryu_d2s_buffered_n(f, result);
}
inline void d2s_buffered(double f, char* result) {
    ::ryu_d2s_buffered(f, result);
}
inline char* d2s(double f) {
    return ::ryu_d2s(f);
}
// etc ...

RYU_NAMESPACE_END

#else /* #ifdef __cplusplus */

int ryu_d2s_buffered_n(double f, char* result);
void ryu_d2s_buffered(double f, char* result);
char* ryu_d2s(double f);
// etc ...    

#endif /* #ifdef __cplusplus #else */

ulfjack commented 5 years ago

It would be weird to have symbols starting with BOOST_ checked in here, and @StephanTLavavej might be unhappy about that, too. That said, I could be convinced to maintain a list of symbols and a customizable renaming script if that helps.

Using the same header for C and C++ and having it expose different APIs depending on context seems ... not ideal to me. I'd rather have separate headers if that's not too much of a problem.

vinniefalco commented 5 years ago

It would be weird to have symbols starting with BOOST_ checked in here

Agreed - the name I used was expository only. It could just as easily be prefixed with RYU_ULFJACK_. Such a prefix is virtually guaranteed not to have conflicts, unlike DOUBLE_MANTISSA_BITS.

StephanTLavavej commented 5 years ago

@robhz786

In C++ that would be easily solved by defining those tables as static arrays inside inline functions.

I take advantage of C++17 by leaving the tables at file scope and marking them as inline constexpr.
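
That is, roughly (values illustrative, not the real table):

#include <cstdint>

// C++17: a single definition program-wide, usable in constant expressions.
inline constexpr std::uint16_t POW10_OFFSET_EXAMPLE[] = { 0, 2, 5, 8, 12 };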

include C++ instead of C standard headers

It's perfectly fine for C++ code to include <stdint.h> instead of <cstdint>. (Vice versa is also perfectly fine.) There is no namespace cleanliness achieved by including <cstdint>, because implementations are still permitted to define the names in the global namespace and then drag them into namespace std, which is exactly what MSVC's implementation does.

@vinniefalco

The main challenge of a C++ port will be to keep the C and C++ versions in sync, as it is clear that there is considerable active development going on and no promise of code stability.

I'm achieving this by continual rebasing, although that probably isn't something that other people want to do, and unfortunately the result can't be ingested by non-Standard-Library-implementers (as all names are ugly, I assume C++17, etc.).

What are the other language ports doing - effectively perma-forking and not picking up upstream improvements?

@ulfjack

and @StephanTLavavej might be unhappy about that, too.

Kinda - it would be weird but it wouldn't be very problematic. I mechanically rename everything to prepend double underscores, so treating names specially is a minor complication, but not a major one. The only names that I currently special-case are umul128 and shiftright128, since adding double underscores creates collisions and near-collisions with the intrinsic names. If BOOST names appeared, I would either rename them away (likely) or let the addition of double underscores prevent any conflicts.

Renames to make the macro constants more collision-resistant by adding RYU prefixes etc. would be welcome. (I additionally change them to be inline constexpr variables instead of macros, which doesn't affect usage.)

robhz786 commented 5 years ago

Hi all,

I created a pull request to enable header-only mode.

vinniefalco commented 4 years ago

I have made tremendous progress on a port of ryu to C++, which can either be compiled as a separate static or dynamic library (the default) or be used header-only (by setting a macro).

ulfjack commented 4 years ago

Cool! Do you have that in a place that we could look at? Also, are you planning to upstream any of your changes?

vinniefalco commented 4 years ago

Yes: https://github.com/vinniefalco/json/tree/develop/include/boost/json/detail/ryu

This is in detail/ because it is not a public API. I think I could have done it with fewer diffs (e.g. keeping stdint.h instead of cstdint).

Unfortunately I can't submit these changes to the upstream. Header-only C++ is fundamentally incompatible with C.

vinniefalco commented 4 years ago

Here's an example of the difference, taken from d2fixed_full_table.h. The original table declaration:

static const uint16_t POW10_OFFSET[TABLE_SIZE] = {
  0, 2, 5, 8, 12, 16, 21, 26, 32, 39,
...

Here is the C++ declaration:

inline
uint16_t const
(&POW10_OFFSET() noexcept)[64]
{
    static constexpr uint16_t arr[64] = {
           0,   2,   5,   8,  12,  16,  21,  26,  32,  39,
...

In theory this could be accomplished with a macro, although in C++ the table is accessed by calling a function, e.g. POW10_OFFSET()[3]. We could add another macro for that. So maybe something like:

#ifdef __cplusplus
/* variadic, so comma-separated initializer values pass through the macro */
#define DECLARE_TABLE( name, type, size, ... ) \
    inline type const (&name() noexcept)[size] { \
        static constexpr type arr[size] = { __VA_ARGS__ }; return arr; }
#else
#define DECLARE_TABLE( name, type, size, ... ) \
    static const type name[size] = { __VA_ARGS__ };
#endif
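
Usage would then look roughly like this (values truncated for illustration):

DECLARE_TABLE( POW10_OFFSET, uint16_t, 64, 0, 2, 5, 8, 12, 16, 21, 26, 32, 39 )

/* C++ access: POW10_OFFSET()[3]; C access: POW10_OFFSET[3] */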

Could this work?

ulfjack commented 4 years ago

Yes, declaring macros for table definition and access sounds reasonable.