Closed AlbertoEAF closed 3 years ago
Speaking about float2str conversion in saving models, XGBoost recently adopted Ryū algorithm (https://github.com/dmlc/xgboost/pull/5772) and numpy uses Dragon4 algorithm (https://github.com/numpy/numpy/pull/9941).
Before I commit to working on this, can someone point me to the files where the model load/save code is found?
The fix we had posted in the past https://github.com/microsoft/LightGBM/pull/2891/files required changes in: LGBM_BoosterCreateFromModelfile, LGBM_BoosterSaveModel, LGBM_BoosterSaveModelToString and LGBM_BoosterDumpModel in the C API.
These new implementations will require changes to the core code, which means fixes must be made in the functions called therein, namely Booster::Booster, Booster::SaveModelToFile, Booster::SaveModelToString and Booster::DumpModel.
Reading model files seems to be implemented in boosting_->LoadModelFromString(model_str, len), which leads me to GBDT::LoadModelFromString in gbdt_model_text.cpp.
In turn, that uses both ArrayToString and ArrayToStringFast (I don't know why we have and use both to save the model - legacy?). To write values, Common::DoubleToStr is used, which relies on snprintf (which is not locale-independent).
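To make the locale issue concrete, here is a minimal sketch (the helper name is hypothetical, not LightGBM's actual code) of why snprintf-based formatting is fragile: `%g`/`%f` honor the process-wide LC_NUMERIC locale, so a host runtime (e.g. a JVM) that installs a European locale makes the decimal separator a comma:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Sketch: snprintf("%.17g") honors the process-wide locale, so the
// decimal separator this emits depends on whatever LC_NUMERIC the host
// process (e.g. a JVM) has set globally.
std::string SnprintfDouble(double value) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%.17g", value);
  return buf;
}
```

Under the default "C" locale this yields "0.5" for 0.5; after a setlocale(LC_NUMERIC, ...) to e.g. a German locale, the same call would yield "0,5", which corrupts model files.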
I got lost on the virtual function pointers, though; I was expecting to see DART and the other boosters implement model read/write. Can I thus assume model text file reads/writes are implemented in gbdt_model_text.cpp's methods alone? Or do the other boosters use gbdt_model_text.cpp as well to implement the model file reads/writes? +1 ping for @guolinke as you wrote most of that code.
Thank you all :)
Gently ping @guolinke for two questions above about the code location.
sorry for missing this thread... @AlbertoEAF
I remember ArrayToStringFast is only for integers. So for model_to_string, changing DoubleToStr should be enough. As for string_to_model, there are several functions, like https://github.com/microsoft/LightGBM/blob/adfc9f5c61ad3d11b9768ed0b9104297434ce822/include/LightGBM/utils/common.h#L513-L518 and https://github.com/microsoft/LightGBM/blob/adfc9f5c61ad3d11b9768ed0b9104297434ce822/include/LightGBM/utils/common.h#L566-L574
yeah, all models share the same model input/output codes.
Thanks so much! I will start working on it ;)
By the way @StrikerRUS, regarding such libraries, I think I'll go with Ryū; I've been reading up a bit on it. Both came after David Gay's implementations, and Ryū came after Dragon4. Dragon4 has a fallback algorithm for some numbers which is slower. I also saw even faster implementations, but I don't want to get too close to the edge, as this is supposed to be stable.
I was thinking of using this C++ implementation of Ryū, https://github.com/expnkx/fast_io, as it provides a single header to include, but it requires C++20 while LightGBM is C++11 only. This leaves the original implementation with a C interface: https://github.com/ulfjack/ryu. I hope it's easy - and possible - to add to our build chain; I still have to figure that out.
@AlbertoEAF BTW, is it possible to have a two-stage solution? Like, have a script to localize/de-localize the model? LightGBM itself could always read/write in the current format, but when users want to read the content of a model file, we could localize it.
Hello @guolinke , I'm not sure what you mean. Why would we want to transform the model content to different locales? The model should behave like a "binary blob" which LightGBM can read and write correctly independently of the machine settings. It's cleaner to have it store and load always with the same format, we don't need models in different locales. Do you agree? :)
Also, I believe it would be even trickier: you would have to know precisely the locale with which the model was written in order to read it correctly, which can be hard to determine in some corner cases.
Finally, as LightGBM itself does not allow "changing locale" during its execution, we would have to create an external tool, and that tool would need to parse the model file structure to read it and then rewrite it with exactly the same structure - lots of code duplication and complexity.
@AlbertoEAF I misunderstood the locale problem. you are right, the model should be locale-independent.
Hello, I have 2 requests for input @StrikerRUS @guolinke:
Maybe there were changes in the meantime, because in the v3.0.0 code both ArrayToString and ArrayToStringFast can operate on doubles: ArrayToString only handles doubles, while ArrayToStringFast is a template that processes arrays of any type. I believe there's no reason to keep both anymore.
Second point: instead of using Ryū, I was thinking of using fmt (https://fmt.dev/latest/index.html), as its integration is much easier and it is still ~10x faster than the standard library in its current state. Besides being the foundation of C++20's new std::format library, it is also very well maintained, uses only C++11 features, and has explicit support for GCC 4.8 (https://fmt.dev/latest/index.html). What do you think? It also seems much easier to use than Ryū, and in this thread they say it's not much different in terms of speed: https://github.com/fmtlib/fmt/issues/1426. Update: I forgot about the reading part. I found scnlib on GitHub, which has a similar API, but it's not stable.
Thanks!
Wow, I replaced the ArrayToString calls for threshold, leaf_value and leaf_weight with ArrayToStringFast in Tree::ToString() (src/io/tree.cpp) and got surprising results. To test, I read a reference model file and wrote it back (left = v3.0.0 code, right = modified code):
Whilst ArrayToStringFast produces numbers with less precision, which saves space, the tree sizes also changed. Why is that?!
@AlbertoEAF I'm totally OK with using the fmt library! It looks quite popular based on the number of GitHub stars, and we can use it as a git submodule (like Boost's compute).
Hah, just noticed that the treelite project is using the fmt library. And treelite in turn is used in rapidsai's project for tree models:
https://github.com/dmlc/treelite/blob/1c495c52a6dff8413f1b66f2f0ecc4e390473c5e/cmake/ExternalLibs.cmake#L10-L14
https://github.com/rapidsai/cuml
Yeah @StrikerRUS, even integer-to-string conversions might benefit from it, not only floating-point conversions with shortest representation. LightGBM already has a custom Atoi function, built to improve performance I believe, but it seems not even that will beat this:
@guolinke @StrikerRUS I've hit an issue with the LightGBM model (the same as the one above with the image) which I don't understand and really need your help with. Any ideas about what might be causing this are greatly appreciated!
When I switch the call that turns floating-point parameters into strings in the model write, I sometimes get numbers with a redundant trailing 0 in the scientific notation, which shouldn't matter - I just posted a bug at fmtlib: https://github.com/fmtlib/fmt/issues/1873. See it to understand what I mean.
The big issue, though, is that for each of the two trees where one parameter changed, that tree's tree_size in the model grew by +1 as well! Any ideas?!
Diffs in the output model files produced by LightGBM. Each image shows the line difference between the "unpatched"/normal LightGBM code (in red) and the patched code (in green), followed by the differences in items from those lines, also in red/green:
In this case the model written with the patched code results in two trees where one parameter has a trailing 0, which leads to these tree size changes:
and that's what I don't understand. Why do the tree sizes change because of the written floating-point parameters?
If you want to take a look at the code change that led to the issue reported here and in https://github.com/fmtlib/fmt/issues/1873, it is this one: https://github.com/feedzai/LightGBM/pull/2/commits/da3d8f2397d87fcb784c854564a830763a24048d.
The images above were generated with the two commits:
PRs in progress (still in very early development):
Thank you! Any input is appreciated!
@guolinke The issue described above could be the root cause of our previous issue, where the maintainers of the Julia package had to remove the tree_sizes field to load a model created on another machine: #3137.
To further support the "bug" hypothesis, I added code to strip trailing 0's from the decimal digits of numbers, and it now works properly - i.e., there are no differences in the output models! (This is the temporary hack I proposed in https://github.com/fmtlib/fmt/issues/1873.)
This commit https://github.com/feedzai/LightGBM/pull/2/commits/c34c43da599b1bf60884d3576970103c26cc6204 adds that code (it is not a permanent solution, as it's a hack that reduces speed and is not "pretty").
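For reference, the stripping hack can be sketched roughly like this (a hypothetical helper, not the exact code from the commit): it removes redundant trailing zeros from the mantissa of a number in scientific notation, e.g. "1.50e-05" becomes "1.5e-05".

```cpp
#include <string>

// Sketch (hypothetical helper): strip redundant trailing zeros from the
// mantissa of a number in scientific notation, e.g. "1.50e-05" -> "1.5e-05".
// Numbers without both a '.' and an 'e' are returned unchanged.
std::string StripTrailingZeros(std::string s) {
  const std::size_t e = s.find('e');
  if (e == std::string::npos || s.find('.') == std::string::npos) return s;
  std::size_t last = e - 1;
  while (s[last] == '0') --last;   // drop trailing zeros of the mantissa
  if (s[last] == '.') --last;      // drop a now-dangling decimal point
  return s.substr(0, last + 1) + s.substr(e);
}
```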
I was already able to replace all model write calls except DoubleToStr, again due to divergences between fmt and snprintf, this time related to the output format using fixed precision (%.17g). However, there's a hack to work around this too until it's fixed in the library: truncating the output string to the maximum number of significant digits: https://github.com/fmtlib/fmt/issues/1875
@AlbertoEAF the tree_size is the string size of the tree model content. We use it for multi-threaded loading of tree models.
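In other words, tree_sizes records the serialized length of each tree block so a loader can jump straight to each tree's offset and parse trees in parallel; a one-character formatting change therefore bumps the recorded size by one. A minimal sketch with hypothetical names:

```cpp
#include <string>
#include <vector>

// Sketch (hypothetical names): each entry is the byte length of one
// serialized tree block, letting a loader split the model text into
// per-tree chunks and parse them on separate threads. Any formatting
// change that adds a character (e.g. a trailing '0') changes the size.
std::vector<std::size_t> ComputeTreeSizes(const std::vector<std::string>& tree_blocks) {
  std::vector<std::size_t> sizes;
  for (const auto& block : tree_blocks) sizes.push_back(block.size());
  return sizes;
}
```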
Ah! So it's not a bug! I get it now - it shouldn't be a problem then, thanks!
Hello @guolinke @StrikerRUS, I have good news: I've completed the code changes and they're passing all the tests!
Instead of relying on Ryū, I actually used fmt to write values to strings, as it provides similar performance and its API is much more in line with the rest of the LightGBM codebase. For string->double parsing I am using this library: https://github.com/lemire/fast_double_parser, which provides maximal floating-point string parsing resolution and matches the standard library's parsing output bit by bit.
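"Matches bit by bit" is a stricter claim than `==` equality: it compares the raw 64-bit patterns of the parsed doubles. A minimal sketch of such a check (hypothetical helper, assuming this is how the equivalence was verified):

```cpp
#include <cstring>

// Sketch: compare two doubles by their raw 64-bit patterns rather than by
// value. This also distinguishes results that compare equal with == but
// differ in representation, e.g. -0.0 vs 0.0.
bool BitwiseEqual(double a, double b) {
  return std::memcmp(&a, &b, sizeof(double)) == 0;
}
```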
Hence, transforming data to strings and string arrays should now be faster, and parsing strings to doubles with StringToArray should now be around 10x faster, according to their benchmarks.
I'd like to streamline the code further by merging Common's __StringToTHelperFast and __StringToTHelper; however, Common::Atof seems to diverge from the resolutions of the standard library's stringstreams and of fast_double_parser. Those two are equivalent to each other and provide the maximal parsing resolution.
Common::Atof string->double parsing diverges around the last 2 decimal places. See some examples:
It also seems to have less resolution:
As such, I had to keep __StringToTHelperFast and __StringToTHelper separate to maintain the current LightGBM behaviour without affecting read/write floating-point resolution.
> Common::Atof string->double parsing diverges around the last 2 decimal places. See some examples:
Looks like the difference is in the 18th+ digit in all examples. And according to IEEE 754, a double gives only 15-17 significant digits. So I believe it is OK to have inconsistent digits after the 17th place.
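A quick way to see this: two literals that differ only beyond the 17th significant digit parse to the exact same double. A minimal stdlib sketch (the helper name is illustrative):

```cpp
#include <cstdlib>

// Sketch: IEEE 754 doubles carry at most 17 significant decimal digits, so
// strings differing only past the 17th digit parse to the same bit pattern.
bool ParseToSameDouble(const char* a, const char* b) {
  return std::strtod(a, nullptr) == std::strtod(b, nullptr);
}
```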
Hmm, that's interesting! In that case we could replace Atof and further simplify the codebase :)
Thanks Nikita! I'll run some tests tomorrow and see if the scores don't change.
I remember that atof is used for precision-insensitive conversions (it is faster), like for split_gain. If the new solution is faster and more precise, we can switch to it.
Fixed via #3405. Thanks a lot @AlbertoEAF !
Thanks a lot for all the help @StrikerRUS @jameslamb ;)
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Hello,
Issue
With the revert of its PR, this issue became a concern again in Java: https://github.com/microsoft/LightGBM/issues/2890.
At the moment LightGBM uses strtod() and C++ streams to read/write the floating-point numbers in the model file. The issue is that Java changes the process locale away from the standard "C" locale at initialization (see https://github.com/microsoft/LightGBM/pull/2891). Current evidence shows this is true for Java providers that use the C API through SWIG (I suspect MMLSpark's LightGBM might be affected too), and even for users of the LightGBM Python API whose code is wrapped by JEP.
For production scenarios, ensuring the correctness of scores is of paramount importance, and right now there are no such guarantees for LGBM providers that use Java (see the issue above).
Request for Change
After assessing alternatives, and seeing that the last fix was reverted, the robust solution would be to change the model read/write of floating-point numbers to use process-locale-independent methods.
Implementation proposal
Replace floating-point reads/writes in model save/load with calls to C++ streams (we already use them anyway for some parameters). We imbue such streams with the "C" locale to force the same locale for reads and writes. This fixes our bug.
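A minimal sketch of the read side under this proposal (illustrative, not the final patch): imbuing the stream with std::locale::classic() pins the decimal separator to '.' regardless of whatever locale the host process (e.g. a JVM) has installed globally.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Sketch: parse a double through a stream imbued with the classic "C"
// locale, so the result is independent of the global process locale.
double StrToDoubleLocaleIndependent(const std::string& text) {
  std::istringstream iss(text);
  iss.imbue(std::locale::classic());
  double value = 0.0;
  iss >> value;
  return value;
}
```

The write side is symmetric: an std::ostringstream imbued the same way before `<<`-ing the value.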
It might be slightly slower to read/write the model, but it will be correct every time, independently of the process locale settings. According to this benchmark, https://tinodidriksen.com/2011/05/cpp-convert-string-to-double-speed/, it would result in a ~3x slower floating-point conversion (reusing streams). I doubt it's actually 3x slower (see tricks to make it faster: http://cplusplus.bordoon.com/speeding_up_string_conversions.html). Besides, we're already using streams for some numbers, so the model read/write slowdown wouldn't reach that value.
Is that an issue?
Can I get your thoughts on this? Good, bad? Secondary effects I might be missing? Would such a PR be welcome? @guolinke @StrikerRUS @imatiach-msft et al.
Thank you!
Not enough? Future - Alternative libraries
In the future, if needed, we can upgrade to different libraries like Boost.Spirit or David M. Gay's library, which is more accurate than glibc (the current implementation). If we do so, we may see rounding differences (which in principle are small).
Side note on floating-point conversion precision and David Gay's library
GCC uses glibc to compile LightGBM, and according to these links, David Gay's implementation of string<->floating-point conversion, https://www.ampl.com/netlib/fp/dtoa.c, is even more accurate than what we're using now. It is a single file that can be compiled and configured through preprocessor flags, in our case to ignore locale processing altogether (resulting in "C"-locale behaviour for both reads and writes).