Closed AlbertoEAF closed 3 years ago
Speaking about float2str conversion in saving models, XGBoost recently adopted Ryū algorithm (https://github.com/dmlc/xgboost/pull/5772) and numpy uses Dragon4 algorithm (https://github.com/numpy/numpy/pull/9941).
Before I commit to working on this, can someone point me to the files where the model load/save code is found?
The fix we had posted in the past https://github.com/microsoft/LightGBM/pull/2891/files required changes in: LGBM_BoosterCreateFromModelfile, LGBM_BoosterSaveModel, LGBM_BoosterSaveModelToString and LGBM_BoosterDumpModel in the C API.
These new implementations will require changes to the core code, which means fixes must be made in the functions called therein, namely Booster::Booster, Booster::SaveModelToFile, Booster::SaveModelToString and Booster::DumpModel.
Reading model files seems to be implemented in boosting_->LoadModelFromString(model_str, len), which leads me to GBDT::LoadModelFromString in gbdt_model_text.cpp.
In turn, that uses both ArrayToString and ArrayToStringFast (I don't know why we have and use both to save the model - legacy?). To write values, Common::DoubleToStr is used, which relies on snprintf (which is not locale-independent).
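To make the locale issue concrete, here is a minimal sketch (the helper name is hypothetical, not LightGBM's actual code) of why snprintf-based formatting is fragile: `%g`/`%f` honor the process-wide LC_NUMERIC locale, so a host runtime (e.g. a JVM) that installs a European locale makes the decimal separator a comma:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Sketch: snprintf("%.17g") honors the process-wide locale, so the
// decimal separator this emits depends on whatever LC_NUMERIC the host
// process (e.g. a JVM) has set globally.
std::string SnprintfDouble(double value) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%.17g", value);
  return buf;
}
```

Under the default "C" locale this yields "0.5" for 0.5; after a setlocale(LC_NUMERIC, ...) to e.g. a German locale, the same call would yield "0,5", which corrupts model files.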
I got lost on the virtual function pointers, though; I was expecting to see DART and the other boosters implement model read/write. Can I thus assume model text file reads/writes are implemented in gbdt_model_text.cpp's methods alone? Or do the other boosters use gbdt_model_text.cpp as well to implement the model file reads/writes? +1 ping for @guolinke as you wrote most of that code.
Thank you all :)
Gently ping @guolinke for two questions above about the code location.
sorry for missing this thread... @AlbertoEAF
I remember ArrayToStringFast is only for integers. So for model_to_string, changing DoubleToStr should be enough. As for string_to_model, there are several functions, like https://github.com/microsoft/LightGBM/blob/adfc9f5c61ad3d11b9768ed0b9104297434ce822/include/LightGBM/utils/common.h#L513-L518 and https://github.com/microsoft/LightGBM/blob/adfc9f5c61ad3d11b9768ed0b9104297434ce822/include/LightGBM/utils/common.h#L566-L574
yeah, all models share the same model input/output codes.
Thanks so much! I will start working on it ;)
By the way @StrikerRUS, regarding such libraries, I think I'll go with Ryū; I've been reading up a bit on it. Both came after David Gay's implementations, and Ryū came after Dragon4. Dragon4 has a fallback algorithm for some numbers which is slower. I also saw even faster implementations, but I don't want to get too close to the edge, as this is supposed to be stable.
I was thinking of using this C++ implementation of Ryū, https://github.com/expnkx/fast_io, as it provides a single header to include, but it requires C++20 while LightGBM is C++11 only. This leaves the original implementation with a C interface: https://github.com/ulfjack/ryu. I hope it's easy - and possible - to add to our build chain; I still have to figure that out.
@AlbertoEAF BTW, is it possible to have a two-stage solution? Like, have a script to localize/de-localize the model? LightGBM itself could always read/write in the current format, but when users want to read the content of a model file, we could localize it.
Hello @guolinke , I'm not sure what you mean. Why would we want to transform the model content to different locales? The model should behave like a "binary blob" which LightGBM can read and write correctly independently of the machine settings. It's cleaner to have it store and load always with the same format, we don't need models in different locales. Do you agree? :)
Also, I believe it would be even trickier: you would have to know precisely the locale with which the model was written in order to read it correctly, which can be hard to determine in some corner cases.
Finally, as LightGBM itself does not allow "changing locale" during its execution, we would have to create an external tool, and that tool would need to parse the model file structure to read it and then rewrite it with exactly the same structure - lots of code duplication and complexity.
@AlbertoEAF I misunderstood the locale problem. you are right, the model should be locale-independent.
Hello, I have 2 requests for input @StrikerRUS @guolinke:
Maybe there were changes in the meantime, because in the v3.0.0 code both ArrayToString and ArrayToStringFast can operate on doubles: ArrayToString only handles doubles, while ArrayToStringFast is a template that processes arrays of any type. I believe there's no reason to keep both anymore.
Second point: instead of using Ryū, I was thinking of using fmt (https://fmt.dev/latest/index.html), as its integration is much easier and it is still ~10x faster than the standard library in its current state. Besides being the foundation of C++20's new std::format library, it is also very well maintained, uses only C++11 features, and has explicit support for GCC 4.8 (https://fmt.dev/latest/index.html). What do you think? It also seems much easier to use than Ryū, and in this thread they say it's not much different in terms of speed: https://github.com/fmtlib/fmt/issues/1426. Update: I forgot about the reading part. I found scnlib on GitHub, which has a similar API, but it's not stable.
Thanks!
Wow, I replaced the ArrayToString calls for threshold, leaf_value and leaf_weight with ArrayToStringFast in Tree::ToString() (src/io/tree.cpp) and got surprising results. To test, I read a reference model file and wrote it back (left = v3.0.0 code, right = modified code):
Whilst ArrayToStringFast produces numbers with less precision, which saves space, the tree sizes also changed. Why is that?!
@AlbertoEAF I'm totally OK with using the fmt library! It looks quite popular based on the number of GitHub stars, and we can use it as a git submodule (like Boost's compute).
Hah, just noticed that the treelite project is using the fmt library. And treelite in turn is used in rapidsai's project for tree models:
https://github.com/dmlc/treelite/blob/1c495c52a6dff8413f1b66f2f0ecc4e390473c5e/cmake/ExternalLibs.cmake#L10-L14
https://github.com/rapidsai/cuml
Yeah @StrikerRUS, even integer-to-string conversions might benefit from it, not only floating-point conversions with shortest representation. LightGBM already has a custom Atoi function, built to improve performance I believe, but it seems not even that will beat this:
@guolinke @StrikerRUS I've hit an issue with the LightGBM model (the same as the one above with the image) which I don't understand and really need your help with. Any ideas about what might be causing this are greatly appreciated!
When I switch the call that turns floating-point parameters into strings in the model write, I sometimes get numbers with a redundant trailing 0 in the scientific notation, which shouldn't matter - I just posted a bug at fmtlib: https://github.com/fmtlib/fmt/issues/1873. See it to understand what I mean.
The big issue, though, is that for each of the two trees where one parameter changed, that tree's tree_size in the model grew by +1 as well! Any ideas?!
Diffs in the output model files produced by LightGBM. Each image shows the line difference between the "unpatched"/normal LightGBM code (in red) and the patched code (in green), followed by the differences in items from those lines, also in red/green:
In this case the model written with the patched code results in two trees where one parameter has a trailing 0, which leads to these tree size changes:
and that's what I don't understand. Why do the tree sizes change because of the written floating-point parameters?
If you want to take a look at the code change that led to the issue reported here and in https://github.com/fmtlib/fmt/issues/1873, it is this one: https://github.com/feedzai/LightGBM/pull/2/commits/da3d8f2397d87fcb784c854564a830763a24048d.
The images above were generated with the two commits:
PRs in progress (still in very early development):
Thank you! Any input is appreciated!
@guolinke The issue described above could be the root cause of our previous issue, where the maintainers of the Julia package had to remove the tree_sizes field to load a model created on another machine: #3137.
To further support the "bug" hypothesis, I added code to strip trailing 0's from the decimal digits of numbers, and it now works properly - i.e., there are no differences in the output models! (This is the temporary hack I proposed in https://github.com/fmtlib/fmt/issues/1873.)
This commit https://github.com/feedzai/LightGBM/pull/2/commits/c34c43da599b1bf60884d3576970103c26cc6204 adds that code (it is not a permanent solution, as it's a hack that reduces speed and is not "pretty").
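For reference, the stripping hack can be sketched roughly like this (a hypothetical helper, not the exact code from the commit): it removes redundant trailing zeros from the mantissa of a number in scientific notation, e.g. "1.50e-05" becomes "1.5e-05".

```cpp
#include <string>

// Sketch (hypothetical helper): strip redundant trailing zeros from the
// mantissa of a number in scientific notation, e.g. "1.50e-05" -> "1.5e-05".
// Numbers without both a '.' and an 'e' are returned unchanged.
std::string StripTrailingZeros(std::string s) {
  const std::size_t e = s.find('e');
  if (e == std::string::npos || s.find('.') == std::string::npos) return s;
  std::size_t last = e - 1;
  while (s[last] == '0') --last;   // drop trailing zeros of the mantissa
  if (s[last] == '.') --last;      // drop a now-dangling decimal point
  return s.substr(0, last + 1) + s.substr(e);
}
```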
I was already able to replace all model write calls except DoubleToStr, again due to divergences between fmt and snprintf, this time related to the output format using fixed precision (%.17g). However, there's a hack to work around this too until it's fixed in the library: truncating the output string to the maximum number of significant digits: https://github.com/fmtlib/fmt/issues/1875
@AlbertoEAF the tree_size is the string size of the tree model content. We use it for multi-threaded loading of tree models.
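In other words, tree_sizes records the serialized length of each tree block so a loader can jump straight to each tree's offset and parse trees in parallel; a one-character formatting change therefore bumps the recorded size by one. A minimal sketch with hypothetical names:

```cpp
#include <string>
#include <vector>

// Sketch (hypothetical names): each entry is the byte length of one
// serialized tree block, letting a loader split the model text into
// per-tree chunks and parse them on separate threads. Any formatting
// change that adds a character (e.g. a trailing '0') changes the size.
std::vector<std::size_t> ComputeTreeSizes(const std::vector<std::string>& tree_blocks) {
  std::vector<std::size_t> sizes;
  for (const auto& block : tree_blocks) sizes.push_back(block.size());
  return sizes;
}
```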
Ah! So it's not a bug! I get it now - it shouldn't be a problem then, thanks!
Hello @guolinke @StrikerRUS, I have good news: I've completed the code changes and they're passing all the tests!
Instead of relying on Ryū, I actually used fmt to write values to strings, as it provides similar performance and its API is much more in line with the rest of the LightGBM codebase. For string->double parsing I am using this library: https://github.com/lemire/fast_double_parser, which provides maximal floating-point string parsing resolution and matches the standard library's parsing output bit by bit.
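"Matches bit by bit" is a stricter claim than `==` equality: it compares the raw 64-bit patterns of the parsed doubles. A minimal sketch of such a check (hypothetical helper, assuming this is how the equivalence was verified):

```cpp
#include <cstring>

// Sketch: compare two doubles by their raw 64-bit patterns rather than by
// value. This also distinguishes results that compare equal with == but
// differ in representation, e.g. -0.0 vs 0.0.
bool BitwiseEqual(double a, double b) {
  return std::memcmp(&a, &b, sizeof(double)) == 0;
}
```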
Hence, transforming data to strings and string arrays should now be faster, and parsing strings to doubles with StringToArray should now be around 10x faster, according to their benchmarks.
I'd like to streamline the code further by merging Common's __StringToTHelperFast and __StringToTHelper; however, Common::Atof seems to diverge from the resolutions of the standard library's stringstreams and of fast_double_parser. Those two are equivalent to each other and provide the maximal parsing resolution.
Common::Atof string->double parsing diverges around the last 2 decimal places. See some examples:
It also seems to have less resolution:
As such, I had to keep __StringToTHelperFast and __StringToTHelper separate to maintain the current LightGBM behaviour without affecting read/write floating-point resolution.
> Common::Atof string->double parsing diverges around the last 2 decimal places. See some examples:
Looks like the difference is in the 18th+ digit in all examples. And according to IEEE 754, a double gives only 15-17 significant digits. So I believe it is OK to have inconsistent digits after the 17th place.
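A quick way to see this: two literals that differ only beyond the 17th significant digit parse to the exact same double. A minimal stdlib sketch (the helper name is illustrative):

```cpp
#include <cstdlib>

// Sketch: IEEE 754 doubles carry at most 17 significant decimal digits, so
// strings differing only past the 17th digit parse to the same bit pattern.
bool ParseToSameDouble(const char* a, const char* b) {
  return std::strtod(a, nullptr) == std::strtod(b, nullptr);
}
```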
Hmm, that's interesting! In that case we could replace Atof and further simplify the codebase :)
Thanks Nikita! I'll run some tests tomorrow and see if the scores don't change.
I remember that atof is used for precision-insensitive conversions (it is faster), like for split_gain. If the new solution is faster and more precise, we can switch to it.
Fixed via #3405. Thanks a lot @AlbertoEAF !
Thanks a lot for all the help @StrikerRUS @jameslamb ;)
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Hello,
Issue
With the revert of its PR, this issue became a concern again in Java: https://github.com/microsoft/LightGBM/issues/2890.
At the moment LightGBM uses strtod() and C++ streams to read/write the floating-point numbers in the model file. The issue is that Java changes the process locale away from the standard "C" locale at initialization (see https://github.com/microsoft/LightGBM/pull/2891). Current evidence shows this is true for Java providers that use the C API through SWIG (I suspect MMLSpark's LightGBM might be affected too), and even for users of the LightGBM Python API whose code is wrapped by JEP.
For production scenarios, ensuring the correctness of scores is of paramount importance, and right now there are no such guarantees for LGBM providers that use Java (see the issue above).
Request for Change
After assessing alternatives, and seeing that the last fix was reverted, the robust solution would be to change the model read/write of floating-point numbers to use process-locale-independent methods.
Implementation proposal
Replace floating-point reads/writes in model save/load with calls to C++ streams (we already use them anyway for some parameters). We imbue such streams with the "C" locale to force the same locale for reads and writes. This fixes our bug.
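A minimal sketch of the read side under this proposal (illustrative, not the final patch): imbuing the stream with std::locale::classic() pins the decimal separator to '.' regardless of whatever locale the host process (e.g. a JVM) has installed globally.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Sketch: parse a double through a stream imbued with the classic "C"
// locale, so the result is independent of the global process locale.
double StrToDoubleLocaleIndependent(const std::string& text) {
  std::istringstream iss(text);
  iss.imbue(std::locale::classic());
  double value = 0.0;
  iss >> value;
  return value;
}
```

The write side is symmetric: an std::ostringstream imbued the same way before `<<`-ing the value.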
It might be slightly slower to read/write the model, but it will be correct every time, independently of the process locale settings. According to this benchmark, https://tinodidriksen.com/2011/05/cpp-convert-string-to-double-speed/, it would result in a ~3x slower floating-point conversion (reusing streams). I doubt it's actually 3x slower (see tricks to make it faster: http://cplusplus.bordoon.com/speeding_up_string_conversions.html). Besides, we're already using streams for some numbers, so the model read/write slowdown wouldn't reach that value.
Is that an issue?
Can I get your thoughts on this? Good, bad? Secondary effects I might be missing? Would such a PR be welcome? @guolinke @StrikerRUS @imatiach-msft et al.
Thank you!
Not enough? Future - Alternative libraries
In the future, if needed, we can upgrade to different libraries like Boost.Spirit or David M. Gay's library, which is more accurate than glibc (the current implementation). If we do so, we may see rounding differences (which in principle are small).
Side note on floating-point conversion precision and David Gay's library
GCC uses glibc to compile LightGBM, and according to these links, David Gay's implementation of string<->floating-point conversion, https://www.ampl.com/netlib/fp/dtoa.c, is even more accurate than what we're using now. It is a single file that can be compiled and configured through preprocessor flags, in our case to ignore locale processing altogether (resulting in "C"-locale behaviour for both reads and writes).