Open david-cortes opened 2 years ago
Thanks for writing this up!
Can you please provide some evidence for the problem you're talking about?
LGBM_BoosterSaveModel
, where you think this precision loss is happening?Without that type of information, maintainers here will have to put some effort into figuring out things that I suspect you already know. Sharing such information could help us get to a fix more quickly.
You can confirm it in lines like these: https://github.com/microsoft/LightGBM/blob/b1b6db4bd72746afa41ce9f23683937e872ecc2b/src/io/tree.cpp#L352
Alternatively, you can save the model to a file and inspect it in a text editor to see that values are saved in a decimal representation.
Thanks! We'd need to test this, but do you think this issue could be the cause of #4680?
Don't know. Could be, if lgb.Dataset
does the same thing, but haven't looked at its code. Although in that case, if it is only failing for windows, you could perhaps try playing with mingw's __USE_MINGW_ANSI_STDIO
.
got it, thanks for the tip!
Do you think we need a binary serialization format in addition to current text one?
Related: #4217.
There was unsuccessful attempt to adopt protobuf format many years ago: #372, #908 (reverted from master
later).
Also related, our friends at XGBoost are trying to adopt Universal Binary JSON format as a binary serialization format: dmlc/xgboost#7545.
I think it'd be a good addition, since apart from ruling out any potential case of mismatches between fresh and deserialized models, it would imply decreased memory usage and faster serialization/deserialization (especially important for the R interface since it now keeps a serialized copy by default).
However, I think it'd be enough with simply using a custom format by memcpying arrays and struct fields to a char*
pointer. There's also the "cereal" library which auto-generates code for doing that using ostreams
(like std::stringstream
), but such an approach would have the downside of losing compatibility with earlier and future library versions with a different model structure.
but such an approach would have the downside of losing compatibility with earlier and future library versions with a different model structure.
It might be desirable to have a schema otherwise it can be difficult to debug compatibility issues in the future.
In the R and C interfaces of LightGBM, the serialization functions convert floating point numbers to text representations in decimal format in order to save and load models (e.g.
LGBM_BoosterSaveModel
andLGBM_BoosterSaveModelToString
). This loses precision in floating point numbers and can lead to small differences in predictions between a model used right after fitting it and a model that was loaded from a saved file or string.As an additional problem, serializing through this system also means that it's not possible to know the size that the serialized bytes will have beforehand (that is, the model needs to be serialized in order to know how long will the buffer than holds need to be), which makes serialization in the C interface less efficient.