microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License
16.54k stars · 3.82k forks

Serializers convert floats to strings #4681

Open david-cortes opened 2 years ago

david-cortes commented 2 years ago

In the R and C interfaces of LightGBM, the serialization functions convert floating point numbers to text representations in decimal format in order to save and load models (e.g. LGBM_BoosterSaveModel and LGBM_BoosterSaveModelToString). This loses precision in floating point numbers and can lead to small differences in predictions between a model used right after fitting it and a model that was loaded from a saved file or string.

A further problem is that with this scheme the size of the serialized output cannot be known beforehand (that is, the model has to be serialized first just to learn how large the buffer that holds it needs to be), which makes serialization through the C interface less efficient.

jameslamb commented 2 years ago

Thanks for writing this up!

Can you please provide some evidence for the problem you're talking about?

Without that type of information, maintainers here will have to put some effort into figuring out things that I suspect you already know. Sharing such information could help us get to a fix more quickly.

david-cortes commented 2 years ago

You can confirm it in lines like these: https://github.com/microsoft/LightGBM/blob/b1b6db4bd72746afa41ce9f23683937e872ecc2b/src/io/tree.cpp#L352

Alternatively, you can save the model to a file and inspect it in a text editor to see that values are saved in a decimal representation.

jameslamb commented 2 years ago

Thanks! We'd need to test this, but do you think this issue could be the cause of #4680?

david-cortes commented 2 years ago

I don't know. It could be, if lgb.Dataset does the same thing, but I haven't looked at its code. In that case, though, if it's only failing on Windows, you could perhaps try playing with MinGW's __USE_MINGW_ANSI_STDIO.

jameslamb commented 2 years ago

got it, thanks for the tip!

StrikerRUS commented 2 years ago

Do you think we need a binary serialization format in addition to the current text one?

Related: #4217.

There was an unsuccessful attempt to adopt the protobuf format many years ago: #372, #908 (later reverted from master).

Also related: our friends at XGBoost are trying to adopt the Universal Binary JSON format as their binary serialization format: dmlc/xgboost#7545.

david-cortes commented 2 years ago

I think it would be a good addition: apart from ruling out any potential mismatch between fresh and deserialized models, it would mean lower memory usage and faster serialization/deserialization (especially important for the R interface, since it now keeps a serialized copy by default).

However, I think it would be enough to simply use a custom format that memcpys arrays and struct fields into a char* buffer. There's also the "cereal" library, which auto-generates code for doing that through ostreams (like std::stringstream), but such an approach has the downside of losing compatibility with earlier and future library versions that have a different model structure.

trivialfis commented 2 years ago

> but such an approach would have the downside of losing compatibility with earlier and future library versions with a different model structure.

It might be desirable to have a schema; otherwise it can be difficult to debug compatibility issues in the future.