jwyyy opened 2 years ago
Hey @jwyyy! Thanks for your interest in LightGBM!

Could you please clarify why `joblib`, a de-facto standard save/load mechanism in the scikit-learn ecosystem, doesn't work for you?
Hi @StrikerRUS, thank you so much for your reply!
> Could you please clarify why `joblib`, a de-facto standard save/load mechanism in the scikit-learn ecosystem, doesn't work for you?
`joblib` definitely works for serializing / deserializing models.

I think my point is whether it would be a better idea to unify the model saving / loading APIs. After all, boosters can be saved by calling `save_model()`. But for sklearn models, saving and loading use a different routine. This leads to different saved formats depending on which model is trained. A saved booster is a simple text file. It allows users to see every trained detail in a plain text reader, which a model saved via `pickle` / `joblib` doesn't have.
Another example is the linked PR I am currently working on. Currently, the model saving routine needs to be handled differently based on the logged model type (L146 doesn't work for all LightGBM models).

Please let me know your suggestions and which approach is preferred (because it would affect how I implement autologging for LightGBM sklearn models). Thank you very much for your feedback and ideas!
> But for sklearn models, saving and loading use a different routine. This leads to different saved formats depending on which model is trained.
I believe this is OK for different ecosystems.
> A saved booster is a simple text file. It allows users to see every trained detail in a plain text reader, which a model saved via `pickle` / `joblib` doesn't have.
In case you need a text model representation, you can easily get access to the underlying Booster class via the `booster_` property of the sklearn model.
To implement text save/load behavior for the sklearn interface, label transformations would need to be saved along with the basic model:
@StrikerRUS Thank you for your response! It helps me understand how LightGBM sklearn APIs are designed.
> In case you need a text model representation, you can easily get access to the underlying Booster class via the `booster_` property of the sklearn model.
The purpose here is not to obtain a text representation; it is to find which mode of model saving / loading makes the sklearn APIs easier to use.
[[Python] How to create LGBMClassifier from booster? #1942](https://github.com/microsoft/LightGBM/issues/1942)
I agree that this is one way to reload a sklearn model. However, I am not sure it is good practice in general. First, it requires accessing internal Python class members starting with `_`, which is not recommended for safety reasons. Second, the `predict()` method of the reloaded object doesn't behave the same as the original one: according to the example, the reloaded method returns a vector of predicted class probabilities, whereas the original method outputs class labels directly (label encoders are not saved, so L1038 doesn't work in the reloaded model).
I understand that there are definitely multiple ways to save / load sklearn models. My goal was to see whether there is any plan in the community to unify the saving / loading APIs of all Python models, including Boosters and the sklearn models.
Summary
Currently the sklearn models don't support saving / loading directly via APIs such as `save_model()` and `load_model()`. We need to get the Booster from a sklearn model first and then save it as a Booster. I am willing to contribute this feature under the guidance of the LightGBM community.
Motivation
It would be convenient to save / load sklearn models directly.
Description
Add `save_model()` and `load_model()` to sklearn models.

References
`xgboost.sklearn` supports model saving and loading. (Essentially we save / load models as Boosters, but need to address the sklearn model specification.)

A related PR: https://github.com/microsoft/LightGBM/pull/4802