microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[python][sklearn] Add model saving / loading to sklearn models #4841

Open jwyyy opened 2 years ago

jwyyy commented 2 years ago

Summary

Currently, sklearn models don't support saving / loading directly via APIs such as save_model() and load_model(). We first need to get the Booster from a sklearn model and then save it as a Booster.

I am willing to contribute this feature under the guidance of the LightGBM community.

Motivation

It would be convenient to save / load sklearn models directly.

Description

Add save_model() and load_model() to sklearn models.

References

xgboost.sklearn supports model saving and loading. (Essentially, we would still save / load models as Boosters, but the sklearn model specification also needs to be addressed.)

A related PR https://github.com/microsoft/LightGBM/pull/4802.

StrikerRUS commented 2 years ago

Hey @jwyyy ! Thanks for your interest in LightGBM!

Could you please clarify why joblib, a de-facto standard save/load mechanism in scikit-learn ecosystem, doesn't work for you?

jwyyy commented 2 years ago

Hi @StrikerRUS, thank you so much for your reply!

> Could you please clarify why joblib, a de-facto standard save/load mechanism in scikit-learn ecosystem, doesn't work for you?

joblib definitely works in serializing / deserializing models.

I think my point is whether it would be a better idea to unify the model saving / loading APIs. After all, Boosters can be saved by calling save_model(), but sklearn models use a different saving and loading routine. This leads to different saved formats depending on which model was trained. A saved Booster is a simple text file that lets users inspect every trained detail in a plain-text reader, something a model saved via pickle / joblib doesn't offer.

Another example is the linked PR I am currently working on: the model saving routine has to be handled differently depending on the logged model type (L146 doesn't work for all LightGBM models).

Please let me know your suggestions and which approach is preferred (because it would affect how I implement autologging for LightGBM sklearn models). Thank you very much for your feedback and ideas!

StrikerRUS commented 2 years ago

> But for sklearn models, saving and loading utilize a different routine. This leads to different saved formats depending on which models are trained.

I believe this is OK for different ecosystems.

> A saved booster is a simple text file. It allows users to see every trained detail in a plain text reader, which a model saved via pickle / joblib doesn't have.

In case you need a text model representation, you can easily access the underlying Booster class via the booster_ property of the sklearn model.

To implement text save / load behavior for the sklearn interface, the label transformations would need to be saved along with the basic model.

jwyyy commented 2 years ago

@StrikerRUS Thank you for your response! It helps me understand how LightGBM sklearn APIs are designed.

> In case you need a text model representation, you can easily get access to the underlying Booster class via booster_ property of sklearn model.

The purpose here is not just to obtain a text representation. It is about which mode of model saving / loading makes the sklearn APIs easier to use.

> [[Python] How to create LGBMClassifier from booster? #1942](https://github.com/microsoft/LightGBM/issues/1942)

I agree that it is one way to reload a sklearn model. However, I am not sure it is good practice in general. First, it requires accessing internal Python class members starting with _, which is not recommended for safety reasons. Second, the predict() method of the reloaded object doesn't behave the same as the original one: according to the example, the reloaded method returns a vector of predicted class probabilities, whereas the original method outputs class labels directly (label encoders are not saved, so L1038 doesn't work in the reloaded model).

I understand that there are multiple ways to save / load sklearn models. My goal was to find out whether there is any plan in the community to unify the saving / loading APIs of all Python models, including Boosters and the sklearn models.