slickml / slick-ml

SlickML 🧞: Slick Machine Learning in Python
https://www.SlickML.com
MIT License

[BUG]: XGBoost model objects are not currently serializable #181

Open amirhessam88 opened 1 year ago

amirhessam88 commented 1 year ago

Contact Details [Optional]

No response

What Operating System (OS) are you using?

Mac

What happened?

All models using the current XGBoostBaseEstimator pattern are not serializable.

from slickml.classification import XGBoostClassifier
import pickle
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

clf = XGBoostClassifier()
clf.fit(X, y)

with open("model.pkl", "wb") as f:
    pickle.dump(clf, f)

returns the following error:

ValueError: ctypes objects containing pointers cannot be pickled

which is apparently a known limitation when pickling ctypes objects --> https://stackoverflow.com/questions/9768218/how-to-save-ctypes-objects-containing-pointers

So we need to figure out a way to save the model, similar to what the vanilla xgboost model does with save_model(), if we want to keep the current wrapper functionality. This is currently a blocker for using ZenML, since we need to define a custom materializer if we want to pass the model in a step --> https://docs.zenml.io/advanced-guide/pipelines/materializers
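One way to keep plain pickle working for a wrapper class is to drop the attributes that hold native pointers in `__getstate__` and rebuild them in `__setstate__`. The sketch below only illustrates that pattern with the standard library. `NativeHandleWrapper` and its `_buffer` / `_handle` attributes are hypothetical stand-ins, not SlickML or XGBoost code, and which attribute on the real estimators holds the ctypes pointer still needs to be checked.

```python
import ctypes
import pickle


class NativeHandleWrapper:
    """Hypothetical stand-in for an estimator that keeps a ctypes handle."""

    def __init__(self, value: int) -> None:
        # Plain-Python state that pickles without problems.
        self.params = {"value": value}
        # Native objects; the pointer is the kind of attribute pickle rejects.
        self._buffer = ctypes.c_int(value)
        self._handle = ctypes.pointer(self._buffer)

    def __getstate__(self) -> dict:
        # Copy the instance dict and drop everything that wraps native memory.
        state = self.__dict__.copy()
        state.pop("_handle", None)
        state.pop("_buffer", None)
        return state

    def __setstate__(self, state: dict) -> None:
        # Restore the plain state, then rebuild the native objects from it.
        self.__dict__.update(state)
        self._buffer = ctypes.c_int(self.params["value"])
        self._handle = ctypes.pointer(self._buffer)


if __name__ == "__main__":
    wrapper = NativeHandleWrapper(7)
    restored = pickle.loads(pickle.dumps(wrapper))
    print(restored.params, restored._handle.contents.value)
```

This only works if the dropped objects can be rebuilt from the state that is kept, so for a fitted model the trained booster itself would still have to be stored in some serializable form.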

Relevant Logs/Tracebacks

ValueError: ctypes objects containing pointers cannot be pickled


amirhessam88 commented 1 year ago

@fa9r Felix from ZenML ideas 👇:

I can see two ways how this could be done:

1. Adjust all your classes to work out of the box with pickle. Since your classes all inherit from sklearn.base.BaseEstimator, ZenML will try to use its SklearnMaterializer to save/load your models by default, which in turn simply calls pickle.dump() and pickle.load(). This would require some debugging/digging into pickle, but you don’t have to write/use any ZenML custom materializers.

2. Write (potentially multiple) custom materializers for your classes. This is the cleaner solution in my opinion since you will have more control over how your classes are loaded/saved and it also allows you to use existing save/load utilities from other libraries, like XGB’s save_model(). How many materializers you would need will depend on your personal requirements, but from looking at your codebase, I think two might be enough for now:

   - XGBoostEstimatorMaterializer that handles any subclasses of BaseXGBoostEstimator using XGB’s save_model() and load_model() internally. See ZenML’s XgboostBoosterMaterializer implementation for reference. If you have any additional attributes/… in your custom classes that cannot be reconstructed from the XGB model, you could also save those in an additional file using the file format of your choice.

   - GLMNetEstimatorMaterializer that handles your GLMNet classes. I’m not familiar with GLMNet, so I can’t give any suggestions on how those might work. However, it might make sense to add a BaseGLMNetEstimator abstraction in your code base so you can add more GLMNet models in the future without having to modify this materializer every time.

For general instructions on how to write a custom materializer, see here: https://docs.zenml.io/advanced-guide/pipelines/materializers#building-a-custom-materializer
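A rough sketch of the save/load core such a materializer could wrap, assuming the wrapped model object exposes save_model(path) and load_model(path) the way xgboost.Booster does. The helper names (`save_estimator`, `load_estimator`), the file names, and the choice of JSON for the extra attributes are all assumptions for illustration. The ZenML materializer class around it is left out here.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Tuple

# Hypothetical file names, chosen for this sketch only.
MODEL_FILE = "model.json"
ATTRS_FILE = "attributes.json"


def save_estimator(booster: Any, attributes: Dict[str, Any], directory: str) -> None:
    """Write the native model file plus a JSON sidecar with extra attributes."""
    out = Path(directory)
    out.mkdir(parents=True, exist_ok=True)
    # Delegate model serialization to the library's own save_model().
    booster.save_model(str(out / MODEL_FILE))
    # Anything that cannot be rebuilt from the model goes into the sidecar;
    # it must be JSON-serializable for this to work.
    (out / ATTRS_FILE).write_text(json.dumps(attributes))


def load_estimator(
    make_booster: Callable[[], Any], directory: str
) -> Tuple[Any, Dict[str, Any]]:
    """Rebuild an empty model via make_booster(), then load both files."""
    src = Path(directory)
    booster = make_booster()
    booster.load_model(str(src / MODEL_FILE))
    attributes = json.loads((src / ATTRS_FILE).read_text())
    return booster, attributes
```

With XGBoost, make_booster could simply be xgboost.Booster, and the wrapper class would then be re-created from the returned booster and attributes.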