pycaret / pycaret

An open-source, low-code machine learning library in Python
https://www.pycaret.org
MIT License

[DOC]: clarification on exporting pipeline/transformer #2993

Closed by verajosemanuel 1 year ago

verajosemanuel commented 2 years ago

pycaret version checks

Location of the documentation

https://pycaret.gitbook.io/docs/get-started/functions/deploy#save_model

Documentation problem

A colleague who only uses sklearn (and is not permitted to install pycaret on the server) needs the preprocessing pipeline that was used to train the XGBoost model. To share the transformations applied to the data, I have been reading the documentation looking for a way to export the pipeline in a form that sklearn can use, but to no avail.

I found that I can save a model with the save_model function, but that file is meant for later use in pycaret. I would like more clarification on exporting steps and objects so they can be consumed outside pycaret when the package is not available for whatever reason.

My ideal process would be to train models with pycaret, choose the best one, and then export the preprocessing steps applied to the input data so that my colleague could take that file and use it in sklearn to transform data, check whether it fits their server workflow, and test different modeling approaches just in case.

Regards

Suggested fix for documentation

A better explanation of how to export preprocessing steps or transformers (even models) for use outside pycaret when the destination environment only has sklearn available.

moezali1 commented 2 years ago

@tvdboom Can you remind me what our conclusion was on this? With PyCaret 3.0, do we need pycaret installed in the inference environment, or would sklearn alone work? I think our original assumption was that the pickle format is self-contained, so pycaret would not be needed in the target inference environment, but I can't remember whether that was our final conclusion.

tvdboom commented 2 years ago

Unfortunately, it's not possible to use pycaret's transformation pipeline without installing the library. The reason is that sklearn didn't offer all the transformation steps we wanted for pycaret (nor the pipeline flexibility, think of allowing transformers that drop rows), so we created custom ones. Pickle is not self-contained: you need the library to be able to use the unpickled object correctly.
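To illustrate why a pickle file is not self-contained, here is a small standard-library sketch. It does not use pycaret at all: `fake_lib` and `CustomTransformer` are made-up stand-ins for a library and one of its custom classes. The pickle payload only records where the class lives, so loading it fails once that module can no longer be imported.

```python
import pickle
import sys
import types

# Hypothetical stand-in for a third-party library ("fake_lib" is a made-up name).
fake_lib = types.ModuleType("fake_lib")


class CustomTransformer:
    """Hypothetical stand-in for a custom transformer shipped by that library."""


# Make the class look as if it were defined inside fake_lib.
CustomTransformer.__module__ = "fake_lib"
CustomTransformer.__qualname__ = "CustomTransformer"
fake_lib.CustomTransformer = CustomTransformer
sys.modules["fake_lib"] = fake_lib

payload = pickle.dumps(CustomTransformer())

# The payload names the module and class; it does not contain the class code.
references_library = b"fake_lib" in payload

# Simulate an environment where the library is not installed.
del sys.modules["fake_lib"]
try:
    pickle.loads(payload)
    load_error = None
except ModuleNotFoundError as exc:
    load_error = exc

print(references_library)
print(type(load_error).__name__)
```

The same mechanism applies to any pickled object whose class comes from a package that is missing in the target environment.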

If you are sure that you are only using sklearn transformers in the pipeline, you could do the following:

  1. Get the pipeline from the pycaret experiment (pipeline attribute)
  2. Add all sklearn transformers and final estimator to a list. Note that they are wrapped in a class called TransformerWrapper. The attribute transformer of this class contains the underlying estimator.
  3. Make sure all these estimators are indeed from sklearn (check their __module__ attribute)
  4. Create a new sklearn pipeline using the list of estimators

This could work, but we are making no assurances.
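The four steps above could be sketched roughly as follows. This is a minimal sketch, not an official pycaret export path. It assumes the pipeline object exposes a `steps` list of `(name, step)` pairs like an sklearn `Pipeline`, and that wrapped steps expose the underlying estimator as `transformer`, as described above. The helper names (`unwrap`, `is_from`, `collect_steps`, `to_sklearn_pipeline`) and the `allowed_packages` parameter are made up for illustration. The wrapper may also do more than hold the transformer (for example, restrict it to certain columns), and this sketch ignores that.

```python
def is_from(estimator, allowed_packages=("sklearn",)):
    """Step 3: check that the estimator's class is defined in an allowed package."""
    module = type(estimator).__module__
    return any(module == pkg or module.startswith(pkg + ".") for pkg in allowed_packages)


def unwrap(step, allowed_packages=("sklearn",)):
    """Step 2: return the estimator held in a wrapper's `transformer` attribute."""
    if is_from(step, allowed_packages):
        return step  # already a plain estimator, nothing to unwrap
    return getattr(step, "transformer", step)


def collect_steps(pipeline, allowed_packages=("sklearn",)):
    """Steps 2 and 3: unwrap every step and reject anything from another package."""
    steps = []
    for name, step in pipeline.steps:
        estimator = unwrap(step, allowed_packages)
        if not is_from(estimator, allowed_packages):
            raise TypeError(
                f"step {name!r} uses {type(estimator).__module__}."
                f"{type(estimator).__name__}, which is not from {allowed_packages}"
            )
        steps.append((name, estimator))
    return steps


def to_sklearn_pipeline(pipeline, allowed_packages=("sklearn",)):
    """Step 4: build a plain sklearn Pipeline from the unwrapped estimators."""
    from sklearn.pipeline import Pipeline  # imported here so the checks above need no sklearn

    return Pipeline(collect_steps(pipeline, allowed_packages))
```

With an experiment object, here called `exp` for illustration, this would be used as `to_sklearn_pipeline(exp.pipeline)` (step 1). If the final estimator is an XGBoost model, its class is not defined in sklearn, so it would have to be allowed explicitly, for example with `allowed_packages=("sklearn", "xgboost")`, and xgboost would then need to be installed on the target server as well.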