scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Create estimators for inference only #28520

Open gorj-tessella opened 7 months ago

gorj-tessella commented 7 months ago

Describe the workflow you want to enable

Allow a trained estimator to be converted into a form that supports only predict/transform-type operations, not fitting. In many cases, the conversion could also make the estimator more compact or more performant.

For instance, feature selection steps in a pipeline may rely on complex models during training, but at inference they simply drop unused features. When deploying the model, conversion of the feature selection step to a simpler form could save memory and model load time.

Describe your proposed solution

Add a new method to BaseEstimator, prep_for_inference(self), that returns a model that retains all predict/transform methods but does not necessarily support fitting. By default it would return self. Estimators and transformers could override it as necessary. Pipeline would convert each of its steps.
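To make the proposal concrete, here is a minimal sketch of how the method could behave. It does not use scikit-learn: `Estimator`, `VarianceSelector`, `ColumnDropper`, and this `Pipeline` are hypothetical toy stand-ins written only to illustrate the idea, and the real integration would likely look different.

```python
class Estimator:
    """Toy stand-in for BaseEstimator (hypothetical, not scikit-learn code)."""

    def prep_for_inference(self):
        # Default from the proposal: nothing to slim down, so return self.
        return self


class ColumnDropper(Estimator):
    """Inference-only transformer: keeps a fixed list of column indices."""

    def __init__(self, keep):
        self.keep = list(keep)

    def transform(self, rows):
        return [[row[i] for i in self.keep] for row in rows]


class VarianceSelector(Estimator):
    """Toy selector that keeps columns whose variance exceeds a threshold."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def fit(self, rows):
        self.variances_ = []
        for col in zip(*rows):
            mean = sum(col) / len(col)
            self.variances_.append(sum((v - mean) ** 2 for v in col) / len(col))
        return self

    def _kept(self):
        return [i for i, v in enumerate(self.variances_) if v > self.threshold]

    def transform(self, rows):
        keep = self._kept()
        return [[row[i] for i in keep] for row in rows]

    def prep_for_inference(self):
        # Override: only the kept indices are needed at inference time.
        return ColumnDropper(self._kept())


class Pipeline(Estimator):
    """Toy pipeline of transformers only."""

    def __init__(self, steps):
        self.steps = list(steps)

    def fit(self, rows):
        for step in self.steps:
            rows = step.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for step in self.steps:
            rows = step.transform(rows)
        return rows

    def prep_for_inference(self):
        # Convert each step, as described in the proposal.
        return Pipeline([step.prep_for_inference() for step in self.steps])


X = [[1.0, 5.0, 0.0], [2.0, 5.0, 1.0], [3.0, 5.0, 0.0]]
pipe = Pipeline([VarianceSelector()]).fit(X)
slim = pipe.prep_for_inference()
```

In this sketch `slim` transforms data exactly like `pipe`, but its only step is a `ColumnDropper` with no `fit` method and no training state.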

gorj-tessella commented 7 months ago

Alternatively, this functionality could be supported on just feature selection and pipelines. This would require a PreselectedFeatures class that all feature selection classes could convert into.
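A rough sketch of what such a class might look like, assuming the fitted selector exposes `get_support(indices=True)` as scikit-learn's feature selectors do. The `from_selector` constructor and the `FakeFittedSelector` stand-in are hypothetical names used only for illustration, and rows are plain Python lists to keep the example dependency-free.

```python
class PreselectedFeatures:
    """Hypothetical inference-only selector: stores only the kept column indices."""

    def __init__(self, indices):
        self.indices = tuple(indices)

    @classmethod
    def from_selector(cls, selector):
        # Assumes the fitted selector offers get_support(indices=True),
        # the convention scikit-learn's feature selectors follow.
        return cls(int(i) for i in selector.get_support(indices=True))

    def transform(self, rows):
        return [[row[i] for i in self.indices] for row in rows]


class FakeFittedSelector:
    """Stand-in for a fitted selector that carries heavy training state."""

    def __init__(self, mask):
        self.mask = mask
        self.heavy_model_ = list(range(100_000))  # pretend training state

    def get_support(self, indices=False):
        if indices:
            return [i for i, kept in enumerate(self.mask) if kept]
        return list(self.mask)


selector = FakeFittedSelector([True, False, True])
compact = PreselectedFeatures.from_selector(selector)
result = compact.transform([[10, 20, 30], [40, 50, 60]])
```

The converted object keeps the two selected column indices and nothing else, so the heavy training state never needs to be stored or loaded at deployment time.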

glemaitre commented 7 months ago

Have you looked at sklearn-onnx, given that the idea is to keep only the inference part?

adrinjalali commented 7 months ago

cc @GaelVaroquaux since you were mentioning exactly this the other day.

amueller commented 7 months ago

@adrinjalali @GaelVaroquaux curious what your thoughts were. My intuition is to use onnx or ... completely change the scikit-learn API and have the fitted model be a different class than the fitting algorithm ;)

GaelVaroquaux commented 7 months ago

> @adrinjalali @GaelVaroquaux curious what your thoughts were.

The goal would be to facilitate restoring predictors from storage, including across versions. Consider, for instance, linear models for regression: the prediction function is really simple and easy to keep stable over time. For the fitting algorithm, on the other hand, it is much harder to promise that options won't change, or that a fitting procedure will give the same result on the same data.
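To illustrate how little state the prediction side of a linear regressor needs, here is a hedged sketch that stores only the coefficients and intercept as JSON and restores a prediction function from them. The helper names and the `"format"` field are assumptions made for this example, not an existing scikit-learn API.

```python
import json


def dump_linear_predictor(coef, intercept):
    """Serialize the only state a linear regressor needs at predict time."""
    # "format" is a hypothetical version tag for the storage layout itself.
    return json.dumps({"format": 1, "coef": list(coef), "intercept": float(intercept)})


def load_linear_predictor(payload):
    """Rebuild a predict function from the stored coefficients."""
    state = json.loads(payload)
    coef, intercept = state["coef"], state["intercept"]

    def predict(rows):
        # y = w . x + b for each row
        return [sum(w * x for w, x in zip(coef, row)) + intercept for row in rows]

    return predict


payload = dump_linear_predictor(coef=[2.0, -1.0], intercept=0.5)
predict = load_linear_predictor(payload)
predictions = predict([[1.0, 1.0], [3.0, 2.0]])
```

Nothing about the solver, regularization, or other fitting options appears in the stored payload, which is why this part is comparatively easy to keep stable across versions.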

> My intuition is to use onnx

I would say we should consider it as an option, but one of the factors in scikit-learn's historical success has been that it requires very little that is not already installed on every data scientist's computer.

> or ... completely change the scikit-learn API and have the fitted model be a different class than the fitting algorithm ;)

I don't think that we want to go this way. Another factor in scikit-learn's success is that it exposes a very simple surface to users, with little to learn or understand.

adrinjalali commented 7 months ago

This could also be a separate package, which would let us iterate faster.

We could have a scikit-learn-predictor kind of thing, where we get predictors from our classes. Testing is also not that hard: we check that the output of the predictor is the same as that of the native sklearn class. However, this does seem A LOT like ONNX, with the benefit of being in Python and lightweight.
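The testing idea can be sketched as follows. Everything here is a toy assumption: `NativeLine` stands in for a native estimator, `LinePredictor` for its hypothetical predict-only counterpart, and `same_predictions` for the equivalence check.

```python
import math


class NativeLine:
    """Toy stand-in for a native estimator: 1-D least-squares line fit."""

    def fit(self, xs, ys):
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        self.slope_ = sxy / sxx
        self.intercept_ = mean_y - self.slope_ * mean_x
        return self

    def predict(self, xs):
        return [self.slope_ * x + self.intercept_ for x in xs]


class LinePredictor:
    """Hypothetical predict-only counterpart holding just two floats."""

    def __init__(self, slope, intercept):
        self.slope, self.intercept = slope, intercept

    def predict(self, xs):
        return [self.slope * x + self.intercept for x in xs]


def same_predictions(native, predictor, xs, tol=1e-12):
    """Return True when both objects predict the same values on xs."""
    expected, actual = native.predict(xs), predictor.predict(xs)
    return len(expected) == len(actual) and all(
        math.isclose(e, a, rel_tol=tol, abs_tol=tol) for e, a in zip(expected, actual)
    )


native = NativeLine().fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
predictor = LinePredictor(native.slope_, native.intercept_)
matches = same_predictions(native, predictor, [-1.0, 0.5, 10.0])
```

A real test suite would presumably run a check like this for every supported estimator, over a range of inputs.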

betatim commented 7 months ago

Having a "predictor only" solution to the problem of "I trained a model in vX and now want to use it in vX+1" would be cool, especially because there is currently no good answer to the problem and it crops up semi-regularly on the issue tracker.