microsoft / hummingbird

Hummingbird compiles trained ML models into tensor computation for faster inference.

[Question] Support for custom estimators and custom transformers #707

Closed mbignotti closed 1 year ago

mbignotti commented 1 year ago

Hi! I'm not sure if it's already been asked, but how difficult would it be to implement converters for custom models and custom transformers?

I often find myself writing wrappers around sklearn models and/or transformers from scratch. But then I lose all the benefits of the existing sklearn converters, such as onnx or Hummingbird.

Here is a simple example of a custom model for anomaly/fault detection:

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.decomposition import PCA

class PCADetector(BaseEstimator):
    def __init__(self, n_components: int = 2) -> None:
        super().__init__()
        self.n_components = n_components

    def fit(self, X: np.ndarray, y: np.ndarray = None, **kwargs) -> "PCADetector":
        # Quoted return annotation: the class name is not bound yet while the
        # class body is still being executed.
        self.estimator_ = PCA(n_components=self.n_components)
        self.estimator_.fit(X)
        return self

    def predict(self, X: np.ndarray, **kwargs) -> np.ndarray:
        # Reconstruct X from its PCA projection and score each sample by the
        # norm of its reconstruction residual (squared prediction error, SPE).
        X_hat = self.estimator_.inverse_transform(self.estimator_.transform(X))
        residuals = X - X_hat
        spe = np.sqrt(np.sum(residuals**2, axis=1))
        return spe

This, of course, cannot be converted with Hummingbird. It's necessary to write a custom converter, I guess.

Thanks a lot!

interesaaat commented 1 year ago

Ciao Marco, adding a custom op shouldn't be too hard. Unfortunately, at the moment we don't provide a specific API for this, but I can tell you how you can do it (we love contributions :smile:).

So, the first thing you need to do is add the class of your custom op to the supported ops.

Then you need to write a converter that takes your operator as input and returns a PyTorch version of the model. To do this, you first need to register a converter. You can use this as an example, where instead of "SklearnMLPClassifier" you should put "Sklearn_your_custom_op_class_name".

Then you need to provide the actual converter. Given that your implementation pretty much uses a bunch of np functions, it should be straightforward to implement. You can look into other converters' implementations to get an idea of how to do it. For example here.
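
To make this concrete, here is a rough, untested sketch of what such a converter could look like. The class, function, and op names here are made up for illustration; the converter signature and the register_converter call mirror the existing sklearn converters, but check the real files before relying on them.

import torch
from onnxconverter_common.registration import register_converter

class PCADetectorTorch(torch.nn.Module):
    # The real converters under hummingbird/ml/operator_converters/sklearn/
    # also extend the project's PhysicalOperator base class; a plain nn.Module
    # is used here just to keep the sketch short.
    def __init__(self, mean, components):
        super().__init__()
        self.register_buffer("mean", torch.from_numpy(mean))
        self.register_buffer("components", torch.from_numpy(components))

    def forward(self, x):
        # Same math as PCADetector.predict, but in tensor ops: project,
        # reconstruct, and return the per-sample residual norm (SPE).
        scores = (x - self.mean) @ self.components.t()
        x_hat = scores @ self.components + self.mean
        return torch.sqrt(((x - x_hat) ** 2).sum(dim=1))

def convert_sklearn_pca_detector(operator, device, extra_config):
    # operator.raw_operator is the fitted PCADetector instance.
    pca = operator.raw_operator.estimator_
    return PCADetectorTorch(pca.mean_.astype("float32"), pca.components_.astype("float32"))

# The op class itself still has to be added to the supported ops, as in the
# first step above.
register_converter("SklearnPCADetector", convert_sklearn_pca_detector)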

Let me know if this works for you.

mbignotti commented 1 year ago

Hi @interesaaat! Thank you for your reply! So, if I understand correctly, the idea is that you take the parameters and other relevant attributes (e.g. classes_) from the fitted sklearn estimator and pass them to a corresponding nn.Module that implements the same logic.

However, I'm wondering if, in this case, it would be easier to simply create a new nn.Module class (instead of inheriting from sklearn.base.BaseEstimator) that internally uses a Hummingbird-converted model. I'm not 100% sure how I would write it, but what I mean is something like this (ignoring the fact that inverse_transform is not supported):

import numpy as np
import torch
from sklearn.decomposition import PCA

from hummingbird.ml import convert

class PCADetector(torch.nn.Module):

    def __init__(self, n_components):
        super().__init__()
        self.n_components = n_components

    def fit(self, X: np.ndarray):
        model = PCA(n_components=self.n_components)
        model.fit(X)
        # Keep the Hummingbird-converted PyTorch model internally.
        self.estimator_ = convert(model, backend="pytorch", test_input=X)

    def forward(self, x):
        x_hat = self.estimator_.inverse_transform(self.estimator_.transform(x))
        residuals = x - x_hat
        # Note: these are still numpy ops, not torch ops.
        spe = np.sqrt(np.sum(residuals**2, axis=1))
        return spe

To give a little bit of context, I'll try to explain why I would like to do this.

The final goal is being able to deploy these models without having to deal with Python packages. The big problem with sklearn, and with Python in general for machine learning, is that it's very difficult to deploy custom models when you are not allowed to use Docker in production (our case). Custom models might be defined in a project-related repo, and the only way to ship them is to bundle them together with the source code. But this is something we want to avoid, as it might raise other dependency issues.

Another approach is to compile or convert the model to something like onnx or tvm. However, onnx and tvm support for custom models that don't use deep learning frameworks is very limited. That's why I'm trying to understand if Hummingbird could help me.

However, I'm not sure if a composition approach like the one above can be adapted to work with subsequent conversions to onnx or similar. On the other hand, following the official approach you described, registering custom operators in Hummingbird, might be more robust.

What do you think?

Thanks again, Marco.

interesaaat commented 1 year ago

Yea, the approach above won't work because even if you wrap the model as a pytorch module, the internal code still uses numpy, so you will need that dependency + Python. Hummingbird should be able to help in your use case because, as long as you provide your model implementation as tensor operations, you can export it with TorchScript or ONNX without Python or any other dependencies.
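
For instance, assuming the custom converter sketched above has been registered, the end-to-end export could look roughly like this (X_train is placeholder data):

import numpy as np
from hummingbird.ml import convert

X_train = np.random.rand(100, 10).astype(np.float32)  # placeholder data

detector = PCADetector(n_components=2).fit(X_train)

# Compile straight to TorchScript (backend="onnx" works the same way) and save
# an artifact that can be served without Python or sklearn installed.
hb_model = convert(detector, backend="torch.jit", test_input=X_train)
hb_model.save("pca_detector")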

mbignotti commented 1 year ago

The only thing I don't like is having to write the same model twice. But, at this point, I guess this is the only way to go (I've really investigated all the solutions I could find), because the only alternative I see is to write the code directly in a compiled language. I'll try to implement it and let you know if that works.

interesaaat commented 1 year ago

I don't think you will need to write the model twice, only the inference part (which looks quite easy). For your next model, you could just write the fit method in numpy and the predict in pytorch so that you don't have to replicate any work. Keep us posted!
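
One illustrative reading of this pattern, reusing the hypothetical PCADetectorTorch module from the converter sketch above so the inference logic is written exactly once:

import numpy as np
import torch
from sklearn.decomposition import PCA

def fit_pca_detector(X: np.ndarray, n_components: int = 2) -> torch.nn.Module:
    # Fitting stays in sklearn/numpy; only the learned tensors are kept.
    pca = PCA(n_components=n_components).fit(X)
    return PCADetectorTorch(pca.mean_.astype(np.float32), pca.components_.astype(np.float32))

# The returned module is pure tensor ops, so it can be exported directly, e.g.:
# torch.jit.trace(model, torch.from_numpy(X.astype(np.float32)))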

interesaaat commented 1 year ago

Closing at the moment. We can reopen in case.

mbignotti commented 1 year ago

Unfortunately I haven't had the time to work on it. I'll update you as soon as I can. Thanks!