skorch-dev / skorch

A scikit-learn compatible neural network library that wraps PyTorch
BSD 3-Clause "New" or "Revised" License

[Feature request] Adding NeuralNetTransformer #482

Open YannDubs opened 5 years ago

YannDubs commented 5 years ago

Hi,

To be really sklearn-ish I think it would be very nice to have a class NeuralNetTransformer which inherits from TransformerMixin and implements a .transform method.

Use case: representation learning.

Example: use a VAE for dimensionality reduction, then work in this new representation space (no joint learning). This is often done in semi-supervised learning (e.g. the M1-M2 model), reinforcement learning (e.g. world models), ... It would also enable the use of all the predictors in sklearn: first transform using a VAE / pretrained CNN without its last layers, then predict using any sklearn predictor (e.g. SVM) or perform some clustering, all in a single Pipeline.

How: the quickest way is to have a transform function that does the same as predict (i.e. concatenate the first outputs of forward). But the representation is typically an intermediate output of the model (e.g. the latent in a VAE or one of the middle layers in a CNN), so it would be better not to compute the whole forward pass if only the first few steps are needed. One could maybe have a .transform function in the model that only returns what is needed? Or maybe an is_transform flag on forward?

The first method has the downside of doing more computation than needed and not being able to predict and transform with the same module (although this latter issue could be solved by using, for example, the second output of forward as the transform). The second method is better, but it requires the module to have an additional function / flag, which skorch was able to circumvent for the classifier and the regressor.

PS: thanks for the great library :)

githubnemo commented 5 years ago

Just so that I understand the intent correctly, you propose something that enables the user to easily use pretrained models or transforming/embedding models in sklearn pipelines? Generally I like this idea!

Basically, we could implement transform as a call to forward and expect that most users will use the class for pretrained models that directly output e.g. embeddings. Users are able to override the transform method if they want to, for example in cases where models output features in transform and do classification via predict.
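
For concreteness, a rough sketch of that idea (illustrative only, not actual skorch code); the base NeuralNet's predict_proba already concatenates the first output of forward as a numpy array, so transform could simply reuse it, and users could override it as needed:

class NeuralNetTransformer(NeuralNet, TransformerMixin):
    def transform(self, X):
        # reuse predict_proba: returns the first output of forward as a numpy array
        return self.predict_proba(X)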

YannDubs commented 5 years ago

@githubnemo Yes, this is exactly what I'm saying. Once in this new embedded space, one could use any algorithm in sklearn for classification / regression / clustering with far fewer parameters, thanks to the low dimensionality of the embedded space. I feel that this would be a major use case of skorch: taking advantage of the whole sklearn library.

Yes I think that the output of transform should be a numpy concatenation of the first output of forward. For training purposes you could define your own loss and use multiple outputs, as the embeddings will often not be directly used in training.

thomasjpfan commented 5 years ago

This is a really good idea! I am thinking of a workflow where one uses a NeuralNetClassifier for training, then pulls out a piece of the neural net and puts it into a NeuralNetTransformer as a feature extractor.

YannDubs commented 5 years ago

@thomasjpfan that's exactly what I'm currently doing with a semi-supervised VAE (M2), but as you say, I think there are many other applications. E.g. use a resnet and then do clustering on the last convolutional output.

And the implementation should be pretty straightforward.

BenjaminBossan commented 5 years ago

I believe the simplest implementation would be to have transform do what predict_proba currently does.

The problem that we may want to return an intermediate result and avoid having to do unnecessary computations is valid. I wonder if this can be solved by checking if we are training:

def forward(self, X):
    ...
    if not self.training:
        return intermediate_result
    ...
    return intermediate_result, final_result

Also, I would consider switching off internal validation inside the transformer by default, since that validation has no immediate benefit on its own (only when combined with the final estimator, which is validated differently).

YannDubs commented 5 years ago

I don't think you can use training as a proxy (this is why I was talking about an is_transform flag). When training my transformer (imagine a CNN or VAE), I would probably still want a validation set for something like early stopping. In that case self.training would be False during validation, yet final_result would still be needed.

Basically I am thinking something like:

resnet = ...  # returns the layer before softmax as intermediate result
transformer = NeuralNetTransformer(resnet, ...)
pipe = Pipeline([("transformer", transformer), ("clustering", KMeans())])
clusters = pipe.fit_predict(data)

In that case, NeuralNetTransformer would set self.module_.is_transform = False in fit and self.module_.is_transform = True in transform.
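
For illustration, a module respecting such a flag might look like this (is_transform is a hypothetical attribute that the transformer would set; the encoder/decoder structure is just an example):

class VAEModule(nn.Module):
    def forward(self, X):
        z = self.encoder(X)
        if getattr(self, "is_transform", False):
            return z                   # transform: only the latent code is needed
        return self.decoder(z), z      # fit / predict: full forward pass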

Not sure what you mean by "internal validation"?

BenjaminBossan commented 5 years ago

Not sure what you mean by "internal validation"?

I meant exactly the thing that skorch does where it splits off some data as a validation set for early stopping etc.

The issue I see with the is_transform attribute or similar is that this would require special adjustments of the pytorch module that don't make any sense without skorch. self.training is at least something that already exists.

Let me make a proposal to discuss.

Let's assume we want to use an AE. It consists of 2 submodules, encoder and decoder, and for the transformer, we only need the encoder part. Then we could train the AE in a first step using a normal NeuralNet with all the bells and whistles. Once training is completed, we transfer net.module_.encoder into the NeuralNetTransformer. NeuralNetTransformer would then not do any further training when fit is called. Similarly, one could use a pretrained net (like ResNet) and stuff it into NeuralNetTransformer.

If a user wants to train the module in the NeuralNetTransformer, it should still be possible. Maybe this can be controlled using an init parameter (requires_training or so). But then the user has to accept the fact that unnecessary computations may be performed (unless the self.training trick above is used).

@YannDubs Would this cover your cases?

YannDubs commented 5 years ago

@BenjaminBossan I agree with the downside from is_transform.

I think what you propose makes sense 👍 : the only downside is that you have to pretrain outside of sklearn pipeline and then put it back in as a pretrained transformer (which is only 2 lines of additional code).

So to train a module in skorch and then create a transformer you would do:

vae_trainer = NeuralNetClassifier(VAE, ...)
vae_trainer.fit(data)

pipe = Pipeline([("transformer", NeuralNetTransformer(vae_trainer.module_.encoder, ...)),
                 ("classifier", SVC(...))])
y_pred = pipe.fit_predict(data)
BenjaminBossan commented 5 years ago

Another solution that I could think of is something along this line: suppose we had a context manager that allows temporarily switching out the module_ attribute of NeuralNet et al. Then a user could specify this:

class MyTransformer(NeuralNetTransformer):
    def transform(self, X):
        with temporary_module(self, self.module_.encoder):
            return super().transform(X)

I haven't completely thought this through, but it doesn't seem to be too complicated.
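
For illustration, a minimal sketch of what such a helper could look like (the name temporary_module and its behavior are assumptions, not existing skorch API):

from contextlib import contextmanager

@contextmanager
def temporary_module(net, module):
    # temporarily swap out net.module_, restoring the original afterwards
    original = net.module_
    net.module_ = module
    try:
        yield net
    finally:
        net.module_ = original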

thomasjpfan commented 4 years ago

I just noticed https://github.com/scikit-learn/scikit-learn/pull/8574 which adds transform to MLPClassifier. If we had an API that can specify which layer in the module to output (when calling transform), this could enable the NeuralNet to behave as a transformer.

BenjaminBossan commented 4 years ago

I'm a bit rusty on this issue, so let me try to summarize the constraints we should consider:

  1. AFAICT, sklearn will check whether the estimator has a transform attribute (after checking fit_transform) so that it can be used as a transformer. We would therefore need to (conditionally) add a transform method on our NeuralNet.
  2. We want to be able to specify what output is returned from the module_ as the transform output.
  3. We want to avoid unnecessary work should the transform output be an intermediate result.
  4. We want to avoid introducing any kind of convoluted path into the module that wouldn't make any sense outside the skorch context.
  5. We want to avoid having a separate training and inference estimator.

One proposal would be to add a new __init__ parameter, say transform_method, which is None by default. It can be set to be a string, in which case we use that string as the method name to get the results from the module_.

So e.g.:

# inspired by torchvision's SqueezeNet
class MyModule(nn.Module):
    def features(self, X):
        ...

    def forward(self, X):
        X = self.features(X)
        return self.classifier(X)

net = NeuralNet(MyModule, transform_method='features', ...)
net.fit(X, y).transform(X)

Then we would turn on machinery similar to that used when calling predict_proba. The adjustments to the module required for this to work should be minimal (it's probably a good idea in the first place to split the forward method into functional sub-methods).

Regarding the implementation, I see two avenues:

One solution might still be the context manager as proposed above. If we expose that context manager as a public function, users could also use it to, for instance, make forward_iter return the output of an arbitrary method on the module.

Alternatively, we could allow the method name to be passed to forward_iter. The advantage would be that it's a little bit less "magic" than a context manager. The disadvantage is the rather deep call stack (transform -> forward_iter -> evaluation_step -> infer), with each method needing to drag along the method name.
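
To make the context manager avenue concrete, a rough sketch of how transform could dispatch to the configured method, building on the temporary_module helper sketched above (the _MethodWrapper class is hypothetical, and transform_method is assumed to be stored as a regular __init__ parameter):

class _MethodWrapper(nn.Module):
    """Route forward() to a named method of the wrapped module."""
    def __init__(self, module, method_name):
        super().__init__()
        self.module = module
        self.method_name = method_name

    def forward(self, *args, **kwargs):
        return getattr(self.module, self.method_name)(*args, **kwargs)


class NeuralNetTransformer(NeuralNet, TransformerMixin):
    def transform(self, X):
        if self.transform_method is None:
            return self.predict_proba(X)
        # temporarily route all forward calls through the chosen method
        wrapped = _MethodWrapper(self.module_, self.transform_method)
        with temporary_module(self, wrapped):
            return self.predict_proba(X)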

thomasjpfan commented 4 years ago

Regarding the implementation, I see two avenues:

Passing the transform_method along would also change the signatures of those functions. For this specific case I would prefer the context manager.

guybuk commented 4 years ago

In a slightly different context, how would one go about saving and loading such a network? Given:

class MyModule(nn.Module):
    def __init__(...num_classes...):
        self.feature_extractor=FeatureExtractor(...)
        self.classifier=SomeClassifier(num_classes,...)

    def forward(self, X):
        X = self.feature_extractor(X)
        return self.classifier(X)

If net.save_params only saves self.feature_extractor then resuming training for this class will be impossible. However, saving MyModule in its entirety leads to its own problems:

  1. It would require creating the entire architecture instance; if we only want to extract features, we might not want to pass (or even have) all the information about the original network, like num_classes.
  2. Loading the entire state_dict into a FeatureExtractor instance would require load_params(strict=False).
  3. The values for strict need to be dynamic: to resume training we need strict=True, for inference/feature extraction we need strict=False.

Thanks!

BenjaminBossan commented 4 years ago

@guybuk I understand your general idea but not all the details.

What function are you talking about when you mention load_params(strict=...)? I don't see any function with that name and argument. Or are you suggesting to add it?

Regarding the general problem: I think it would be difficult to handle this from the skorch side of things because there are just too many possibilities to encode all of them as arguments to net.save_params and net.load_params.

It might be better to handle this on the side of the module. Under the hood, skorch calls:

# save_params
torch.save(self.module_.state_dict(), ...)
# load_params
self.module_.load_state_dict(torch.load(f, ...))

It's more complicated than that, but that's the meat. Therefore, if you want to store and load only some parts of your module, you could modify its state_dict and load_state_dict methods; for instance, you could pluck out certain values from the dict that are not required for prediction. My reasoning is that the person who defines the module knows best what is and is not required for it to work, so it might be best to give control over this matter to that person.

Maybe there is a more canonical way of controlling this in pytorch but a quick search didn't reveal any. Do you think this would work for you?

guybuk commented 4 years ago

What function are you talking about when you mention load_params(strict=...)? I don't see any function with that name and argument. Or are you suggesting to add it?

I was unclear. I was talking about net.load_params(...), which internally calls self.module_.load_state_dict(state_dict), where strict defaults to True.

I did manage to solve the problem using vanilla pytorch, but am wondering whether there isn't a more general solution that could be incorporated in skorch.

Since the discussion was about creating a NeuralNetClassifier and using a part of that network as a NeuralNetTransformer, let's assume the following:

class MyModule(nn.Module):
    def __init__(...num_classes...):
        self.feature_extractor=FeatureExtractor(...)
        self.classifier=SomeClassifier(num_classes,...)

    def forward(self, X):
        X = self.feature_extractor(X)
        return self.classifier(X)

trainer=NeuralNetClassifier(MyModule...)
trainer.fit(...)
trainer.save_params(checkpoint=cp)

Wouldn't this be the desired behavior?:

resumed_trainer=NeuralNetClassifier(MyModule...)
resumed_trainer.load_params(cp)
resumed_trainer.fit(...) # resume training

and:

feature_extractor=NeuralNetTransformer(FeatureExtractor...)
feature_extractor.load_params(cp) # this will fail due to missing parameters in the state_dict

Due to my inexperience I don't know whether this is possible or how to approach it.

BenjaminBossan commented 4 years ago

So if I understand your proposal correctly, you would like this last step to be successful?

If I'm not mistaken, the state_dict doesn't store any information about what it actually is (e.g. a FeatureExtractor or a MyModule). This is intentional, because even if the FeatureExtractor implementation is slightly changed, restoring the state dict will still work.

That means that when skorch's load_params sees the state dict, it cannot tell what it is supposed to match. Therefore, implementing your proposal looks very difficult to me. We would need to create complex logic that tries to match parameters based on their names. I can imagine this leading to all kinds of difficulties.

Do you think that my earlier proposal of overriding the state_dict and load_state_dict could work for your case? You could even make it depend on a parameter:

class MyModule(nn.Module):
    def __init__(self, num_classes=2, store_only_features=False):
        super().__init__()
        self.feature_extractor=FeatureExtractor()
        self.classifier=SomeClassifier(num_classes)
        self.store_only_features = store_only_features

    def forward(self, X):
        X = self.feature_extractor(X)
        if self.classifier:
            return self.classifier(X)
        return X

    def state_dict(self, *args, **kwargs):
        sd = super().state_dict(*args, **kwargs)
        if not self.store_only_features:
            return sd

        new_sd = {key: val for key, val in sd.items() if key.startswith('feature_extractor')}
        return new_sd

    def load_state_dict(self, state_dict, strict=True):
        if not self.store_only_features:
            return super().load_state_dict(state_dict, strict=strict)

        self.classifier = None
        return super().load_state_dict(state_dict, strict=False)
guybuk commented 4 years ago

After trying to tackle the saving/loading problem myself, I ran into those same difficulties. I dropped the attempt to save the feature extractor and the classifier separately. What I will most likely do is save the model as a pickle; that way I won't be forced to "remember" the original pytorch module's constructor parameters.

Back to the original topic: for what it's worth, I implemented the NeuralNetTransformer like this:

import numpy as np
from sklearn.base import TransformerMixin
from skorch import NeuralNet
from skorch.utils import to_numpy

class NeuralNetTransformer(NeuralNet, TransformerMixin):
    def transform(self, X):
        embedding = []
        for outs in self.forward_iter(X, training=False):
            outs = outs[1] if isinstance(outs, tuple) else outs
            embedding.append(to_numpy(outs))
        transforms = np.concatenate(embedding, 0)
        return transforms

I figured that if predict_proba can implicitly take the first output of forward, transform could go by the same logic and take the second output. The sklearn API doesn't provide more functions that could take advantage of the forward method, so this logic shouldn't scale out of control.

BenjaminBossan commented 4 years ago

This looks like a reasonable solution. How do you deal with outputs being computed for the transform step that are actually not required? Or does this problem not apply here?

guybuk commented 4 years ago

The problem does apply, but the overhead is small, and it's better to have this solution than not to have it.

My two cents: in general, I'm not aware of any sklearn models that have both a predict and a transform method to take inspiration from; the closest thing would be an sklearn.Pipeline with two components that are two different parts of the network. That makes a lot of sense, since the most common use of transform with a neural network would be in a pipeline with a KNN, PCA, or t-SNE on top anyway. This is of course problematic because a pipeline trains each part separately, plus there are pytorch/backprop issues even if joint training were possible.

BenjaminBossan commented 4 years ago

This of course is problematic because a pipeline trains each part separately, plus issues related to pytorch/backprop even if training together was possible.

Can you elaborate on that? What exactly are the issues here?

For me, it looks like it works. Here is a minimal example that builds upon your suggestion:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from skorch import NeuralNet
from skorch.utils import to_numpy
import torch
from torch import nn
import torch.nn.functional as F
torch.manual_seed(0)

X, y = make_classification(1000, 20, n_informative=10, random_state=0)
X, y = X.astype(np.float32), y.astype(np.int64)

class AutoencoderModule(nn.Module):
    def __init__(
            self,
            input_units=20,
            bottleneck_units=5,
    ):
        super().__init__()
        self.input_units = input_units
        self.bottleneck_units = bottleneck_units
        self.reset_params()

    def reset_params(self):
        self.dense0 = nn.Linear(self.input_units, self.bottleneck_units)
        self.dense1 = nn.Linear(self.bottleneck_units, self.input_units)

    def forward(self, X, **kwargs):
        X_bottleneck = self.dense0(X)
        X_out = self.dense1(F.relu(X_bottleneck))
        X_rec = F.tanh(X_out)  # range -1..1
        return X_rec, X_bottleneck

class NeuralNetTransformer(NeuralNet, TransformerMixin):
    def get_loss(self, y_pred, y_true, X, **kwargs):
        y_pred, _ = y_pred
        return super().get_loss(y_pred, y_true=X, X=X, **kwargs)

    def transform(self, X):
        out = []
        for outs in self.forward_iter(X, training=False):
            outs = outs[1] if isinstance(outs, tuple) else outs
            out.append(to_numpy(outs))
        transforms = np.concatenate(out, 0)
        return transforms

pipe = Pipeline([
    ('scale', MinMaxScaler(feature_range=(-1, 1))),  # range -1..1
    ('net', NeuralNetTransformer(
        AutoencoderModule,
        criterion=nn.MSELoss,
        module__bottleneck_units=5,
        lr=0.5,
    )),
    ('clf', LogisticRegression()),
])

cross_val_score(pipe, X, y, scoring='accuracy')
# returns array([ 0.61 ,  0.555,  0.72 ,  0.485,  0.56 ])
guybuk commented 4 years ago

My mistake, your example looks great!

Though you may have done it that way for simplicity, I'll say just in case that I'd probably move the get_loss method to a child class, because some transformers train on a supervised pretext task rather than an unsupervised one like an autoencoder.

Finally, as far as I can tell, the only issue left is wasted computation. Pytorch itself doesn't offer a neat solution to build upon (you need to call a submodule's forward), so I'm thinking there's no need to do much in skorch either. If anyone would like to explicitly save module_.encoder or module_.feature_extractor and load it into a new NeuralNetTransformer, they can, just like in pytorch.
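
For example, a sketch with plain pytorch, reusing the trainer / FeatureExtractor / feature_extractor names from the earlier example:

# save only the feature extractor's weights
torch.save(trainer.module_.feature_extractor.state_dict(), "features.pt")

# later: restore them into a fresh FeatureExtractor and wrap it as a transformer
extractor = FeatureExtractor()
extractor.load_state_dict(torch.load("features.pt"))
transformer = NeuralNetTransformer(extractor, ...)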

BenjaminBossan commented 4 years ago

Though you may have done it that way for simplicity, I'll say just in case that I'd probably move the get_loss method to a child class because some transformers still train on a supervised pretext task rather than unsupervised like an autoencoder.

Yes, this is a somewhat specific solution for this example. Other people may need slightly different solutions.

My conclusion for this issue so far is that it might not be best to provide a one-size-fits-all NeuralNetTransformer in skorch. Instead, I could see adding an example to the docs or a notebook as a better solution. This would encourage users to make their own adaptations, which will better fit their problems than anything ready-made that we could provide.

Later, if we find there is still a need for a built-in NeuralNetTransformer, we can always add it, but if we add a botched transformer, it will be hard to remove it later.

guybuk commented 4 years ago

Just an idea, but what if we make it so the transform function adds a forward hook after the relevant layer, runs predict, and then removes it?

class LayerOutputException(Exception):
    """Carries the intermediate layer output out of the interrupted forward pass."""
    def __init__(self, layer_output):
        super().__init__()
        self.layer_output = layer_output


class Transformer(NeuralNet, TransformerMixin):
    def __init__(self, module, transform_layer, *args, **kwargs):
        super(Transformer, self).__init__(module, *args, **kwargs)
        # assumes `module` is indexable by layer name/position, e.g. nn.Sequential
        self.transform_layer = module[transform_layer]

    def transform(self, X):
        hook = self.transform_layer.register_forward_hook(Transformer.early_stop)
        transforms = self.predict_proba(X)
        hook.remove()
        return transforms

    def infer(self, x, **fit_params):
        x = to_tensor(x, device=self.device)
        try:
            if isinstance(x, dict):
                x_dict = self._merge_x_and_fit_params(x, fit_params)
                return self.module_(**x_dict)
            return self.module_(x, **fit_params)
        except LayerOutputException as e:
            # the hook interrupted the forward pass; return the intermediate output
            return e.layer_output

    @staticmethod
    def early_stop(mod, input, output):
        raise LayerOutputException(output)

The benefits I see:

  1. You can call transform without actually doing a forward pass through the entire network
  2. Your forward output can remain clean (no need to return the transform in the module's forward)
  3. You can expose any number of layers as transforms and switch between them, etc.

The downside is obviously how ugly it feels to use exceptions in this way.

BenjaminBossan commented 4 years ago

I haven't worked with forward hooks, so I'm not 100% confident I understand all ramifications of your proposal.

I agree that using exceptions as a form of control flow is a bit ugly. I assume that this is to stop the execution of the module in the middle, once the desired intermediate output has been generated. However, I wonder if this cannot have some weird interactions, e.g. when any form of parallelization is used.

Another addition that seems necessary is to make sure that hook.remove is always called, otherwise it could result in a nasty side effect. Presumably, using a context manager could work here.
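
For illustration, a minimal sketch of such a context manager (the name forward_hook is made up; only register_forward_hook and handle.remove are actual pytorch API):

from contextlib import contextmanager

@contextmanager
def forward_hook(layer, hook_fn):
    handle = layer.register_forward_hook(hook_fn)
    try:
        yield handle
    finally:
        handle.remove()  # always remove the hook, even if the forward pass raised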

Finally, this won't work if the transformer output is not the direct output of a sub-module. E.g.

def forward(self, X):
    Xe = self.encoder(X)
    Xt = Xe + 1  # <-- desired output
    Xd = self.decoder(Xt)
    return Xd

Not a big deal, but still.

Overall, I like the idea but feel it might be too hacky to include as a core skorch feature, since once it's added, we would need to maintain it and fix all bugs that could arise from it.