Open YannDubs opened 5 years ago
Just so that I understand the intent correctly, you propose something that enables the user to easily use pretrained models or transforming/embedding models in sklearn pipelines? Generally I like this idea!
Basically we could implement transform as a call to forward and expect that most users will use the class for pretrained models that directly output e.g. embeddings. Users are able to override the transform method if they want to, for example in cases where models output features in transform and do classification via predict.
@githubnemo Yes, this is exactly what I'm saying. Once in this new embedded space, one could use any algorithm in sklearn for classification / regression / clustering with far fewer parameters due to the low dimensionality of the embedded space. I feel that this would be a major use case of skorch: taking advantage of the whole sklearn library.
Yes, I think that the output of transform should be a numpy concatenation of the first output of forward. For training purposes you could define your own loss and use multiple outputs, as the embeddings will often not be directly used in training.
This is a really good idea! I am thinking of a workflow where one uses a NeuralNetClassifier for training, then pulls out a piece of the neural net and puts it into a NeuralNetTransformer as a feature extractor.
@thomasjpfan that's exactly what I'm currently doing with a Semi Supervised VAE (M2), but as you say I think that there are many other applications. E.g. use a resnet and then do for example clustering on the last convolutional output.
And the implementation should be pretty straightforward.
I believe the simplest implementation would be to have transform do what predict_proba currently does.
The problem that we may want to return an intermediate result and avoid having to do unnecessary computations is valid. I wonder if this can be solved by checking if we are training:
def forward(self, X):
    ...
    if not self.training:
        return intermediate_result
    ...
    return intermediate_result, final_result
Also I would consider switching off internal validation inside the transformer by default, since that value has no immediate benefit (only when combined with the final estimator, which is validated differently).
I don't think you can use training as a proxy (this is why I was talking about an is_transform flag). Indeed, when training my transformer (imagine a CNN or VAE) I would probably still want a validation set to do something like early stopping. In that case the module is not in training mode, yet it still has to return final_result.
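To make the objection concrete, here is a minimal plain-Python sketch. ToyModule is a hypothetical stand-in that only mimics the training flag and the train()/eval() methods of a torch nn.Module; it is not real skorch or PyTorch code. It shows that once forward returns early in eval mode, the final result is also missing during validation.

```python
class ToyModule:
    """Mimics the ``training`` flag of torch.nn.Module; not a real module."""

    def __init__(self):
        self.training = True

    def train(self, mode=True):
        self.training = mode
        return self

    def eval(self):
        return self.train(False)

    def forward(self, x):
        intermediate = x * 2
        if not self.training:
            # eval mode: stop early and skip the final computation
            return intermediate
        final = intermediate + 1
        return intermediate, final


module = ToyModule()
train_out = module.train().forward(3)  # both results are available
valid_out = module.eval().forward(3)   # only the intermediate result is returned
```

A validation loss computed on valid_out would have no final result to work with, which is why a separate flag was suggested instead.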
Basically I am thinking something like:
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline

resnet = ...  # returns the layer before softmax as the intermediate result
transformer = NeuralNetTransformer(resnet, ...)
pipe = Pipeline([("transformer", transformer), ("clustering", KMeans())])
clusters = pipe.fit_predict(data)
In which case NeuralNetTransformer would do self.module_.is_transform=False in fit and self.module_.is_transform=True in transform.
Not sure what you mean by "internal validation"?
> Not sure what you mean by "internal validation"?
I meant exactly the thing that skorch does where it splits off some data as a validation set for early stopping etc.
The issue I see with the is_transform attribute or similar is that this would require special adjustments of the pytorch module that don't make any sense without skorch. self.training is at least something that already exists.
Let me make a proposal to discuss.
Let's assume we want to use an AE. It consists of 2 submodules, encoder and decoder, and for the transformer, we only need the encoder part. Then we could train the AE in a first step using a normal NeuralNet with all the bells and whistles. Once training is completed, we transfer net.module_.encoder into the NeuralNetTransformer. NeuralNetTransformer would then not do any further training when fit is called.
Similarly, one could use a pretrained net (like ResNet) and stuff it into NeuralNetTransformer.
If a user wants to train the module in the NeuralNetTransformer, it should still be possible. Maybe this can be controlled using an init parameter (requires_training or so). But then the user has to accept the fact that unnecessary computations may be performed (unless the self.training trick above is used).
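As a rough illustration of that proposal, the sketch below uses a plain-Python class. FrozenTransformer, its train_fn argument and the list-based data are all hypothetical stand-ins, not skorch API. The point is only the control flow: fit is a no-op unless requires_training is set.

```python
class FrozenTransformer:
    """Hypothetical transformer wrapping an already trained encoder callable."""

    def __init__(self, encoder, requires_training=False, train_fn=None):
        self.encoder = encoder
        self.requires_training = requires_training
        self.train_fn = train_fn  # hypothetical hook, used only if training is requested

    def fit(self, X, y=None):
        if self.requires_training:
            if self.train_fn is None:
                raise ValueError("requires_training=True needs a train_fn")
            self.train_fn(self.encoder, X, y)
        return self  # sklearn convention: fit returns the estimator

    def transform(self, X):
        return [self.encoder(x) for x in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


pretrained = FrozenTransformer(encoder=lambda x: x * 2)
features = pretrained.fit_transform([1, 2, 3])  # fit does no training here
```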
@YannDubs Would this cover your cases?
@BenjaminBossan I agree with the downside from is_transform.
I think what you propose makes sense 👍: the only downside is that you have to pretrain outside of the sklearn pipeline and then put the module back in as a pretrained transformer (which is only 2 lines of additional code).
So to train a module in skorch and then create a transformer you would do:
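vae_trainer = NeuralNetClassifier(VAE, ...)
vae_trainer.fit(data, y)  # y holds the class labels
pipe = Pipeline([
    ("transformer", NeuralNetTransformer(vae_trainer.module_.encoder, ...)),
    ("classifier", SVC(...)),
])
# SVC has no fit_predict, so fit and predict are called separately
y_pred = pipe.fit(data, y).predict(data)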
Another solution that I could think of is something along these lines: suppose we had a context manager that allows temporarily switching out the module_ attribute of NeuralNet et al. Then a user could specify this:
class MyTransformer(NeuralNetTransformer):
    def transform(self, X):
        with temporary_module(self, self.module_.encoder):
            return super().transform(X)
I haven't completely thought this through, but it doesn't seem to be too complicated.
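A minimal sketch of what such a context manager might look like is given here. temporary_module does not exist in skorch; the name is taken from the snippet above and the implementation is an assumption. The net and module objects are SimpleNamespace stand-ins so the example runs without torch.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_module(net, module):
    """Temporarily replace ``net.module_`` and restore it afterwards."""
    original = net.module_
    net.module_ = module
    try:
        yield net
    finally:
        # Runs even if the body raises, so the net is never left swapped.
        net.module_ = original


# Stand-in objects; a real net would be a fitted skorch NeuralNet.
full = SimpleNamespace(encoder="encoder", decoder="decoder")
net = SimpleNamespace(module_=full)

with temporary_module(net, net.module_.encoder):
    inside = net.module_  # the encoder only
after = net.module_       # the full module again
```

The try/finally is the important part: if transform raises, the original module is still put back.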
I just noticed https://github.com/scikit-learn/scikit-learn/pull/8574 which adds transform to MLPClassifier. If we have an API that can specify which layer in the module to output (when calling transform), this could enable the NeuralNet to behave as a transformer.
I'm a bit rusty on this issue, so let me try to summarize the constraints we should consider:
- [...] transform attribute (after checking fit_transform) so that it can be used as a transformer. We would therefore need to (conditionally) add a transform method on our NeuralNet.
- [...] .module_ as the transform output.
One proposal would be to add a new __init__ parameter, say transform_method, which is None by default. It can be set to a string, in which case we use that string as the method name to get the results from the module_.
So e.g.:
# inspired by torchvision's SqueezeNet
class MyModule(nn.Module):
    def features(self, X):
        ...

    def forward(self, X):
        X = self.features(X)
        return self.classifier(X)

net = NeuralNet(MyModule, transform_method='features', ...)
net.fit(X, y).transform(X)
Then we would turn on a similar machinery as when calling predict_proba. The adjustments to the module required for this to work should be minimal (it's probably a good idea in the first place to split the forward method into functional sub-methods).
Regarding the implementation, I see two avenues:
One solution might still be the context manager as proposed above. If we expose that context manager as a public function, users could also use it to, for instance, make forward_iter return the output of an arbitrary method on the module.
Alternatively, we could allow the method name to be passed to forward_iter. The advantage would be that it's a little bit less "magic" than a context manager. The disadvantage is the rather deep call stack (transform -> forward_iter -> evaluation_step -> infer), with each method needing to drag along the method name.
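At the bottom of that call stack, the method-name approach would boil down to a getattr lookup. The sketch below is only an illustration: call_module_method and ToyModule are hypothetical names, and ToyModule is a plain-Python stand-in for an nn.Module with a features method.

```python
def call_module_method(module, X, method_name=None):
    """Call ``module.<method_name>(X)``, or ``module(X)`` when no name is given."""
    if method_name is None:
        return module(X)
    method = getattr(module, method_name, None)
    if not callable(method):
        raise AttributeError(f"module has no callable method {method_name!r}")
    return method(X)


class ToyModule:
    """Plain-Python stand-in for an nn.Module with a ``features`` method."""

    def features(self, X):
        return [x * 2 for x in X]

    def classifier(self, X):
        return sum(X)

    def __call__(self, X):
        return self.classifier(self.features(X))


module = ToyModule()
prediction = call_module_method(module, [1, 2, 3])               # full forward pass
embedding = call_module_method(module, [1, 2, 3], "features")    # features only
```

Every method between transform and infer would have to accept and forward method_name, which is the signature change discussed next.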
> Regarding the implementation, I see two avenues:
Passing the transform_method would also change the signatures of the functions. For this specific case I would prefer the context manager.
In a slightly different context, how would one go about saving and loading such a network? Given:
class MyModule(nn.Module):
    def __init__(...num_classes...):
        self.feature_extractor = FeatureExtractor(...)
        self.classifier = SomeClassifier(num_classes, ...)

    def forward(self, X):
        X = self.feature_extractor(X)
        return self.classifier(X)
If net.save_params only saves self.feature_extractor then resuming training for this class will be impossible.
However, saving MyModule in its entirety leads to its own problems:
- [...] num_classes.
- [...] state_dict into a FeatureExtractor instance would require load_params(strict=False).
- [...] strict need to be dynamic: to resume training we need strict=True, for inference/feature extraction we need strict=False.
Thanks!
@guybuk I understand your general idea but not all the details.
What function are you talking about when you mention load_params(strict=...)? I don't see any function with that name and argument. Or are you suggesting to add it?
Regarding the general problem: I think it would be difficult to handle this from the skorch side of things because there are just too many possibilities to encode all of them as arguments to net.save_params and net.load_params.
It might be better to handle this on the side of the module. Under the hood, skorch calls:
# save_params
torch.save(self.module_.state_dict(), ...)
# load_params
self.module_.load_state_dict(torch.load(f, ...))
It's more complicated than that but that's the meat. Therefore, if you want to only store and load some parts of your module, you could modify its state_dict and load_state_dict methods; for instance, you could pluck out certain values from the dict that are not required for prediction. My reasoning is that given that the person who defines the module should know best what is and is not required for it to work, it might be best to give control over this matter to that person.
Maybe there is a more canonical way of controlling this in pytorch but a quick search didn't reveal any. Do you think this would work for you?
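To illustrate the "pluck out certain values" idea: a state dict is an ordered mapping from dotted parameter names to tensors, so filtering it is ordinary dict work. The sketch below uses placeholder strings instead of tensors so it runs without torch; submodule_state is a hypothetical helper name, not a skorch or PyTorch function.

```python
from collections import OrderedDict


def submodule_state(state_dict, prefix):
    """Keep only entries under ``prefix`` and strip the prefix from the keys."""
    dotted = prefix + "."
    return OrderedDict(
        (key[len(dotted):], value)
        for key, value in state_dict.items()
        if key.startswith(dotted)
    )


# Placeholder values; in practice these would be tensors from module.state_dict().
full_state = OrderedDict([
    ("feature_extractor.dense.weight", "W0"),
    ("feature_extractor.dense.bias", "b0"),
    ("classifier.out.weight", "W1"),
    ("classifier.out.bias", "b1"),
])
encoder_state = submodule_state(full_state, "feature_extractor")
```

With the prefix stripped, the remaining keys have the names a standalone feature extractor would use for its own parameters, so no classifier entries are left over to trip a strict load.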
> What function are you talking about when you mention load_params(strict=...)? I don't see any function with that name and argument. Or are you suggesting to add it?
I was unclear. I was talking about net.load_params(..), which internally calls self.module_.load_state_dict(state_dict,[strict=True]) with strict being True by default.
I did manage to solve the problem using vanilla pytorch, but am wondering whether there isn't a more general solution that could be incorporated in skorch.
Since the discussion was about creating a NeuralNetClassifier and using a part of that network as a NeuralNetTransformer, let's assume the following:
class MyModule(nn.Module):
    def __init__(...num_classes...):
        self.feature_extractor = FeatureExtractor(...)
        self.classifier = SomeClassifier(num_classes, ...)

    def forward(self, X):
        X = self.feature_extractor(X)
        return self.classifier(X)

trainer = NeuralNetClassifier(MyModule...)
trainer.fit(...)
trainer.save_params(checkpoint=cp)
Wouldn't this be the desired behavior?:
resumed_trainer=NeuralNetClassifier(MyModule...)
resumed_trainer.load_params(cp)
resumed_trainer.fit(...) # resume training
and:
feature_extractor=NeuralNetTransformer(FeatureExtractor...)
feature_extractor.load_params(cp) # this will fail due to missing parameters in the state_dict
Due to my inexperience I don't know whether this is possible or how to approach it.
So if I understand your proposal correctly, you would like this last step to be successful?
If I'm not mistaken, the state_dict doesn't store any information about what it actually is (a FeatureExtractor or a MyModule e.g.). This is intentional, because even if the FeatureExtractor implementation is slightly changed, restoring the state dict will still work.
That means that when skorch's load_params sees the state dict, it cannot tell what it is supposed to match. Therefore, implementing your proposal looks very difficult to me. We would need to create complex logic that tries to match parameters based on their names. I can imagine this leading to all kinds of difficulties.
Do you think that my earlier proposal of overriding the state_dict and load_state_dict could work for your case? You could even make it depend on a parameter:
class MyModule(nn.Module):
    def __init__(self, num_classes=2, store_only_features=False):
        super().__init__()
        self.feature_extractor = FeatureExtractor()
        self.classifier = SomeClassifier(num_classes)
        self.store_only_features = store_only_features

    def forward(self, X):
        X = self.feature_extractor(X)
        # explicit None check: module truthiness is not a reliable test
        if self.classifier is not None:
            return self.classifier(X)
        return X

    def state_dict(self, *args, **kwargs):
        sd = super().state_dict(*args, **kwargs)
        if not self.store_only_features:
            return sd
        # keep only the feature extractor's parameters
        new_sd = {key: val for key, val in sd.items() if key.startswith('feature_extractor')}
        return new_sd

    def load_state_dict(self, state_dict, strict=True):
        if not self.store_only_features:
            return super().load_state_dict(state_dict, strict=strict)
        # drop the classifier and tolerate its keys being present in state_dict
        self.classifier = None
        return super().load_state_dict(state_dict, strict=False)
After trying to work on the saving/loading problem myself I realized those difficulties on my own. I dropped the attempt to save the feature extractor and the classifier separately. What I will most likely do is save the model as a pickle, and that way I won't be forced to "remember" the original pytorch module's constructor parameters.
Back to the original topic: for what it's worth, I implemented the NeuralNetTransformer like this:
class NeuralNetTransformer(NeuralNet, TransformerMixin):
    def transform(self, X):
        embedding = []
        for outs in self.forward_iter(X, training=False):
            outs = outs[1] if isinstance(outs, tuple) else outs
            embedding.append(to_numpy(outs))
        transforms = np.concatenate(embedding, 0)
        return transforms
I figured that if predict_proba can implicitly assume to take the first output from forward, transform could go by the same logic and take the second output. The sklearn API doesn't provide more functions that could take advantage of the forward method, so this logic shouldn't scale out of control.
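The "first output for predict_proba, second output for transform" convention can be sketched without torch or numpy. In the example below, collect_output is a hypothetical helper, and plain lists with itertools.chain stand in for per-batch tensors and np.concatenate.

```python
from itertools import chain


def collect_output(batches, index):
    """Pick output ``index`` from each batch (if it is a tuple) and concatenate."""
    picked = [outs[index] if isinstance(outs, tuple) else outs for outs in batches]
    return list(chain.from_iterable(picked))


# Each batch mimics a forward() result: (prediction, embedding).
batches = [
    ([0.9, 0.1], ["e0", "e1"]),
    ([0.4], ["e2"]),
]
predictions = collect_output(batches, 0)  # what predict_proba would gather
embeddings = collect_output(batches, 1)   # what transform would gather
```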
This looks like a reasonable solution. How do you deal with outputs being computed for the transform step that are actually not required? Or does this problem not apply here?
The problem does apply; however, the overhead is small, and this is a solution worth having rather than having none.
My two cents are that in general I'm unaware of any sklearn models that have both a predict and transform method to take inspiration from, and the closest thing to it would be an sklearn.Pipeline with two components that are two different parts of the network. This makes a lot of sense since the most common use of a transform with a neural network would be to use it in a pipeline with a knn or PCA or TSNE on top anyway. This of course is problematic because a pipeline trains each part separately, plus issues related to pytorch/backprop even if training together was possible.
> This of course is problematic because a pipeline trains each part separately, plus issues related to pytorch/backprop even if training together was possible.
Can you elaborate on that? What exactly are the issues here?
For me, it looks like it works. Here is a minimal example that builds upon your suggestion:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from skorch import NeuralNet
from skorch.utils import to_numpy
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

X, y = make_classification(1000, 20, n_informative=10, random_state=0)
X, y = X.astype(np.float32), y.astype(np.int64)

class AutoencoderModule(nn.Module):
    def __init__(
            self,
            input_units=20,
            bottleneck_units=5,
    ):
        super().__init__()
        self.input_units = input_units
        self.bottleneck_units = bottleneck_units
        self.reset_params()

    def reset_params(self):
        self.dense0 = nn.Linear(self.input_units, self.bottleneck_units)
        self.dense1 = nn.Linear(self.bottleneck_units, self.input_units)

    def forward(self, X, **kwargs):
        X_bottleneck = self.dense0(X)
        X_out = self.dense1(F.relu(X_bottleneck))
        X_rec = F.tanh(X_out)  # range -1..1
        return X_rec, X_bottleneck

class NeuralNetTransformer(NeuralNet, TransformerMixin):
    def get_loss(self, y_pred, y_true, X, **kwargs):
        y_pred, _ = y_pred
        return super().get_loss(y_pred, y_true=X, X=X, **kwargs)

    def transform(self, X):
        out = []
        for outs in self.forward_iter(X, training=False):
            outs = outs[1] if isinstance(outs, tuple) else outs
            out.append(to_numpy(outs))
        transforms = np.concatenate(out, 0)
        return transforms

pipe = Pipeline([
    ('scale', MinMaxScaler(feature_range=(-1, 1))),  # range -1..1
    ('net', NeuralNetTransformer(
        AutoencoderModule,
        criterion=nn.MSELoss,
        module__bottleneck_units=5,
        lr=0.5,
    )),
    ('clf', LogisticRegression()),
])

cross_val_score(pipe, X, y, scoring='accuracy')
# returns array([ 0.61 , 0.555, 0.72 , 0.485, 0.56 ])
My mistake, your example looks great!
Though you may have done it that way for simplicity, I'll say just in case that I'd probably move the get_loss method to a child class because some transformers still train on a supervised pretext task rather than unsupervised like an autoencoder.
Finally, as far as I can tell, the only issue left currently is wasted computation. Pytorch itself doesn't offer a neat solution to build upon (you need to call a submodule's forward), so I'm thinking there's no need to do much in skorch either. If anyone would like to explicitly save module_.encoder or module_.feature_extractor into a new NeuralNetTransformer, they can, just like in pytorch.
> Though you may have done it that way for simplicity, I'll say just in case that I'd probably move the get_loss method to a child class because some transformers still train on a supervised pretext task rather than unsupervised like an autoencoder.
Yes, this is a somewhat specific solution for this example. Other people may need slightly different solutions.
My conclusion for this issue so far is that it might not be best to provide a one-size-fits-all NeuralNetTransformer in skorch. Instead, I could see adding an example to the docs or a notebook as a better solution. This would encourage users to make their own adaptations, which will better fit their problems than anything ready-made that we could provide.
Later, if we find there is still a need for a built-in NeuralNetTransformer, we can always add it, but if we add a botched transformer, it will be hard to remove it later.
Just an idea, but what if we make it so the transform function adds a forward hook after the relevant layer, runs predict, and then removes it?
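class CustomException(Exception):
    """Carries the hooked layer's output out of the forward pass."""
    def __init__(self, layer_output):
        super().__init__()
        self.layer_output = layer_output

class Transformer(NeuralNet, TransformerMixin):
    def __init__(self, module, transform_layer, **kwargs):
        super().__init__(module, **kwargs)
        # key or index of the layer to hook; assumes module_ is indexable (e.g. nn.Sequential)
        self.transform_layer = transform_layer

    def transform(self, X):
        layer = self.module_[self.transform_layer]
        hook = layer.register_forward_hook(Transformer.early_stop)
        transforms = self.predict_proba(X)
        hook.remove()
        return transforms

    def infer(self, x, **fit_params):
        x = to_tensor(x, device=self.device)
        try:
            if isinstance(x, dict):
                x_dict = self._merge_x_and_fit_params(x, fit_params)
                return self.module_(**x_dict)
            return self.module_(x, **fit_params)
        except CustomException as e:
            # the hook fired: return the intermediate output instead
            return e.layer_output

    @staticmethod
    def early_stop(mod, input, output):
        # abort the forward pass once the desired layer has produced its output
        raise CustomException(output)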
The benefits I see:
- [...] forward output can remain clean (no need to return the transform in the module's forward)
The downside is obviously how ugly it feels to use exceptions in this way.
I haven't worked with forward hooks, so I'm not 100% confident I understand all ramifications of your proposal.
I agree that using exceptions as a form of control flow is a bit ugly. I assume that this is to stop the execution of the module in the middle, once the desired intermediate output has been generated. However, I wonder if this cannot have some weird interactions, e.g. when any form of parallelization is used.
Another addition that seems necessary is to make sure that hook.remove is always called, otherwise it could result in a nasty side effect. Presumably, using a context manager could work here.
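A context manager along those lines could look roughly like the sketch below. forward_hook is a hypothetical helper name; it relies only on register_forward_hook returning a handle with a remove() method, which is how torch.nn.Module behaves. FakeLayer is a stand-in that mimics that API so the example runs without torch.

```python
from contextlib import contextmanager


@contextmanager
def forward_hook(layer, hook_fn):
    """Register ``hook_fn`` on ``layer`` and always remove it on exit."""
    handle = layer.register_forward_hook(hook_fn)
    try:
        yield handle
    finally:
        # Executed on normal exit and on exceptions alike.
        handle.remove()


class FakeLayer:
    """Minimal stand-in mimicking the hook API of torch.nn.Module."""

    def __init__(self):
        self.hooks = []

    def register_forward_hook(self, fn):
        self.hooks.append(fn)
        layer = self

        class Handle:
            def remove(self):
                layer.hooks.remove(fn)

        return Handle()


layer = FakeLayer()
with forward_hook(layer, print):
    hooks_inside = len(layer.hooks)  # hook is registered here
hooks_after = len(layer.hooks)       # and gone again here
```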
Finally, this won't work if the transformer output is not the direct output of a sub-module. E.g.
def forward(self, X):
    Xe = self.encoder(X)
    Xt = Xe + 1  # <-- desired output
    Xd = self.decoder(Xt)
    return Xd
Not a big deal, but still.
Overall, I like the idea but feel it might be too hacky to include as a core skorch feature, since once it's added, we would need to maintain it and fix all bugs that could arise from it.
Hi,
To be really sklearn-ish I think it would be very nice to have a class NeuralNetTransformer which inherits from TransformerMixin and implements a .transform method.
Use cases: representation learning.
Example: use a VAE for dimensionality reduction then work in this new representation space (no joint learning). This is often done in semi supervised learning (e.g. M1-M2 model), reinforcement learning (e.g. world models), ... but could also enable the use of all the predictors in sklearn by first transforming using a VAE / pretrained CNN without last layers and then predict using any sklearn predictor (e.g. svm) or perform some clustering all in a single Pipeline.
How: the quickest is to have a transform function which does the same as predict (i.e. concatenate the first outputs of forward). But the representation is typically a temporary output of the model (i.e. the latent in a VAE or one of the middle layers in a CNN), so it would be better not to compute the whole forward pass if only the first few steps are needed. One could maybe have a .transform function in the model that only returns what is needed? Or maybe an is_transform flag to the forward?
The first method has the downside of doing more computation than needed and not being able to predict and transform with the same module (although this latter issue could be solved by using for example the second output of forward as transform). The second method is better but it requires the model to have an additional function / flag, which skorch was able to circumvent for the classifier and the regressor.
PS: thanks for the great library :)