skorch-dev / skorch

A scikit-learn compatible neural network library that wraps PyTorch
BSD 3-Clause "New" or "Revised" License

How to read input_dim from fit method? #584

Closed gennaro-tedesco closed 4 years ago

gennaro-tedesco commented 4 years ago

I am trying to incorporate PyTorch functionalities into a scikit-learn environment (in particular Pipelines and GridSearchCV) and therefore have been looking into skorch. The standard documentation example for neural networks looks like

class MyModule(nn.Module):
    def __init__(self, num_units=10, nonlin=F.relu):
        super(MyModule, self).__init__()

        self.dense0 = nn.Linear(20, num_units)
        self.nonlin = nonlin
        self.dropout = nn.Dropout(0.5)
        ...
        ...
        self.output = nn.Linear(10, 2)

where you explicitly pass the input and output dimensions by hardcoding them into the constructor. However, this is not really how scikit-learn interfaces work, where the input and output dimensions are derived by the fit method rather than being explicitly passed to the constructors. As a practical example consider

# copied from the documentation
net = NeuralNetClassifier(
    MyModule,
    max_epochs=10,
    lr=0.1,
    # Shuffle training data on each epoch
    iterator_train__shuffle=True,
)

# Pipeline interface
pipeline = Pipeline([
        ('transformation', AnyTransformer()),
        ('net', net)
        ])

gs = GridSearchCV(pipeline, params, refit=False, cv=3, scoring='accuracy')
gs.fit(X, y)

Besides the fact that transformers never require you to specify the input and output dimensions, the transformers applied before the model may change the dimensionality of the training set (think of dimensionality reduction and the like), so hardcoding the input and output dimensions when constructing the neural network will not do.

Did I misunderstand how this is supposed to work? If not, what would be a suggested solution? (I was thinking of specifying the constructors into the forward method where you do have X available for fit already, but I am not sure this is good practice.)

BenjaminBossan commented 4 years ago

You are correct, in sklearn the input dimensions are inferred from the data, which is not explicitly supported by skorch. The reason is that PyTorch does not support shape inference, and if skorch tried to enforce the number of input units, it would put strong restrictions on the underlying PyTorch module and data.
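For context, the statement above describes the PyTorch releases available when this comment was written. Later PyTorch releases added lazy modules such as nn.LazyLinear, which infer in_features from the first batch they see. The following is only a minimal sketch of that behaviour in plain PyTorch; LazyClassifier is a hypothetical name, and whether lazy modules interact well with skorch's initialization is not something this thread covers.

```python
import torch
from torch import nn


class LazyClassifier(nn.Module):
    """Hypothetical module whose first layer infers its input width lazily."""

    def __init__(self, num_units=10):
        super().__init__()
        # Only the output width is given; in_features is filled in on first use.
        self.dense0 = nn.LazyLinear(num_units)
        self.output = nn.Linear(num_units, 2)

    def forward(self, X):
        X = torch.relu(self.dense0(X))
        return torch.softmax(self.output(X), dim=-1)


model = LazyClassifier()
# The first forward pass materializes dense0 with in_features taken from X.
probs = model(torch.randn(4, 7))
inferred_in_features = model.dense0.in_features
weight_shape = tuple(model.dense0.weight.shape)
```

The layer sizes are fixed after that first call, so a later batch with a different number of features would still fail.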

Of course, this can be problematic when the input dimensions are not known before runtime. Here is a snippet of code that should work the way you need it (adapted from this notebook):

import torch.nn.functional as F
from skorch import NeuralNetClassifier
from torch import nn

class ClassifierModule(nn.Module):
    def __init__(
            self,
            input_units=10,
            num_units=10,
            nonlin=F.relu,
            dropout=0.5,
    ):
        super(ClassifierModule, self).__init__()
        # Stored so that MyNet.check_data can compare it with X.shape[1].
        self.input_units = input_units
        self.num_units = num_units
        self.nonlin = nonlin

        self.dense0 = nn.Linear(input_units, num_units)
        self.dropout = nn.Dropout(dropout)
        self.dense1 = nn.Linear(num_units, 10)
        self.output = nn.Linear(10, 2)

    def forward(self, X, **kwargs):
        X = self.nonlin(self.dense0(X))
        X = self.dropout(X)
        X = F.relu(self.dense1(X))
        X = F.softmax(self.output(X), dim=-1)
        return X

class MyNet(NeuralNetClassifier):
    def check_data(self, X, y):
        super().check_data(X, y)

        if self.module_.input_units != X.shape[1]:
            self.set_params(module__input_units=X.shape[1])
            self.initialize()

Note how there is the parameter input_units on the module that we use to dynamically re-initialize the module based on the input data. This piece of code may need to be adapted to your module, data, etc., but I'm sure you'll figure it out.

> I was thinking of specifying the constructors into the forward method where you do have X available for fit already, but I am not sure this is good practice

This doesn't sound like a good idea since you would re-initialize your module each time you call forward, which is not what you want. Please try my proposal and report back if there's any problem.
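To illustrate the point about forward, here is a minimal sketch in plain PyTorch (the class names LayersInForward and LayersInInit are hypothetical). A layer created inside forward gets fresh random weights on every call and is never registered as a parameter, so an optimizer has nothing to train.

```python
import torch
from torch import nn


class LayersInForward(nn.Module):
    """Anti-pattern: builds a new, randomly initialized layer on every call."""

    def forward(self, X):
        dense = nn.Linear(X.shape[1], 2)  # fresh weights each time
        return dense(X)


class LayersInInit(nn.Module):
    """Usual pattern: the layer is created once and registered on the module."""

    def __init__(self, input_units):
        super().__init__()
        self.dense = nn.Linear(input_units, 2)

    def forward(self, X):
        return self.dense(X)


X = torch.randn(4, 5)
bad, good = LayersInForward(), LayersInInit(input_units=5)

# Same input twice: the anti-pattern gives different outputs on each call.
bad_outputs_match = torch.equal(bad(X), bad(X))
good_outputs_match = torch.equal(good(X), good(X))

# Nothing is registered on the anti-pattern, so there is nothing to optimize.
num_bad_params = len(list(bad.parameters()))
num_good_params = len(list(good.parameters()))
```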

BenjaminBossan commented 4 years ago

@gennaro-tedesco I added some documentation about this in #585

gennaro-tedesco commented 4 years ago

I am trying to get your example to work. Shouldn't the new class MyNet that you defined inherit from ClassifierModule, or does it just return an instance of it (after the initialisation)? I guess the class that must eventually be part of the pipeline is MyNet, but I do not understand why check_data(self, X, y) (and the subsequent initialisation) is not included in the classifier directly, rather than creating a new class to do the check.

...or again perhaps I am misunderstanding how all these objects are supposed to work together (so correct me if I am wrong).

BenjaminBossan commented 4 years ago

I think you're misunderstanding the example. You would pass the ClassifierModule as the module to MyNet, e.g. like this:

net = MyNet(ClassifierModule, <other-parameters>)
pipeline = Pipeline([
    ('some-transformer', ...),
    ...
    ('net', net),
])
pipeline.fit(X, y)

Does that make it more clear?

gennaro-tedesco commented 4 years ago

Thank you again for providing help; at the moment I am still getting a few errors on object types and the like (for instance TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found object) or argument errors when I try to pass additional parameters.

I will debug it myself, do not worry.

BenjaminBossan commented 4 years ago

Good luck with debugging. In case this specific issue is resolved but you cannot resolve the new error, feel free to close this issue and open another one.
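The thread does not record what caused the default_collate error quoted above. One common trigger, offered here only as an assumption, is a NumPy array with dtype object reaching the data loader, for example after converting a table with mixed column types. The sketch below reproduces the message on hypothetical data and shows that casting to float32 first avoids it.

```python
import numpy as np
import torch
from torch.utils.data.dataloader import default_collate

# Hypothetical data: numeric values stored in an object-dtype array.
X_obj = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=object)

try:
    default_collate([X_obj[0], X_obj[1]])
    error_message = None
except TypeError as exc:
    # Object arrays cannot be turned into tensors, so collation refuses them.
    error_message = str(exc)

# Casting to a numeric dtype before fitting lets the rows be stacked normally.
X_float = X_obj.astype(np.float32)
batch = default_collate([X_float[0], X_float[1]])
```

If this is the cause, checking X.dtype at the end of the pipeline, just before the net, is a quick way to confirm it.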

gennaro-tedesco commented 4 years ago

Thank you, yes, we can consider this issue as closed.

smith558 commented 1 year ago

> Thank you again for providing help; at the moment I am still getting a few errors on object types and the like (for instance TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found object) or argument errors when I try to pass additional parameters.
>
> I will debug it myself, do not worry.

How did you solve this error?

smith558 commented 1 year ago

@gennaro-tedesco The lack of automatic dimension inference really makes skorch work poorly with existing scikit-learn code and pipelines. This is a bit disappointing, since all of scikit-learn is built on the assumption that this feature exists, and code that uses the library is designed around it. The similar library scikeras has this feature. Was there no attempt to add it to skorch too?

BenjaminBossan commented 1 year ago

@smith558 You are commenting on a 3-year-old closed issue, so it is unlikely you will get a response from the initial author.

Regarding automatic shape inference: the reason skorch does not support this out of the box is that skorch does not deal with the model itself (i.e. the nn.Module), which is where the logic for shape inference would need to live. These modules are often complicated (what would it mean to infer the shape of a transformer model?), so there is no one-size-fits-all solution that skorch could provide.

That being said, for simple cases like MLPs, you can use InputShapeSetter. For anything more complicated, you need to set the shape some other way.