skorch-dev / skorch

A scikit-learn compatible neural network library that wraps PyTorch

How to create a skorch Dataset from a torch Dataset for grid search cross-validation using GridSearchCV() #443

Closed Tsakunelson closed 5 years ago

Tsakunelson commented 5 years ago

We can define a TrainValidDataset that supports the full torch.utils.data.Dataset interface:

import torch

class TrainValidDataset(torch.utils.data.Dataset):
    """Present a train and a valid dataset as one concatenated dataset."""
    def __init__(self, train_ds, valid_ds):
        self.train_ds = train_ds
        self.valid_ds = valid_ds

    def __len__(self):
        return len(self.train_ds) + len(self.valid_ds)

    def __getitem__(self, i):
        # indices below len(train_ds) address the train set,
        # the remaining ones address the valid set
        if i < len(self.train_ds):
            return self.train_ds[i]
        i = i - len(self.train_ds)
        return self.valid_ds[i]
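For context, a minimal usage sketch with torchvision ImageFolder datasets like the ones used below; the directory paths are hypothetical:

from torchvision.datasets import ImageFolder

# hypothetical paths, for illustration only
train_ds = ImageFolder('./ck_dataset/train')
valid_ds = ImageFolder('./ck_dataset/valid')
full_ds = TrainValidDataset(train_ds, valid_ds)  # behaves like a single Dataset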

Hello @thomasjpfan I attempted to apply this procedure for K-fold CV with torch ImageFolder datasets, but ran into a TypeError (the stack trace was attached as a screenshot).

Here is the code sample:

data = load_data.load('./poorly_cohessive/ck_dataset')
train_ds, valid_ds = data[0], data[1]

net = NeuralNetClassifier(
    PretrainedModel,
    criterion=nn.CrossEntropyLoss,
    lr=0.001,
    batch_size=32,
    max_epochs=25,
    module__output_features=2,
    optimizer=optim.SGD,
    optimizer__momentum=0.9,
    iterator_train__shuffle=True,
    iterator_train__num_workers=4,
    iterator_valid__shuffle=True,
    iterator_valid__num_workers=4,
    train_split=predefined_split(valid_ds),
    callbacks=[lrscheduler, checkpoint, freezer],
    device='cuda' # comment out to train on CPU
)
from sklearn.model_selection import GridSearchCV
params = {
    'lr': [0.01, 0.02],
    'max_epochs': [10, 20]
    #'module__num_units': [10, 20],
}
gs = GridSearchCV(net, params, refit=False, cv=5, scoring='accuracy')
gs.fit(train_ds, y=None)
print(gs.best_score_, gs.best_params_)

Did I miss any other information? I am quite new to Skorch. Feedback much appreciated. I sent you a tweet on this :)

Originally posted by @Tsakunelson in https://github.com/dnouri/skorch/issues/282#issuecomment-470359481

thomasjpfan commented 5 years ago

Using GridSearchCV fails in this case because sklearn does not have the concept of a pytorch Dataset. If your data can fit into memory, you can place your data into numpy arrays, X and y, and use them to train your model.
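For example, a minimal sketch of materializing a small Dataset into arrays, assuming ds yields (Xi, yi) pairs whose Xi are already numeric arrays and the whole set fits in memory:

import numpy as np

# assumes ds is small enough to materialize in full
X = np.stack([ds[i][0] for i in range(len(ds))])
y = np.asarray([ds[i][1] for i in range(len(ds))])
net.fit(X, y)  # sklearn tools like GridSearchCV now work on X and y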

BenjaminBossan commented 5 years ago

Maybe we could provide a wrapper for Datasets so that they can work with grid search, similar to our SliceDict. However, that wrapper would be more complex because it would probably need to rely on pytorch's default_collate to get from individual samples to batches.
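For illustration, this is roughly what default_collate does (an illustrative snippet, not skorch API):

import numpy as np
from torch.utils.data.dataloader import default_collate

# default_collate turns a list of individual samples into one batch tensor
samples = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
batch = default_collate(samples)
print(batch.shape)  # torch.Size([2, 2])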

Tsakunelson commented 5 years ago

> Using GridSearchCV fails in this case because sklearn does not have the concept of a pytorch Dataset. If your data can fit into memory, you can place your data into numpy arrays, X and y, and use them to train your model.

The data is really large; it can't fit into memory all at once. It consists of thousands of 256×256 image patches extracted from Whole Slide Images. In the PyData presentation you gave, how did you choose your best model's hyperparameters? Was that through grid search? If so, how did you load the large chunk of data, for instance for the Kaggle nuclei segmentation experiment?

Tsakunelson commented 5 years ago

> Maybe we could provide a wrapper for Datasets so that they can work with grid search, similar to our SliceDict. However, that wrapper would be more complex because it would probably need to rely on pytorch's default_collate to get from individual samples to batches.

This would be helpful @BenjaminBossan. Could you suggest a temporary fix, if you've dealt with this issue in the past? I am quite new to skorch.

thomasjpfan commented 5 years ago

@Tsakunelson Since neural networks are already so overparameterized, exploring the hyperparameter space doesn't really gain you much performance. Usually, trying out different architectures and using techniques such as cyclic learning rates, mixup data augmentation, or test-time augmentation will be more fruitful.
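As an illustration, a minimal sketch of cyclic learning rates via skorch's LRScheduler callback; the bounds are placeholders, and details may vary with the torch/skorch version:

from torch.optim.lr_scheduler import CyclicLR
from skorch.callbacks import LRScheduler

# cycle the learning rate between base_lr and max_lr; placeholder values
cyclic_lr = LRScheduler(policy=CyclicLR, base_lr=0.001, max_lr=0.01)
# then pass callbacks=[cyclic_lr] when constructing the NeuralNetClassifier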

BenjaminBossan commented 5 years ago

Thomas is right that as soon as you have a large dataset, doing a grid search is not viable (unless you have a ton of resources or a lot of time). Mostly it is best to adjust hyperparameters by hand. E.g., you know that if you decrease the learning rate, you need to train for more epochs, whereas a naive grid search just tries all combinations.

That being said, I have hacked together a solution that should hopefully work. It should be helpful even if you don't use grid search -- e.g. it should work with sklearn's cross_val_predict etc.

# using the example from the README
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from torch import nn
import torch.nn.functional as F
from torch.utils.data.dataloader import default_collate

from skorch import NeuralNetClassifier
from skorch.dataset import Dataset

X, y = make_classification(1000, 20, n_informative=10, random_state=0)
X = X.astype(np.float32)
y = y.astype(np.int64)

class MyModule(nn.Module):
    def __init__(self, num_units=10, nonlin=F.relu):
        super(MyModule, self).__init__()

        self.dense0 = nn.Linear(20, num_units)
        self.nonlin = nonlin
        self.dropout = nn.Dropout(0.5)
        self.dense1 = nn.Linear(num_units, 10)
        self.output = nn.Linear(10, 2)

    def forward(self, X, **kwargs):
        X = self.nonlin(self.dense0(X))
        X = self.dropout(X)
        X = F.relu(self.dense1(X))
        X = F.softmax(self.output(X), dim=-1)
        return X

# pack the data into a Dataset
class MyDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, i):
        Xi = self.X[i]
        yi = self.y[i]
        # self.transform is inherited from skorch's Dataset base class
        return self.transform(Xi, yi)

ds = MyDataset(X, y)

class SliceDatasetX(Dataset):
    """Helper class that wraps a torch dataset to make it work with sklearn"""
    def __init__(self, dataset, collate_fn=default_collate):
        self.dataset = dataset
        self.collate_fn = collate_fn

        self._indices = list(range(len(self.dataset)))

    def __len__(self):
        return len(self.dataset)

    @property
    def shape(self):
        # sklearn determines the number of samples via X.shape[0]
        return len(self),

    def __getitem__(self, i):
        # a single index returns a single sample ...
        if isinstance(i, (int, np.integer)):
            Xb = self.transform(*self.dataset[i])[0]
            return Xb

        if isinstance(i, slice):
            i = self._indices[i]

        # ... while a slice or an index array (as produced by sklearn's
        # CV splitters) returns a collated batch
        Xb = self.collate_fn([self.transform(*self.dataset[j])[0] for j in i])
        return Xb

params = {
    'lr': [0.01, 0.02],
    'max_epochs': [10, 20],
    'module__num_units': [10, 20],
}
net = NeuralNetClassifier(
    MyModule,
    max_epochs=10,
    lr=0.1,
    iterator_train__shuffle=True,
    verbose=False,
    train_split=None,
)
gs = GridSearchCV(net, params, refit=False, cv=3, scoring='accuracy')

# we have to extract the target data, otherwise sklearn will complain
y_from_ds = np.asarray([ds[i][1] for i in range(len(ds))])
# wrap the dataset into the new helper class
ds_sliceable = SliceDatasetX(ds)

gs.fit(ds_sliceable, y_from_ds)

Note that this implementation might not be the most efficient, but at least for me it worked.
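As mentioned above, the same wrapper should also work with other sklearn utilities; a quick sketch using the names from the snippet:

from sklearn.model_selection import cross_val_predict

# the wrapped dataset plays the role of X, the extracted targets of y
y_pred = cross_val_predict(net, ds_sliceable, y_from_ds, cv=3)
print((y_pred == y_from_ds).mean())  # rough accuracy estimate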

Tsakunelson commented 5 years ago

Thank you @BenjaminBossan, it works great with this wrapper. Either way, in case I run short of time, I will also try cyclic learning rates and test-time augmentation with just NeuralNet.fit() instead of a full hyperparameter search. @thomasjpfan could you point me to resources for the stated techniques? Thanks

thomasjpfan commented 5 years ago

Our Unet tutorial uses cyclic learning rates with skorch. TTA has been discussed on the skorch issue board here.

Tsakunelson commented 5 years ago

@thomasjpfan @BenjaminBossan have you by any chance implemented the inception_v3 model with skorch? I receive a RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM when I change the resnet18 model to inception_v3 in the NeuralNet module (screenshot of the error attached).

Actually, I am implementing the "Metastatic Breast Cancer detection" paper, which won first prize in the 2017 CAMELYON challenge with an accuracy of over 0.9. I get a validation accuracy of just 0.79 with resnet18 and I believe the performance should improve with inception_v3.

Any help?

BenjaminBossan commented 5 years ago

This error actually doesn't seem to be related to skorch. I would assume that the same thing would happen without skorch. Please try to search for a general answer. For example, check that the input image size is correct when you switch the model.
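For instance, an illustrative sketch (not a confirmed fix): inception_v3 expects 299x299 inputs, whereas resnet18 is usually fed 224x224, and in training mode inception_v3 returns auxiliary logits by default:

from torchvision import models, transforms

# inception_v3 expects 299x299 inputs, unlike resnet18's usual 224x224
preprocess = transforms.Compose([
    transforms.Resize((299, 299)),
    transforms.ToTensor(),
])

# in training mode inception_v3 returns auxiliary logits by default,
# which many training setups (and loss criteria) do not expect
model = models.inception_v3(pretrained=True)
model.aux_logits = False  # forward() then returns only the main output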

If you find that the error only occurs in skorch, please open a new issue, since it's not related to the original topic.

BenjaminBossan commented 5 years ago

@Tsakunelson Any updates on this? Otherwise I assume that the problem has been solved and I'll close this issue in a couple of days.