skorch-dev / skorch

A scikit-learn compatible neural network library that wraps PyTorch
BSD 3-Clause "New" or "Revised" License

How to do Data Augmentation for training folds while doing cross validation? #735

Closed iCHAIT closed 3 years ago

iCHAIT commented 3 years ago

How can I do data augmentation on the training folds for each iteration of cross-validation?

Suppose I am doing 5-fold CV. I want to apply data augmentation (rotation and horizontal flip) to the training folds only, evaluate on the validation fold without applying any augmentation to it, and then repeat this for each split. How can I do this with skorch?
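
To spell out what I mean, this is roughly what I want to happen under the hood (a hand-rolled sketch; `augment` is just a stand-in for the rotation/flip pipeline):

from sklearn.model_selection import KFold

for train_idx, val_idx in KFold(n_splits=5).split(X):
    X_train = augment(X[train_idx])  # rotation + horizontal flip on training folds only
    X_val = X[val_idx]               # validation fold left untouched
    # fit on X_train / y[train_idx], evaluate on X_val / y[val_idx]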

I found these two links that show data augmentation in skorch, but they do not perform cross-validation -

BenjaminBossan commented 3 years ago

That's a good question. There is probably not a simple way to achieve this yet, which is why we put this issue on our roadmap.

Could you be more specific about how you want to augment your data (e.g. what libraries you want to use)?

There is also some debate in #362, but it's not exactly the same issue.

iCHAIT commented 3 years ago

So, what I am trying to achieve is actually quite similar to the transfer learning example of skorch (linked above).

I have some images and I am trying to do multi-class classification using a ResNet.

I am trying to apply the following transformations (quite similar to the above example) -

from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomRotation(degrees=15),
    transforms.RandomHorizontalFlip(),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
])

val_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
])
net = NeuralNetClassifier(
    PretrainedModel,
    ....
)

Then instead of training/validating on fixed sets -

# Here, `X_train` would already have gone through `train_transforms`
net.fit(X_train, y_train)

I want to do cross-validation, and I am wondering if skorch can do the heavy lifting of applying the respective transformations to the training and validation folds on the fly -

scores = cross_validate(net, X, y, scoring='accuracy', cv=k)

(Here, of course, X and y are the complete data.)

From an API point of view, I imagine NeuralNetClassifier could take the transformation strategies for the train and validation folds as arguments and then apply the relevant transformation on the fly to the respective folds.
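
For illustration, something along these lines (purely hypothetical; the train_transform/valid_transform arguments do not exist in skorch):

net = NeuralNetClassifier(
    PretrainedModel,
    train_transform=train_transforms,  # hypothetical: applied to training folds only
    valid_transform=val_transforms,    # hypothetical: applied to validation folds only
)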

Hope that makes sense.

BenjaminBossan commented 3 years ago

Thanks for providing more context. Could you tell me a bit about your data? Is it contained in a numpy array, or are you using something from torchvision?

I've been thinking about this use case but couldn't come up with an easy solution just yet. The main difficulty for me is that different transforms are to be applied to the training data and to the validation/test data. At the moment, we don't make that distinction at this step, but I see no reason not to.

@ottonemo @thomasjpfan What's your opinion on providing the training=True/False argument to get_dataset? With the standard implementation, it would be unused, but for a case like this, it could be useful. Specifically, I'm thinking about overriding get_dataset with something like this:

def get_dataset(self, X, y=None, training=False):
    dataset = super().get_dataset(X, y, training=training)
    if training:
        return apply_train_transforms(dataset)
    return apply_test_transforms(dataset)
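
Here, apply_train_transforms and apply_test_transforms are just placeholders. Assuming a dataset that exposes a mutable transform attribute (as torchvision-style datasets do), they could be as simple as:

def apply_train_transforms(dataset):
    # hypothetical helper: attach the augmenting transforms for training
    dataset.transform = train_transforms
    return dataset

def apply_test_transforms(dataset):
    # hypothetical helper: attach the deterministic transforms for validation/test
    dataset.transform = val_transforms
    return dataset
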
iCHAIT commented 3 years ago

@BenjaminBossan

Is it contained in a numpy array,

Essentially, yes!

So basically, for the original dataset, I have annotations for the images stored in a CSV file with two columns (image path and label) -

I break this CSV file into train.csv (90%) and test.csv (10%) using sklearn's train_test_split (so essentially my train and test sets are fixed).

After obtaining the two files, I am using the following code to prepare my data for training in PyTorch -

import pandas as pd
import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset

class MyDataset(Dataset):
    def __init__(self, csv_file, transform=None):
        self.annotations = pd.read_csv(csv_file)
        self.transform = transform

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, index):
        # column 0 holds the image path, column 1 the label
        img_path = self.annotations.iloc[index, 0]

        image = Image.open(img_path)
        image = image.convert('RGB')

        y_label = torch.tensor(int(self.annotations.iloc[index, 1]))

        if self.transform:
            image = self.transform(image)

        return (image, y_label)

train_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.RandomRotation(degrees=15),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
])

val_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
])

train_dataset = MyDataset(csv_file='train.csv', transform=train_transforms)

val_dataset = MyDataset(csv_file='test.csv', transform=val_transforms)

train_set = []
val_set = []

for split in ['train', 'val']:
    if split == 'train':
        data_loader = DataLoader(dataset=train_dataset)
    else:
        data_loader = DataLoader(dataset=val_dataset)

    for image, target in data_loader:
        image = image.to(device=device)
        target = target.to(device=device)

        # Passing images through the pretrained ResNet and extracting features from each image
        feature = resnet_feature_extractor(image).to(device)

        # The second-to-last layer of the ResNet has 512 elements
        final_feat_vec = feature.view(512)

        if split == 'train':
            train_set.append((final_feat_vec.cpu().detach().numpy(), target.cpu().detach().numpy()))
        else:
            val_set.append((final_feat_vec.cpu().detach().numpy(), target.cpu().detach().numpy()))

Now, after obtaining train_set and val_set, I create train_loader and test_loader out of these using torch.utils.data.DataLoader.

Finally, I create a simple feedforward neural network, train it on train_loader, and then test it on test_loader.

Apologies for the long code, but I guess it would make things clear. Please let me know if you want more info.

thomasjpfan commented 3 years ago

What's your opinion on providing the training=True/False argument to get_dataset?

I would be +0, because I would prefer the augmentations be handled outside of the NeuralNet object. Technically, I think it may even be more complicated, because the training set can also be split into a validation set. This validation set most likely would not want RandomRotation, etc.

BenjaminBossan commented 3 years ago

I would prefer the augmentations be handled outside of the NeuralNet object

I totally agree on this. Unfortunately, we don't have a good way of doing this yet.

because the training set can also be split into a validation set

Do you mean for the skorch-internal validation set? That should call get_dataset with training=False.

thomasjpfan commented 3 years ago

I was thinking about the line:

scores = cross_validate(net, X, y, scoring='accuracy', cv=k)

with the external validation done by cross_validate. get_dataset with training=True/False would handle training and the skorch-internal validation. I think there could also be preprocessing during test time, aka during predict. We can always keep it simple and assume that get_dataset(training=False) is the same for validation and testing (predict).
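
Roughly, the mapping I have in mind (a sketch of the idea, not actual skorch code):

# net.fit(X, y)
#     training folds        -> get_dataset(..., training=True)
#     skorch-internal valid -> get_dataset(..., training=False)
# net.predict(X)
#     test data             -> get_dataset(..., training=False)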

BenjaminBossan commented 3 years ago

I think there could also be preprocessing during test time, aka during predict.

Yeah, we had the discussion at one point with @ottonemo about whether there should be a 3rd option, i.e. train vs validation vs predict. I can't find it right now. I believe it's not easy to implement, since 1) we use a boolean right now and 2) we mirror the training vs eval distinction that PyTorch uses as well.

We can always keep it simple and assume that get_dataset(training=False) is the same for validation and testing (predict).

At least to me, it doesn't sound like we would lose anything if we did.

BenjaminBossan commented 3 years ago

I thought some more about this and I believe we don't need to adjust get_dataset since the same can essentially be achieved through get_iterator. So the code could be adjusted roughly like this:

train_dataset = MyDataset(csv_file='train.csv', transform=None)
val_dataset = MyDataset(csv_file='test.csv', transform=None)

class MyNet(NeuralNet):
    def get_iterator(self, dataset, training=False):
        if training:
            dataset.transform = train_transforms
        else:
            dataset.transform = val_transforms
        return super().get_iterator(dataset, training=training)

Maybe we can elaborate on this approach, making it more generic so that this is doable without overriding get_iterator.

The problem with implementing this on get_dataset would be that the skorch-internal train/valid split is performed after get_dataset is called, so any transforms attached during get_dataset would be used for both train and valid.
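
In pseudocode, what happens during fit is roughly this (simplified from skorch's internals):

dataset = self.get_dataset(X, y)                          # transforms attached here...
dataset_train, dataset_valid = self.train_split(dataset)  # ...would end up in both splits
train_iter = self.get_iterator(dataset_train, training=True)
valid_iter = self.get_iterator(dataset_valid, training=False)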

BenjaminBossan commented 3 years ago

@iCHAIT did you make any progress on this?

iCHAIT commented 3 years ago

@BenjaminBossan I haven't made any progress on this sadly.

BenjaminBossan commented 3 years ago

Too bad. Did you try out my suggestion above? If it didn't work, please post the error and maybe we can help out.

iCHAIT commented 3 years ago

I haven't tried that yet, I got pulled into different things and couldn't spend more time on this. Feel free to close this issue for now if you want. I shall give it a try when I have some more time on my hands.

Appreciate the help and the discussion on this :)

BenjaminBossan commented 3 years ago

Okay, I'll close for now. Feel free to re-open if you tried something new or have further questions.

dnouri commented 3 years ago

The way I ended up solving this in my case was to create an iterator factory function, something like this:

from copy import copy

import torch

def make_iterator(dataset, training, **kwargs):
    predict = getattr(dataset, 'y', object()) is None
    if not training:
        # work on a copy so the original dataset keeps its transforms
        dataset = copy(dataset)
        dataset.transform = None
    return torch.utils.data.DataLoader(dataset, **kwargs)

I would then parameterize this and use it both as iterator_train and iterator_test, like so:

from functools import partial

net = NeuralNet(
    # stuff...,
    iterator_train=partial(make_iterator, training=True),
    iterator_test=partial(make_iterator, training=False),
    dataset=SomeDataSet(transform=MyTransform()),  # a DataSet with transforms
)
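
Note that iterator_test is not only used during predict: skorch also uses it for the skorch-internal validation split during fit, so the validation data is evaluated without augmentation there as well.
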
BenjaminBossan commented 3 years ago

Daniel, thanks for posting your solution. Indeed, this is easier than overriding get_iterator as I had suggested.

    predict = getattr(dataset, 'y', object()) is None

This line is unnecessary for the example, right?

dataset = copy(dataset)

Good addition; in my original example, I mutated the dataset.