Closed iCHAIT closed 3 years ago
That's a good question. There is probably not a simple way to achieve this yet, which is why we put this issue on our roadmap.
Could you be more specific about how you want to augment your data (e.g. what libraries you want to use)?
There is also some debate in #362 but it's not exactly the same issue.
So, what I am trying to achieve is actually quite similar to the transfer learning example of skorch (linked above).
I have some images and I am trying to do multi-class classification using a ResNet.
I am trying to apply the following transformations (quite similar to the ones in the example above):
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomRotation(degrees=15),
    transforms.RandomHorizontalFlip(),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])

val_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])

net = NeuralNetClassifier(
    PretrainedModel,
    ....
)
Then instead of training/validating on fixed sets -
## Here `X_train` would have gone through the transformation - `train_transforms`
net.fit(X_train, y_train)
I want to do cross-validation and I am wondering if skorch can do the heavy lifting of applying the respective transformations for training and validation folds, on the fly -
scores = cross_validate(net, X, y, scoring=('accuracy'), cv=k)
(Here, of course, X and y are the complete data.)
From the API point of view, I guess NeuralNetClassifier could take the relevant transformation strategies for the train and val folds as arguments and, consequently, apply the relevant transformation on the fly to the respective folds.
Hope that makes sense.
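For reference, the per-fold behaviour being asked for can be sketched without any skorch support by wrapping index subsets of one untransformed dataset. This is only an illustration under assumptions: `TransformedSubset` and `kfold_indices` are hypothetical helpers (not part of skorch or torchvision), the folds are contiguous and unshuffled, and the two lambdas stand in for `train_transforms` and `val_transforms`.

```python
class TransformedSubset:
    """Map-style view over `base[indices]` that applies its own transform.

    Hypothetical helper: it only implements `__len__` and `__getitem__`,
    the protocol a map-style dataset needs.
    """

    def __init__(self, base, indices, transform=None):
        self.base = base
        self.indices = list(indices)
        self.transform = transform

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        x, y = self.base[self.indices[i]]
        if self.transform is not None:
            x = self.transform(x)
        return x, y


def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if not start <= i < start + size]
        yield train_idx, val_idx
        start += size


# Toy stand-ins: samples are (int, label) pairs, transforms just tag the input.
base = [(i, i % 2) for i in range(10)]
train_tf = lambda x: ("augmented", x)  # stands in for train_transforms
val_tf = lambda x: ("plain", x)        # stands in for val_transforms

folds = []
for train_idx, val_idx in kfold_indices(len(base), k=5):
    train_ds = TransformedSubset(base, train_idx, transform=train_tf)
    val_ds = TransformedSubset(base, val_idx, transform=val_tf)
    folds.append((train_ds, val_ds))
```

Each fold then sees the augmenting transform only on its training indices, while the held-out indices go through the plain one. The underlying samples are never copied.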
Thanks for providing more context. Could you tell me a bit about your data? Is it contained in a numpy array, or are you using something from PyTorch vision?
I'm thinking about this use case but couldn't come up with any easy solution just yet. The main difficulty for me is that different transforms are to be applied to the training data and to the validation/test data. At the moment, we don't make the differentiation at this step, but I see no reason not to.
@ottonemo @thomasjpfan What's your opinion on providing the training=True/False argument to get_dataset? With the standard implementation, it would be unused, but for a case like this, it could be useful. Specifically, I'm thinking about overriding get_dataset with something like this:
def get_dataset(self, X, y=None, training=False):
    # `training` is the proposed new argument; super() must be called, not referenced
    dataset = super().get_dataset(X, y, training=training)
    if training:
        return apply_train_transforms(dataset)
    return apply_test_transforms(dataset)
@BenjaminBossan
Is it contained in a numpy array,
Essentially, yes!
So basically, for the original dataset, I have annotations for the images stored in a CSV file with 2 columns -
Image_Path - Path of the image
Label/Target - 1/2/3/4
I break this CSV file into train.csv (90%) and test.csv (10%) using sklearn's train_test_split. (So essentially my train and test sets are fixed.)
After obtaining the two files, I am using the following code to prepare my data for training in pytorch -
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, csv_file, transform=None):
        self.annotations = pd.read_csv(csv_file)
        self.transform = transform

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, index):
        # Column 0 holds the image path, column 1 the label
        img_path = self.annotations.iloc[index, 0]
        image = Image.open(img_path)
        image = image.convert('RGB')
        y_label = torch.tensor(int(self.annotations.iloc[index, 1]))
        if self.transform:
            image = self.transform(image)
        return (image, y_label)
train_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.RandomRotation(degrees=15),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])

val_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])
train_dataset = MyDataset(csv_file='train.csv', transform=train_transforms)
val_dataset = MyDataset(csv_file='test.csv', transform=val_transforms)

train_set = []
val_set = []
for dataset in ['train', 'val']:
    if dataset == 'train':
        data_loader = DataLoader(dataset=train_dataset)
    else:
        data_loader = DataLoader(dataset=val_dataset)
    for image, target in data_loader:
        image = image.to(device=device)
        target = target.to(device=device)
        # Passing images through the respective trained resnets and extracting features from each image
        feature = resnet_feature_extractor(image).to(device)
        # Second last layer of ResNet has 512 elements
        final_feat_vec = feature.view(512)
        if dataset == 'train':
            train_set.append((final_feat_vec.cpu().detach().numpy(), target.cpu().detach().numpy()))
        else:
            val_set.append((final_feat_vec.cpu().detach().numpy(), target.cpu().detach().numpy()))
Now, after obtaining the train_set and val_set, I create train_loader and test_loader out of these using torch.utils.data.DataLoader.
Finally, I create a simple feedforward neural network for training on train_loader and then test on test_loader.
Apologies for the long code, but I guess it would make things clear. Please let me know if you want more info.
What's your opinion on providing the training=True/False argument to get_dataset?
I would be +0, because I would prefer the augmentations be handled outside of the NeuralNet object. Technically, I think it may even be more complicated because the training set can also be split into a validation set. This validation set would most likely not want RandomRotation, etc.
I would prefer the augmentations be handled outside of the NeuralNet object
I totally agree on this. Unfortunately, we don't have a good way of doing this yet.
because the training set can also be split into a validation set
Do you mean for the skorch-internal validation set? That should call get_dataset with training=False.
I was thinking about the line:
scores = cross_validate(net, X, y, scoring=('accuracy'), cv=k)
with the external skorch validation set. The get_dataset with training=False/True would handle training and the skorch-internal validation. I think there could also be preprocessing during test time, aka during predict. We can always keep it simple and assume that get_dataset(training=False) is the same for validation and testing (predict).
I think there could also be preprocessing during test time, aka during predict.
Yeah, we had the discussion at one point with @ottonemo about whether there should be a 3rd option, i.e. train vs validation vs predict. Can't find it right now. I believe it's not easy to implement, since 1) we use a boolean right now and 2) we mirror the training vs eval distinction that PyTorch uses as well.
We can always keep it simple and assume that get_dataset(training=False) is the same for validation and testing (predict).
At least to me, it doesn't sound like we would lose anything if we did.
I thought some more about this and I believe we don't need to adjust get_dataset since the same can essentially be achieved through get_iterator. So the code could be adjusted roughly like this:
train_dataset = MyDataset(csv_file='train.csv', transform=None)
val_dataset = MyDataset(csv_file='test.csv', transform=None)

class MyNet(NeuralNet):
    def get_iterator(self, dataset, training=False):
        if training:
            dataset.transform = train_transforms
        else:
            dataset.transform = val_transforms
        return super().get_iterator(dataset, training=training)
Maybe we can elaborate on this approach, making it more generic so that this is doable without overriding get_iterator.
The problem with implementing this on get_dataset would be that the skorch-internal train/valid split is performed after get_dataset is called, so any transforms attached during get_dataset would be used for both train and valid.
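A toy illustration of that pitfall, using plain Python stand-ins rather than skorch itself: `ToyDataset` and `IndexView` are hypothetical classes, with `IndexView` written to behave like an index-based subset that keeps a reference to its parent dataset. A transform attached to the parent before splitting is seen by every view derived from it.

```python
class ToyDataset:
    """Minimal map-style dataset with a mutable `transform` attribute."""

    def __init__(self, data, transform=None):
        self.data = data
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        x = self.data[i]
        return self.transform(x) if self.transform is not None else x


class IndexView:
    """Index-based view that keeps a reference to the parent dataset."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = indices

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        return self.dataset[self.indices[i]]


full = ToyDataset([1, 2, 3, 4])
full.transform = lambda x: -x  # attached before the split

# The split happens afterwards, so both views share the same transform.
train_view = IndexView(full, [0, 1, 2])
valid_view = IndexView(full, [3])

train_sample = train_view[0]  # transformed
valid_sample = valid_view[0]  # also transformed, which is the problem
```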
@iCHAIT did you make any progress on this?
@BenjaminBossan I haven't made any progress on this sadly.
Too bad. Did you try out my suggestion above? If it didn't work, please post the error and maybe we can help out.
I haven't tried that yet, I got pulled into different things and couldn't spend more time on this. Feel free to close this issue for now if you want. I shall give it a try when I have some more time on my hands.
Appreciate the help and the discussion on this :)
Okay, I'll close for now. Feel free to re-open if you tried something new or have further questions.
The way I ended up solving this in my case was to create an iterator factory function, something like this:
from copy import copy
import torch

def make_iterator(dataset, training, **kwargs):
    predict = getattr(dataset, 'y', object()) is None
    if not training:
        dataset = copy(dataset)
        dataset.transform = None
    return torch.utils.data.DataLoader(dataset, **kwargs)
This I would parameterize and use as both iterator_train and iterator_test, like so:
from functools import partial

net = NeuralNet(
    # stuff...,
    iterator_train=partial(make_iterator, training=True),
    iterator_test=partial(make_iterator, training=False),
    dataset=SomeDataSet(transform=MyTransform()),  # a DataSet with transforms
)
Daniel, thanks for posting your solution. Indeed, this is easier than overriding get_iterator as I had suggested.
predict = getattr(dataset, 'y', object()) is None
This line is unnecessary for the example, right?
dataset = copy(dataset)
Good addition, in my original example I mutated the dataset.
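To make the point about copying concrete, here is a minimal sketch with a hypothetical `ToyDataset` and a hypothetical helper `without_transform` (neither is part of skorch). It assumes the dataset stores its transform in a plain `transform` attribute, as in the examples above. A shallow copy shares the underlying data but gets its own attribute slots, so disabling the transform on the copy leaves the original untouched.

```python
from copy import copy


class ToyDataset:
    """Minimal map-style dataset with a `transform` attribute."""

    def __init__(self, data, transform=None):
        self.data = data
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        x = self.data[i]
        return self.transform(x) if self.transform is not None else x


def without_transform(dataset):
    """Return a shallow copy with the transform disabled.

    The original dataset object is not mutated.
    """
    clone = copy(dataset)
    clone.transform = None
    return clone


train_ds = ToyDataset([1, 2, 3], transform=lambda x: x * 10)
eval_ds = without_transform(train_ds)

train_first = train_ds[0]  # still goes through the transform
eval_first = eval_ds[0]    # raw sample
```

Because the copy is shallow, `eval_ds.data` and `train_ds.data` are the same list object, so no samples are duplicated.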
How can I do data augmentation for the training folds in each iteration of cross-validation?
Suppose I am doing a 5-fold CV. I want to apply data augmentation (rotation and horizontal flip) to the training folds only, evaluate on the validation fold without augmentation, and repeat this for each fold. How can I do this with skorch?
I found these 2 links that do data augmentation in skorch, but they are not performing cross-validation -