keras-team / keras-preprocessing

Utilities for working with image data, text data, and sequence data.

can now disable augmentation #256

Closed cbernet closed 4 years ago

cbernet commented 4 years ago

Summary

Hi, this PR makes it possible to disable data augmentation.

This feature can be needed for the validation subset for two reasons:

A new parameter has been added to the flow methods, apply_augmentation. This parameter is optional and set to True by default, so that the current interface and behaviour are preserved. It can affect either the full set, the validation subset, or the training subset, depending on the subset argument.

In issue #218 we discussed the possibility of introducing a parameter to disable augmentation for the validation set only. However, after a detailed look at the code, I concluded that it would be more logical to introduce this feature for both subsets in a generic way.

The main reason for this decision is that a parameter like disable_augmentation_for_validation would have been rather weird, and more difficult to explain in the documentation to the user.

This has the additional advantage of making it possible to disable augmentation completely for the training subset as well with a single line, e.g. when comparing augmentation vs no augmentation.
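The proposed interface could be exercised roughly as follows. Since the PR is not shown here in full, this is a toy stand-in that only mimics the semantics described above (an apply_augmentation flag on the flow methods, defaulting to True, interacting with subset); it is not the real keras-preprocessing ImageDataGenerator, and the split convention is a guess for illustration:

```python
# Toy stand-in for the proposed `apply_augmentation` semantics described
# in this PR. NOT the real ImageDataGenerator: the "augmentation" is a
# placeholder transform, and the subset split convention is assumed.

class ToyGenerator:
    def __init__(self, validation_split=0.0):
        self.validation_split = validation_split

    def flow(self, data, subset=None, apply_augmentation=True):
        # Select the requested subset (validation = first fraction here;
        # the real implementation's convention may differ).
        n_val = int(len(data) * self.validation_split)
        if subset == 'validation':
            data = data[:n_val]
        elif subset == 'training':
            data = data[n_val:]
        for x in data:
            # Placeholder "augmentation": shift the value so augmented
            # and raw outputs are easy to tell apart.
            yield (x + 1000) if apply_augmentation else x


gen = ToyGenerator(validation_split=0.25)
train = list(gen.flow(list(range(8)), subset='training'))  # augmented
val = list(gen.flow(list(range(8)), subset='validation',
                    apply_augmentation=False))             # raw
print(train)  # [1002, 1003, 1004, 1005, 1006, 1007]
print(val)    # [0, 1]
```

The point of the single-flag design is visible here: the same generator instance serves both subsets, and augmentation is switched off per flow call rather than per generator.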

Please let me know what you think, I'll be happy to iterate on this.

Thanks!

Related Issues

#218


cbernet commented 4 years ago

There might be something wrong with CI: the failing checks seem to have nothing to do with my changes.

Dref360 commented 4 years ago

This seems related to your changes: https://travis-ci.org/keras-team/keras-preprocessing/jobs/602635040#L1097

cbernet commented 4 years ago

@Dref360 thanks, I had missed that, and it made me realize that I need to ensure Python 2.x compatibility. This should be OK now; let's see.

Dref360 commented 4 years ago

Sorry for the unacceptable delay. I got caught up at work.

To disable augmentation can't we just make a new ImageDataGenerator with no augmentation?

In the example below, flow1 and flow2 will yield the same data, but flow2 applies no augmentation.

What do you think?

Example

import os

import numpy as np
from PIL import Image
from keras.preprocessing.image import ImageDataGenerator

pjoin = os.path.join

cls1_pt = 'tmp/cls1/cls1'
cls2_pt = 'tmp/cls1/cls2'
os.makedirs('tmp', exist_ok=True)
os.makedirs(cls1_pt, exist_ok=True)
os.makedirs(cls2_pt, exist_ok=True)

for i in range(100):
    img = Image.fromarray((np.ones([100, 100, 3]) * i % 255).astype(np.uint8))
    img.save(pjoin(cls1_pt, '{}.png'.format(i)))
    img.save(pjoin(cls2_pt, '{}.png'.format(i)))

flow1 = ImageDataGenerator(rotation_range=10, horizontal_flip=True).flow_from_directory('tmp/cls1',
                                                                                        seed=1337,
                                                                                        shuffle=False)
flow2 = ImageDataGenerator().flow_from_directory('tmp/cls1', seed=1337, shuffle=False)

assert flow1.filenames == flow2.filenames

cbernet commented 4 years ago

Hi! no worries, really.

In fact, we need a single ImageDataGenerator to be able to split the data into the training and validation subsets. For instance, in your example, all images appear in both of your flows, while we want to keep the training and validation subsets separated.

In principle, it's possible to make the solution you propose work if the user:

I think that's very error-prone, and in any case we need shuffling.

From what I've seen, what people do at the moment is split the validation and training datasets physically on disk, but that's very impractical. For instance, if you want to change the fraction of examples used for validation, you have to reorganize the data on disk accordingly.

Dref360 commented 4 years ago

I'm pretty sure this is deterministic. Do you have repro steps? I agree that this should be better documented. Below, I compare both the validation and training sets. Shuffle only affects batch ordering, I think: we sort the images before taking validation_split% of them, so the split should be deterministic.

import os

import numpy as np
from PIL import Image
from keras.preprocessing.image import ImageDataGenerator

pjoin = os.path.join

cls1_pt = 'tmp/cls1/cls1'
cls2_pt = 'tmp/cls1/cls2'
os.makedirs('tmp', exist_ok=True)
os.makedirs(cls1_pt, exist_ok=True)
os.makedirs(cls2_pt, exist_ok=True)

for i in range(100):
    img = Image.fromarray((np.ones([100, 100, 3]) * i % 255).astype(np.uint8))
    img.save(pjoin(cls1_pt, '{}.png'.format(i)))
    img.save(pjoin(cls2_pt, '{}.png'.format(i)))

for phase in ['training', 'validation']:
    flow1 = ImageDataGenerator(rescale=1 / 255.,
                               rotation_range=10,
                               horizontal_flip=True,
                               validation_split=0.7).flow_from_directory('tmp/cls1',
                                                                         shuffle=True,
                                                                         subset=phase)
    flow2 = ImageDataGenerator(rescale=1 / 255.,
                               validation_split=0.7).flow_from_directory('tmp/cls1',
                                                                         shuffle=True,
                                                                         subset=phase)

    assert flow1.filenames == flow2.filenames
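The determinism argument can also be sketched without keras at all: if the file list is sorted before the split, the training/validation partition depends only on the filenames and the split fraction, never on shuffling or augmentation settings. The following is my own simplification of the logic described in the comment above, not the actual keras-preprocessing code (which may place the validation fraction differently):

```python
# Sketch of the sort-then-split logic discussed above: sorting makes the
# partition a pure function of (filenames, validation_split, subset).

def split_filenames(filenames, validation_split, subset):
    ordered = sorted(filenames)                     # deterministic order
    n_val = int(len(ordered) * validation_split)    # first fraction = validation (assumed)
    if subset == 'validation':
        return ordered[:n_val]
    return ordered[n_val:]


files = ['7.png', '3.png', '1.png', '5.png']
# Two independent calls (standing in for two generators with different
# augmentation settings) agree on the partition:
print(split_filenames(files, 0.5, 'validation'))  # ['1.png', '3.png']
print(split_filenames(files, 0.5, 'training'))    # ['5.png', '7.png']
```

This is why the two flows in the snippet above can share filenames even though only one of them augments: augmentation happens after the partition is fixed.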

rragundez commented 4 years ago

I agree with @Dref360: what he proposes achieves the desired functionality, and I think it is more in line with the concept of what an ImageDataGenerator does. If you set up augmentation, it will augment images, regardless. Can we close this?

Dref360 commented 4 years ago

Thanks for your PR. Because there is an easy workaround, we will close this for now.