keras-team / keras-preprocessing

Utilities for working with image data, text data, and sequence data.

can now disable augmentation #256

Closed cbernet closed 4 years ago

cbernet commented 4 years ago

Summary

Hi, this PR makes it possible to disable data augmentation.

This feature can be needed for the validation subset for two reasons:

A new parameter has been added to the flow methods, apply_augmentation. This parameter is optional and set to True by default, so that the current interface and behaviour are preserved. It can affect either the full set, the validation subset, or the training subset, depending on the subset argument.

In issue #218 we discussed the possibility of introducing a parameter to disable augmentation for the validation set only. However, after a detailed look at the code, I concluded that it would be more logical to introduce this feature for both subsets in a generic way.

The main reason for this decision is that a parameter like disable_augmentation_for_validation would have been rather weird, and more difficult to explain in the documentation to the user.

This has the additional advantage of making it possible to disable augmentation completely for the training subset as well with a single line, e.g. when comparing augmentation vs no augmentation.
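The proposed interface could be exercised roughly as follows. Since the PR is not shown here in full, this is a toy stand-in that only mimics the semantics described above (an apply_augmentation flag on the flow methods, defaulting to True, interacting with subset); it is not the real keras-preprocessing ImageDataGenerator, and the split convention is a guess for illustration:

```python
# Toy stand-in for the proposed `apply_augmentation` semantics described
# in this PR. NOT the real ImageDataGenerator: the "augmentation" is a
# placeholder transform, and the subset split convention is assumed.

class ToyGenerator:
    def __init__(self, validation_split=0.0):
        self.validation_split = validation_split

    def flow(self, data, subset=None, apply_augmentation=True):
        # Select the requested subset (validation = first fraction here;
        # the real implementation's convention may differ).
        n_val = int(len(data) * self.validation_split)
        if subset == 'validation':
            data = data[:n_val]
        elif subset == 'training':
            data = data[n_val:]
        for x in data:
            # Placeholder "augmentation": shift the value so augmented
            # and raw outputs are easy to tell apart.
            yield (x + 1000) if apply_augmentation else x


gen = ToyGenerator(validation_split=0.25)
train = list(gen.flow(list(range(8)), subset='training'))  # augmented
val = list(gen.flow(list(range(8)), subset='validation',
                    apply_augmentation=False))             # raw
print(train)  # [1002, 1003, 1004, 1005, 1006, 1007]
print(val)    # [0, 1]
```

The point of the single-flag design is visible here: the same generator instance serves both subsets, and augmentation is switched off per flow call rather than per generator.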

Please let me know what you think, I'll be happy to iterate on this.

Thanks!

Related Issues

#218


cbernet commented 4 years ago

There might be something wrong with CI: the failing checks seem to have nothing to do with my changes.

Dref360 commented 4 years ago

This seems related to your changes: https://travis-ci.org/keras-team/keras-preprocessing/jobs/602635040#L1097

cbernet commented 4 years ago

@Dref360 thanks, I had missed that, and it made me realize that I need to ensure Python 2.x compatibility. This should be OK now; let's see.

Dref360 commented 4 years ago

Sorry for the unacceptable delay. I got caught up at work.

To disable augmentation can't we just make a new ImageDataGenerator with no augmentation?

In the example below, flow1 and flow2 will yield the same data, but flow2 applies no augmentation.

What do you think?

Example

import os

import numpy as np
from PIL import Image
from keras.preprocessing.image import ImageDataGenerator

pjoin = os.path.join

cls1_pt = 'tmp/cls1/cls1'
cls2_pt = 'tmp/cls1/cls2'
os.makedirs('tmp', exist_ok=True)
os.makedirs(cls1_pt, exist_ok=True)
os.makedirs(cls2_pt, exist_ok=True)

for i in range(100):
    img = Image.fromarray((np.ones([100, 100, 3]) * i % 255).astype(np.uint8))
    img.save(pjoin(cls1_pt, '{}.png'.format(i)))
    img.save(pjoin(cls2_pt, '{}.png'.format(i)))

flow1 = ImageDataGenerator(rotation_range=10, horizontal_flip=True).flow_from_directory('tmp/cls1',
                                                                                        seed=1337,
                                                                                        shuffle=False)
flow2 = ImageDataGenerator().flow_from_directory('tmp/cls1', seed=1337, shuffle=False)

assert flow1.filenames == flow2.filenames

cbernet commented 4 years ago

Hi! no worries, really.

In fact, we need a single ImageDataGenerator to be able to split the data into the training and validation subsets. For instance, in your example, all images appear in both of your flows, while we want to keep the training and validation subsets separated.

In principle, it's possible to make the solution you propose work if the user:

I think that's very error-prone, and in any case we need shuffling.

From what I've seen, what people do at the moment is split the validation and training datasets physically on disk, but that's very impractical. For instance, if you want to change the fraction of examples used for validation, you have to reorganize the data on disk accordingly.

Dref360 commented 4 years ago

I'm pretty sure this is deterministic. Do you have repro steps? I agree that this should be better documented. Below, I compare both the validation and training sets. Shuffle only affects batch ordering, I think: we sort the images before taking validation_split% of them, so the split should be deterministic.

import os

import numpy as np
from PIL import Image
from keras.preprocessing.image import ImageDataGenerator

pjoin = os.path.join

cls1_pt = 'tmp/cls1/cls1'
cls2_pt = 'tmp/cls1/cls2'
os.makedirs('tmp', exist_ok=True)
os.makedirs(cls1_pt, exist_ok=True)
os.makedirs(cls2_pt, exist_ok=True)

for i in range(100):
    img = Image.fromarray((np.ones([100, 100, 3]) * i % 255).astype(np.uint8))
    img.save(pjoin(cls1_pt, '{}.png'.format(i)))
    img.save(pjoin(cls2_pt, '{}.png'.format(i)))

for phase in ['training', 'validation']:
    flow1 = ImageDataGenerator(rescale=1 / 255.,
                               rotation_range=10,
                               horizontal_flip=True,
                               validation_split=0.7).flow_from_directory('tmp/cls1',
                                                                         shuffle=True,
                                                                         subset=phase)
    flow2 = ImageDataGenerator(rescale=1 / 255.,
                               validation_split=0.7).flow_from_directory('tmp/cls1',
                                                                         shuffle=True,
                                                                         subset=phase)

    assert flow1.filenames == flow2.filenames
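The determinism argument can also be sketched without keras at all: if the file list is sorted before the split, the training/validation partition depends only on the filenames and the split fraction, never on shuffling or augmentation settings. The following is my own simplification of the logic described in the comment above, not the actual keras-preprocessing code (which may place the validation fraction differently):

```python
# Sketch of the sort-then-split logic discussed above: sorting makes the
# partition a pure function of (filenames, validation_split, subset).

def split_filenames(filenames, validation_split, subset):
    ordered = sorted(filenames)                     # deterministic order
    n_val = int(len(ordered) * validation_split)    # first fraction = validation (assumed)
    if subset == 'validation':
        return ordered[:n_val]
    return ordered[n_val:]


files = ['7.png', '3.png', '1.png', '5.png']
# Two independent calls (standing in for two generators with different
# augmentation settings) agree on the partition:
print(split_filenames(files, 0.5, 'validation'))  # ['1.png', '3.png']
print(split_filenames(files, 0.5, 'training'))    # ['5.png', '7.png']
```

This is why the two flows in the snippet above can share filenames even though only one of them augments: augmentation happens after the partition is fixed.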

rragundez commented 4 years ago

I agree with @Dref360: what he proposes achieves the desired functionality, and I think it is more in line with the concept of what an ImageDataGenerator does. If you set up augmentation, it will augment images, regardless. Can we close this?

Dref360 commented 4 years ago

Thanks for your PR. Because there is an easy workaround, we will close this for now.