Closed cbernet closed 4 years ago
There might be something wrong with CI: the checks that do not pass seem to have nothing to do with my changes.
This seems related to your changes: https://travis-ci.org/keras-team/keras-preprocessing/jobs/602635040#L1097
@Dref360 thanks, I had missed that, and this made me realize that I need to ensure python 2.X compatibility. This should be ok now, let's see.
Sorry for the unacceptable delay. I got caught up at work.
To disable augmentation, can't we just make a new ImageDataGenerator with no augmentation?
In the example below, flow1 and flow2 iterate over the same data, but flow2 applies no augmentation.
What do you think?
```python
import os

import numpy as np
from PIL import Image
from keras.preprocessing.image import ImageDataGenerator

pjoin = os.path.join
cls1_pt = 'tmp/cls1/cls1'
cls2_pt = 'tmp/cls1/cls2'
os.makedirs('tmp', exist_ok=True)
os.makedirs(cls1_pt, exist_ok=True)
os.makedirs(cls2_pt, exist_ok=True)
for i in range(100):
    img = Image.fromarray((np.ones([100, 100, 3]) * i % 255).astype(np.uint8))
    img.save(pjoin(cls1_pt, '{}.png'.format(i)))
    img.save(pjoin(cls2_pt, '{}.png'.format(i)))

# flow1 augments, flow2 does not; same seed and shuffle=False keep them aligned.
flow1 = ImageDataGenerator(rotation_range=10,
                           horizontal_flip=True).flow_from_directory('tmp/cls1',
                                                                     seed=1337,
                                                                     shuffle=False)
flow2 = ImageDataGenerator().flow_from_directory('tmp/cls1', seed=1337, shuffle=False)
assert flow1.filenames == flow2.filenames
```
Hi! no worries, really.
In fact, we need a single ImageDataGenerator to be able to split the data into the training and validation subsets. For instance, in your example, all images appear in both of your flows, while we want to keep the training and validation subsets separated.
In principle, it's possible to make the solution you propose work if the user:
I think that's very error-prone, and in any case we need shuffling.
From what I've seen, what people currently do is split the validation and training datasets physically on disk, but that's very impractical. For instance, if you want to change the fraction of examples used for validation, you have to reorganize the data on disk accordingly.
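By contrast, an index-based split makes the validation fraction a single number rather than a directory layout. Here is a minimal sketch of that idea; `split_files` is a hypothetical helper for illustration, not the actual keras-preprocessing code:

```python
def split_files(filenames, validation_split):
    """Deterministically partition filenames into (training, validation).

    Sort once for a stable order, then slice by fraction. Changing the
    validation fraction is a one-argument change; no files move on disk.
    """
    ordered = sorted(filenames)
    n_valid = int(len(ordered) * validation_split)
    return ordered[n_valid:], ordered[:n_valid]  # (training, validation)


files = ['{}.png'.format(i) for i in range(10)]
train, valid = split_files(files, 0.3)
assert len(valid) == 3 and len(train) == 7
assert set(train).isdisjoint(valid)
```

Whether validation takes the first or the last slice is an arbitrary choice here; the point is that the partition is reproducible and controlled by one parameter.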
I'm pretty sure this is deterministic? Do you have repro steps? I agree that this should be better documented.
Below, I compare both the validation and the training set. Shuffle is only used for batch ordering, I think? We sort the images before taking validation_split % of them, so I guess this is deterministic?
```python
import os

import numpy as np
from PIL import Image
from keras.preprocessing.image import ImageDataGenerator

pjoin = os.path.join
cls1_pt = 'tmp/cls1/cls1'
cls2_pt = 'tmp/cls1/cls2'
os.makedirs('tmp', exist_ok=True)
os.makedirs(cls1_pt, exist_ok=True)
os.makedirs(cls2_pt, exist_ok=True)
for i in range(100):
    img = Image.fromarray((np.ones([100, 100, 3]) * i % 255).astype(np.uint8))
    img.save(pjoin(cls1_pt, '{}.png'.format(i)))
    img.save(pjoin(cls2_pt, '{}.png'.format(i)))

# Both generators use shuffle=True; the subsets should nevertheless match,
# because the split is computed on the sorted file list before shuffling.
for phase in ['training', 'validation']:
    flow1 = ImageDataGenerator(rescale=1 / 255.,
                               rotation_range=10,
                               horizontal_flip=True,
                               validation_split=0.7).flow_from_directory(
                                   'tmp/cls1', shuffle=True, subset=phase)
    flow2 = ImageDataGenerator(rescale=1 / 255.,
                               validation_split=0.7).flow_from_directory(
                                   'tmp/cls1', shuffle=True, subset=phase)
    assert flow1.filenames == flow2.filenames
```
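The reason shuffling cannot change which files land in which subset can be sketched in a few lines. This is a simplified model of the behaviour described above, not the actual keras-preprocessing code, and `subset` is a hypothetical helper:

```python
import random


def subset(filenames, validation_split, phase):
    """Return the requested subset: sort first, then slice by fraction."""
    ordered = sorted(filenames)              # deterministic order, set first
    split_at = int(len(ordered) * validation_split)
    return ordered[:split_at] if phase == 'validation' else ordered[split_at:]


files = ['{}.png'.format(i) for i in range(100)]
for phase in ['training', 'validation']:
    a = subset(files, 0.7, phase)
    b = subset(files, 0.7, phase)
    random.shuffle(a)                        # shuffling only reorders batches
    assert set(a) == set(b)                  # subset membership is unchanged
```

Since the partition happens before any shuffling, shuffle only affects the order in which a subset's images are served, never which images belong to it.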
I agree with @Dref360; what he proposes will achieve the desired functionality, and I think it is more in line with the concept of what an ImageDataGenerator does. If you set augmentation, it will augment images, regardless.
Can we close this?
Thanks for your PR. Because there is an easy workaround, we will close this for now.
Summary
Hi, this PR makes it possible to disable data augmentation.
This feature can be useful for the validation subset for two reasons:
A new parameter, apply_augmentation, has been added to the flow methods. This parameter is optional and set to True by default, so that the current interface and behaviour are preserved. It can affect either the full set, the validation subset, or the training subset, depending on the subset argument.
In issue #218 we discussed the possibility of introducing a parameter to disable augmentation for the validation set only. However, after a detailed look at the code, I concluded that it would be more logical to introduce this feature for both subsets in a generic way.
The main reason for this decision is that a parameter like disable_augmentation_for_validation would have been rather awkward, and more difficult to explain to the user in the documentation. This also has the additional advantage of making it possible to disable augmentation completely for the training subset as well with a single line, e.g. when comparing augmentation vs no augmentation.
Please let me know what you think, I'll be happy to iterate on this.
Thanks!
Related Issues
#218
PR Overview