keras-team / keras


Problem with ImageDataGenerator flow_from_directory and categorical classification #11211

Closed - drscarlat closed this issue 3 years ago

drscarlat commented 6 years ago

I am trying to classify the Kaggle 10k photos of dogs into 120 breeds. Using flow_from_directory with separate training and validation directories, I find the validation accuracy stuck at roughly 1/120 ≈ 0.0083... basically random prediction.

SO: https://stackoverflow.com/questions/52251199/low-validation-accuracy-with-good-training-accuracy-keras-imagedatagenerator-f

I've tried various learning rates, convolutional models, and batch sizes, and even swapped the training and validation directories - the training accuracy always improves as expected while the validation accuracy stays low and fixed.

When I'm not using fit_generator and flow_from_directory (just fit and flow), the issue doesn't occur. But without the generator I cannot use data augmentation, a good way to fight overfitting.
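For reference, the kind of real-time augmentation I mean looks roughly like this (a sketch with placeholder transform values, not my exact settings):

from keras.preprocessing.image import ImageDataGenerator

# Training generator with real-time augmentation; validation should only rescale.
train_datagen = ImageDataGenerator(
    rescale=1./255,          # normalize pixels to [0, 1]
    rotation_range=20,       # random rotations up to 20 degrees
    width_shift_range=0.1,   # random horizontal shifts
    height_shift_range=0.1,  # random vertical shifts
    horizontal_flip=True)    # random left-right flips
test_datagen = ImageDataGenerator(rescale=1./255)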


drscarlat commented 6 years ago

I am trying to classify the Kaggle 10k dog images into 120 breeds using Keras and ResNet50. Due to memory constraints on Kaggle (14 GB RAM), I have to use ImageDataGenerator, which feeds the images to the model and also allows data augmentation - in real time.

The base convolutional ResNet50 model:

conv_base = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

My model:

model = models.Sequential()
model.add(conv_base)
model.add(layers.Flatten())
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(120, activation='softmax'))

Making sure that only my last added layers are trainable - so the original ResNet50 weights will not be modified during training - and compiling the model:

conv_base.trainable = False
model.compile(optimizer=optimizers.Adam(), loss='categorical_crossentropy', metrics=['accuracy'])

Num trainable weights BEFORE freezing the conv base: 216
Num trainable weights AFTER freezing the conv base: 4

And the final model summary:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
resnet50 (Model)             (None, 1, 1, 2048)        23587712  
_________________________________________________________________
flatten_1 (Flatten)          (None, 2048)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               524544    
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 120)               30840     
=================================================================
Total params: 24,143,096
Trainable params: 555,384
Non-trainable params: 23,587,712
_________________________________________________________________
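A minimal sketch of how those trainable-weight counts can be reproduced (note that len(model.trainable_weights) counts weight tensors, not parameters, and that the model must be compiled after changing trainable):

print('Before freezing:', len(model.trainable_weights))   # 216 weight tensors
conv_base.trainable = False
print('After freezing:', len(model.trainable_weights))    # 4: kernel + bias of the two Dense layers
model.compile(optimizer=optimizers.Adam(), loss='categorical_crossentropy', metrics=['accuracy'])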

The train and validation directories each have 120 subdirectories - one per dog breed - containing the images. Keras is supposed to use these directories to derive the correct label for each image: an image from a "beagle" subdirectory is labeled automatically - no need for one-hot encoding or anything like that.

train_dir = '../input/dogs-separated/train_dir/train_dir/'
validation_dir = '../input/dogs-separated/validation_dir/validation_dir/'
train_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
    train_dir, target_size=(224, 224), batch_size=20, shuffle=True)
validation_generator = test_datagen.flow_from_directory(
    validation_dir, target_size=(224, 224), batch_size=20, shuffle=True)

Found 8185 images belonging to 120 classes.
Found 2037 images belonging to 120 classes.
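A quick sanity check (a sketch) that both generators map each breed to the same class index:

# Both generators must agree on the breed -> index mapping.
assert train_generator.class_indices == validation_generator.class_indices
print(len(train_generator.class_indices))  # 120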

Just to make sure these classes are right and in the same order, I compared train_generator.class_indices with validation_generator.class_indices (as in the check above) - they are identical. Train the model:

history = model.fit_generator(train_generator,
    steps_per_epoch=8185 // 20, epochs=10,
    validation_data=validation_generator,
    validation_steps=2037 // 20)
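A side note on the step counts (my observation, not necessarily the root cause here): with floor division the last partial batch is never drawn, so rounding up covers every image:

import math

# 2037 // 20 = 101 steps -> 101 * 20 = 2020 images; the last 17 are never evaluated.
steps_per_epoch = math.ceil(8185 / 20)    # 410 instead of 409
validation_steps = math.ceil(2037 / 20)   # 102 instead of 101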

Note in the charts below that while training accuracy improves as expected, validation accuracy quickly settles around 0.008, which is 1/120... RANDOM prediction?!

(Screenshots: training and validation accuracy charts, 2018-09-23.)
drscarlat commented 6 years ago

Oops... closed the issue by mistake.

drscarlat commented 6 years ago

I've played with the batch size and found that 120 (the number of subdirectories in both the train and validation directories) eliminates the above issue. Now I can happily employ data augmentation techniques without crashing my Kaggle kernel on memory issues. Still, I wonder...

How does the Keras ImageDataGenerator sample images from a directory - depth-wise or breadth-wise?

If depth-wise, then with a batch size of 20 it would FIRST go through the first directory of, say, 100 photos (in five batches), then move to the next directory and do it in batches of 20, then the next, and so on. Or is it breadth-wise - the initial batch of 20 would be one photo from each of the first 20 directories, then the next 20 directories? I couldn't find in the documentation how the Keras ImageDataGenerator handles batches when used with flow_from_directory and fit_generator.
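One way I could check this empirically (a sketch; filenames and classes are attributes of the iterator returned by flow_from_directory):

# With shuffle=False the files are enumerated class by class ("depth-wise");
# with shuffle=True the indices are reshuffled every epoch, so each batch is
# a random draw across all classes.
probe = train_datagen.flow_from_directory(
    train_dir, target_size=(224, 224), batch_size=20, shuffle=False)
print(probe.filenames[:3])   # paths relative to train_dir, grouped by class
print(probe.classes[:40])    # the class index of each file, in enumeration order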

drscarlat commented 6 years ago

Keras on Kaggle was updated to 2.2.2 - maybe that helped solve the issue?

SubmitCode commented 5 years ago

@drscarlat I have exactly the same problem, but I still don't know why it occurs.

fariagu commented 5 years ago

@drscarlat I think I'm having the same problem described here. Did you fix it solely by increasing the batch size? You mentioned 120, but what values were you using before that? Just as a frame of reference, to see by how much you increased it.

I can in fact see some benefit from that: my validation accuracy was 0.0000 (which, as you can tell, is even worse than random) with a batch size of 32 or 64. A batch size of 512, however, makes the validation accuracy stagnate at 0.0096 after about 10 epochs (which in my case is just about random), but I can't get any further than that with an ImageDataGenerator, even when I'm not doing any transformations on the dataset.

akihiro-inui commented 5 years ago

Are there any updates on this problem? I have exactly the same issue.