mrdbourke / tensorflow-deep-learning

All course materials for the Zero to Mastery Deep Learning with TensorFlow course.
https://dbourke.link/ZTMTFcourse
MIT License

Update ImageDataGenerator to image_dataset_from_directory #368

Closed nazmi closed 2 years ago

nazmi commented 2 years ago

This will save students a great deal of time otherwise spent waiting for the kernel to finish a run, and as a bonus it teaches them how to build a better input pipeline. Thanks to the "documentation reading" skills you teach, I have learned to improve things, because I hate waiting.

Current behaviour

Your code in 03_convolutional_neural_networks_in_tensorflow uses ImageDataGenerator. As you said, this method adds delay to every epoch because it augments the data on the fly as each batch is loaded.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation is applied on the CPU as each batch is loaded, every epoch
train_datagen_augmented = ImageDataGenerator(rescale=1/255.,
                                             rotation_range=20,
                                             shear_range=0.2,
                                             zoom_range=0.2,
                                             width_shift_range=0.2,
                                             height_shift_range=0.2,
                                             horizontal_flip=True)

train_data_augmented = train_datagen_augmented.flow_from_directory(train_dir,
                                                                   target_size=(224, 224),
                                                                   batch_size=32,
                                                                   class_mode='categorical')

Suggestion

Since tf.keras.preprocessing is marked as deprecated (but still usable), maybe we can use tf.keras.utils.image_dataset_from_directory instead, so as to take advantage of the tf.data.Dataset API.


import tensorflow as tf
from tensorflow.keras import layers, models

tf.random.set_seed(42)

# Load the images straight into a batched tf.data.Dataset
train_augmented = tf.keras.utils.image_dataset_from_directory(
    train_dir,
    label_mode='categorical',
    image_size=(256, 256),
    batch_size=32,
    seed=42
)

# Rescaling + augmentation as Keras preprocessing layers
preprocessing_layer = models.Sequential([
    layers.Rescaling(1./255),
    layers.RandomFlip(mode="horizontal"),
    layers.RandomRotation((0.0, 0.01)),
    layers.RandomZoom((0.0, 0.1)),
    layers.RandomTranslation((0.0, 0.1), (0.0, 0.1))
])

train_augmented = (
    train_augmented
    .cache()  # cache the decoded images *before* augmenting, so each epoch sees fresh random augmentations
    .map(
        lambda x, y: (preprocessing_layer(x, training=True), y),  # training=True keeps the random layers active
        num_parallel_calls=tf.data.AUTOTUNE
    )
    .prefetch(buffer_size=tf.data.AUTOTUNE)
)
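
For completeness, here is a minimal sketch of how the pipelined dataset plugs into training. The model below is a hypothetical stand-in for illustration (not the notebook's architecture), and num_classes is an assumption:

num_classes = 10  # assumption for illustration

model = models.Sequential([
    layers.Conv2D(10, 3, activation="relu", input_shape=(256, 256, 3)),
    layers.MaxPool2D(),
    layers.Conv2D(10, 3, activation="relu"),
    layers.MaxPool2D(),
    layers.Flatten(),
    layers.Dense(num_classes, activation="softmax")
])

model.compile(loss="categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])

# The dataset already yields batched (image, one-hot label) tensors,
# so it can be passed to fit() directly, with no generator-specific arguments
model.fit(train_augmented, epochs=5)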

Benchmark

Tested on a multiclass classification problem with 20 classes, without sampling: 15,000 training files and 5,000 validation files. Although this dataset is not the same as the notebook example, it shows that smaller datasets can benefit as well.

ImageDataGenerator takes 11 minutes 19.4 seconds; image_dataset_from_directory takes 1 minute 37.5 seconds, roughly a 6.97x speedup over the original pipeline.
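
For anyone who wants to reproduce this kind of comparison, here is a rough timing sketch, assuming train_data_augmented and train_augmented are defined as above (time_one_epoch is a hypothetical helper, not from the course materials):

import time

def time_one_epoch(data, num_batches, name):
    # Consume one epoch's worth of batches and report the wall-clock time
    start = time.perf_counter()
    for i, _ in enumerate(data):
        if i + 1 >= num_batches:  # flow_from_directory loops forever, so stop after one epoch
            break
    print(f"{name}: {time.perf_counter() - start:.1f} s")

# Both objects support len() (number of batches per epoch)
time_one_epoch(train_data_augmented, len(train_data_augmented), "ImageDataGenerator")
time_one_epoch(train_augmented, len(train_augmented), "image_dataset_from_directory")

Note that the first pass over the cached tf.data pipeline includes the one-off cost of filling the cache; subsequent epochs should be faster still.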

System information

nazmi commented 2 years ago

Finished reading 05_finetuning and just realised you already use it there. I guess I got curious too early.

mrdbourke commented 2 years ago

@nazmi thank you for the suggestion :)

I'm glad you found the updated data loading in the later notebook; I was going to comment something similar.

It's good to be curious!