Model fails to train with Linux and Keras 3.3.2

jonbry commented 6 months ago

The following code from Deep Learning with Python, Second Edition fails to train when using Keras 3.3.2 and TensorFlow 2.16.1 on a Linux machine (Ubuntu 20.04):

import keras
from keras import layers

import pathlib
from keras.utils import image_dataset_from_directory

new_base_dir = pathlib.Path("cats_vs_dogs_small")

train_dataset = image_dataset_from_directory(
    new_base_dir / "train",
    image_size=(180, 180),
    batch_size=32)
validation_dataset = image_dataset_from_directory(
    new_base_dir / "validation",
    image_size=(180, 180),
    batch_size=32)
test_dataset = image_dataset_from_directory(
    new_base_dir / "test",
    image_size=(180, 180),
    batch_size=32)

data_augmentation = keras.Sequential(
    [
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.1),
        layers.RandomZoom(0.2),
    ]
)

inputs = keras.Input(shape=(180, 180, 3))
x = data_augmentation(inputs)

x = layers.Rescaling(1./255)(x)
x = layers.Conv2D(filters=32, kernel_size=5, use_bias=False)(x)

for size in [32, 64, 128, 256, 512]:
    residual = x

    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.SeparableConv2D(size, 3, padding="same", use_bias=False)(x)

    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.SeparableConv2D(size, 3, padding="same", use_bias=False)(x)

    x = layers.MaxPooling2D(3, strides=2, padding="same")(x)

    residual = layers.Conv2D(
        size, 1, strides=2, padding="same", use_bias=False)(residual)
    x = layers.add([x, residual])

x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=outputs)

model.compile(loss="binary_crossentropy",
              optimizer="rmsprop",
              metrics=["accuracy"])

history = model.fit(
    train_dataset,
    epochs=100,
    validation_data=validation_dataset)

The accuracy over 100 epochs hovers around 50% (plot: mini_xception_keras3_linux).
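To put a number on "hovers around 50%", one option is to average the last few entries of `history.history["accuracy"]` and compare the result with chance level for a balanced two-class problem. The sketch below is illustrative only: the helper `looks_stuck_at_chance`, its `window` and `tolerance` parameters, and the sample accuracy values are all hypothetical and are not taken from the run described above.

```python
def looks_stuck_at_chance(accuracies, chance=0.5, window=10, tolerance=0.05):
    """Return True if the mean of the last `window` accuracies lies
    within `tolerance` of `chance`."""
    if not accuracies:
        raise ValueError("accuracies must not be empty")
    tail = accuracies[-window:]
    mean_tail = sum(tail) / len(tail)
    return abs(mean_tail - chance) <= tolerance


# Made-up values for illustration only.
# With a real run, pass history.history["accuracy"] instead.
stuck_run = [0.49, 0.51, 0.50, 0.52, 0.48, 0.50]
healthy_run = [0.55, 0.63, 0.71, 0.78, 0.84, 0.88]

print(looks_stuck_at_chance(stuck_run))
print(looks_stuck_at_chance(healthy_run))
```

The default tolerance of 0.05 is an arbitrary choice; a tighter or looser band may suit other datasets.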

The same results were reproduced on different Linux machines, regardless of whether the code ran on the GPU or the CPU, and also with the JAX backend.
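For anyone repeating the backend comparison: Keras 3 picks its backend from the `KERAS_BACKEND` environment variable, which has to be set before the first `import keras` in the process. A minimal sketch, assuming the `jax` package is installed alongside Keras:

```python
import os

# Keras 3 reads KERAS_BACKEND at import time, so set it before
# the first `import keras` in this process.
os.environ["KERAS_BACKEND"] = "jax"

try:
    import keras
    print("Active backend:", keras.backend.backend())
except ImportError as exc:
    # Raised if keras, or the package for the selected backend, is missing.
    print("Could not import keras with this backend:", exc)
```

The rest of the script can stay unchanged. `image_dataset_from_directory` still relies on TensorFlow for the input pipeline, so TensorFlow needs to be installed even when training runs on JAX.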

What is strange about this issue is that the model trains successfully with the following configurations:

Linux with Keras 2.15 and TensorFlow 2.15
M1 Mac with Keras 3.0.5 and TensorFlow 2.16.1

Any advice on what may be causing the issue? Let me know if there is any information that I can provide to help troubleshoot the issue.

Thank you!

fchollet commented 6 months ago

> Any advice on what may be causing the issue? Let me know if there is any information that I can provide to help troubleshoot the issue.

This code is known to work, so it's likely a bad initialization. Some common steps you can take:

Just restart and try again (with a different random seed)
Lower the learning rate by 2x
Lower dropout rate (0.5 -> 0.25)
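Applied to the script above, the three suggestions might look like the following sketch. The seed value is arbitrary, and the halved learning rate assumes the default RMSprop learning rate of 0.001 that the string `"rmsprop"` resolves to in Keras.

```python
import keras

SEED = 1337                 # hypothetical value; any integer works
DEFAULT_RMSPROP_LR = 1e-3   # default learning rate of keras.optimizers.RMSprop
HALVED_LR = DEFAULT_RMSPROP_LR / 2

# 1. Re-seed Python, NumPy and the backend so a rerun starts from
#    a different initialization.
keras.utils.set_random_seed(SEED)

# 2. Use an explicit optimizer object instead of the "rmsprop" string
#    so the learning rate can be lowered by 2x.
optimizer = keras.optimizers.RMSprop(learning_rate=HALVED_LR)

# 3. Lower the dropout rate from 0.5 to 0.25.
dropout = keras.layers.Dropout(0.25)
```

In the model definition, `layers.Dropout(0.5)(x)` would become `dropout(x)`, and `model.compile` would take `optimizer=optimizer` in place of `optimizer="rmsprop"`.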

t-kalinowski commented 6 months ago

@fchollet I am able to reproduce this. I haven't had a chance to dig into the root cause yet, but I can confirm that this is a bug in Keras 3; the same code produces a model that trains just fine w/ TF 2.15 + Keras 2.

fchollet commented 6 months ago

Looking into it.

fchollet commented 6 months ago

I have fixed a related issue with dataset shuffling. Can you try installing v3.3.3 and checking if your code works with that version?
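The thread does not say what the shuffling issue was, so the following is not a reproduction of the bug. It is only a generic sanity check one could run on any labelled dataset: if batches arrive in directory order, most of them contain a single class, which can stall training on a two-class problem. The helper `mixed_batch_fraction` and the sample label batches are hypothetical.

```python
def mixed_batch_fraction(label_batches):
    """Return the fraction of batches that contain more than one distinct label."""
    if not label_batches:
        raise ValueError("label_batches must not be empty")
    mixed = sum(1 for labels in label_batches if len(set(labels)) > 1)
    return mixed / len(label_batches)


# Made-up label batches for illustration only.
# With the datasets from the original post, labels could be collected with:
#   batches = [labels.numpy().tolist() for _, labels in train_dataset.take(10)]
sorted_batches = [[0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1]]
shuffled_batches = [[0, 1, 1, 0], [1, 0, 0, 1], [0, 0, 1, 1], [1, 1, 1, 0]]

print(mixed_batch_fraction(sorted_batches))
print(mixed_batch_fraction(shuffled_batches))
```

A value near 0 suggests the data is not being shuffled across classes; a value near 1 is what a well-shuffled balanced dataset would normally give.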

t-kalinowski commented 6 months ago

Thanks! Looks like it's fixed now. I can confirm the model trains fine with Keras v3.3.3
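To confirm that an environment is really running the release with the fix, the installed version can be read from package metadata and compared against 3.3.3. This is a small stdlib-only sketch; the helper names are hypothetical, and the parser deliberately ignores pre-release suffixes such as `rc1`.

```python
import re
from importlib import metadata

FIXED_IN = "3.3.3"  # release reported in this thread as fixing the issue


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple,
    e.g. '3.3.3' -> (3, 3, 3). Any suffix after the numbers is ignored."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def at_least(installed, required):
    """Return True if `installed` is the same as or newer than `required`."""
    return parse_version(installed) >= parse_version(required)


def installed_keras_version():
    """Return the installed keras version string, or None if keras is absent."""
    try:
        return metadata.version("keras")
    except metadata.PackageNotFoundError:
        return None


version = installed_keras_version()
if version is None:
    print("keras is not installed")
elif at_least(version, FIXED_IN):
    print(f"keras {version} includes the v{FIXED_IN} fix")
else:
    print(f"keras {version} predates v{FIXED_IN}; consider upgrading")
```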

jonbry commented 6 months ago

Looks like v3.3.3 fixed the issue. Thanks for all of your help!

t-kalinowski commented 6 months ago

By the way, I just noticed that the GitHub release tagged v3.3.3 has a typo in its title ("Kears" instead of "Keras"): Kears 3.3.3

Maybe this is the reason v3.3.2 is still listed as the "latest release" on the repo landing page?

sachinprasadhs commented 6 months ago

@t-kalinowski, I just updated the latest release tag on the landing page.

keras-team / keras

Model fails to train with Linux and Keras 3.3.2 #19623