keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Early-stopping does not work properly in Keras 3 when used in a for loop #20256

Open Senantq opened 2 months ago

Senantq commented 2 months ago

Hello, I am using Keras 3.5 with TF 2.17. My code is more or less the following (it is not a grid search; in the real code I also increment some other variables that are not directly linked to the network):


import gc
import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, BatchNormalization, Flatten, Dense
from tensorflow.keras.regularizers import l1
from tensorflow.keras.callbacks import EarlyStopping

def create_conv(nb_cc_value, l1_value):
    model = Sequential()
    model.add(tensorflow.keras.layers.RandomFlip(mode="horizontal"))
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_regularizer=l1(l1_value)))
    model.add(MaxPool2D())
    model.add(BatchNormalization())
    model.add(Conv2D(64, (3, 3), activation='relu', kernel_regularizer=l1(l1_value)))
    model.add(MaxPool2D())
    model.add(BatchNormalization())
    model.add(Conv2D(512, (3, 3), activation='relu', kernel_regularizer=l1(l1_value)))
    model.add(MaxPool2D())
    model.add(BatchNormalization())
    model.add(Conv2D(1024, (3, 3), activation='relu', kernel_regularizer=l1(l1_value)))
    model.add(BatchNormalization())
    model.add(MaxPool2D())
    model.add(Conv2D(2048, (3, 3), activation='relu', kernel_regularizer=l1(l1_value)))
    model.add(Flatten())
    model.add(BatchNormalization())
    model.add(Dense(nb_cc_value, activation='relu', kernel_regularizer=l1(l1_value)))
    model.add(Dense(56, activation='sigmoid'))
    model.build((None, 150, 150, 1))

    lr_schedule = tensorflow.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=0.01, decay_steps=10000, decay_rate=0.7, staircase=False)
    optimizer = tensorflow.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)
    model.compile(loss=['mse'], optimizer=optimizer, metrics=['mse'])
    return model

# %%--------------------------------------------------Initialization
early_stopping = EarlyStopping(monitor='val_mse', min_delta=0.001, patience=5, restore_best_weights=True)

nb_cc = [2, 6, 12, 102, 302, 602]
l1_values = [2.220446049250313e-16, 0.0000001, 0.0001]

for nb_cc_value in nb_cc:
    for l1_value in l1_values:
        for run in range(1, 3):
            model = create_conv(nb_cc_value, l1_value)
            # 'epoques', the datasets and the *_dict/prediction variables are defined elsewhere in the full script
            history = model.fit(X_train, y_train, epochs=epoques, callbacks=[early_stopping],
                                validation_data=(X_test, y_test), batch_size=6, shuffle=True, verbose=1)
            # Cleanup
            del X_train, y_train, X_test, y_test, vectors_dict, ethnie_dict, test_image_counts, model, history, prediction
            tensorflow.keras.backend.clear_session()
            gc.collect()

However, when I run it, only the very first run in the whole code works fine. All the others stop after something like 1 or 2 epochs, even though 'val_mse' is still decreasing. I ran the same code with Keras 2.15.0 (tensorflow 2.15.0.post1) and it worked fine there.

Any help is much appreciated, thank you

mehtamansi29 commented 2 months ago

Hi @Senantq -

Can you share the dataset so that I can reproduce the issue?

Senantq commented 2 months ago

Hi @mehtamansi29 - Sure! Here is the link to a Google Drive folder where you can find the full code as well as the folder containing the dataset: https://drive.google.com/drive/folders/1W6y-X_UlUNDoHHV8gG4CT5K30LwJWZvc?usp=drive_link

mehtamansi29 commented 2 months ago

Hi @Senantq -

Thanks, but the drive link is not accessible for me. Can you provide an accessible link?

ghsanti commented 2 months ago

@Senantq Some possible causes that wouldn't be a bug:

  1. If one changes patience=2 to patience=5 but does not re-run the cell (this does not explain the variation, though).
  2. A variation of one epoch due to early_stopping not being created inside the loop: because it is outside the loop, the first loop iteration needs an extra epoch (see the sketch below for a per-run callback).
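
For reference, creating a fresh callback for every run would look roughly like the sketch below. It only illustrates the restructuring suggested in point 2, reusing the names from the original post; it is not a confirmed fix.

# Sketch: build a new EarlyStopping instance per run so that its internal
# state (wait counter, best value, best weights) cannot carry over between
# independent model.fit() calls.
for nb_cc_value in nb_cc:
    for l1_value in l1_values:
        for run in range(1, 3):
            early_stopping = EarlyStopping(monitor='val_mse', min_delta=0.001,
                                           patience=5, restore_best_weights=True)
            model = create_conv(nb_cc_value, l1_value)
            history = model.fit(X_train, y_train, epochs=epoques,
                                callbacks=[early_stopping],
                                validation_data=(X_test, y_test),
                                batch_size=6, shuffle=True, verbose=1)
            tensorflow.keras.backend.clear_session()
            gc.collect()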

I do not see other differences, but it may depend on the actual code if it differs from what is included.


In the OP, for a standard classification task one would normally use sparse categorical cross-entropy (SparseCE) or categorical cross-entropy (CE), but I assume the OP knows this and MSE is used for a reason.

It's easier to help if one includes a minimal, self-contained code snippet for the issue. Datasets are very easy to load, e.g. with keras.datasets.cifar10.load_data().

Example.
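
For instance, a minimal self-contained reproducer along these lines could look like the sketch below. It uses cifar10 and a deliberately tiny placeholder model, and only reuses the callback settings from the original post; it is illustrative, not the reporter's actual code.

import gc
import keras
from keras import layers

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
# Small slices keep the sketch fast; labels are cast to float because the
# model is compiled with an MSE loss/metric, as in the original report.
x_train, x_test = x_train[:2000] / 255.0, x_test[:500] / 255.0
y_train, y_test = y_train[:2000].astype("float32"), y_test[:500].astype("float32")

# One shared callback instance across all fits, as in the original report.
early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_mse", min_delta=0.001, patience=5, restore_best_weights=True)

for run in range(3):
    model = keras.Sequential([
        keras.Input(shape=(32, 32, 3)),
        layers.Conv2D(16, 3, activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse", metrics=["mse"])
    history = model.fit(x_train, y_train, validation_data=(x_test, y_test),
                        epochs=30, batch_size=64, verbose=1,
                        callbacks=[early_stopping])
    print(f"run {run}: stopped after {len(history.history['val_mse'])} epochs")
    keras.backend.clear_session()
    gc.collect()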

mehtamansi29 commented 2 months ago

Hi @Senantq -

I am unable to reproduce your exact code with your dataset as your drive link is not accessible.

But I ran your model with some of its layers on the mnist dataset with the same early stopping callback, and it seems to work fine. With EarlyStopping(monitor='val_mse', min_delta=0.001, patience=5, restore_best_weights=True), patience=5 and monitor='val_mse' mean that training stops only once 'val_mse' has failed to decrease (by at least min_delta) for 5 consecutive epochs.

Attached gist here for your reference.

Senantq commented 2 months ago

Hi everyone, I am very sorry for the delayed response. The link is now accessible; it contains my whole script, the dataset, and my conda environment YAML.

If one changes patience=2 to patience=5 but does not re-run the cell (this does not explain the variation, though).

The code is run as a .py script, so the problem does not come from there.

A variation of one epoch due to early_stopping not being created inside the loop: because it is outside the loop, the first loop iteration needs an extra epoch.

It could have been, maybe, but then I don't see why it works perfectly fine with TF 2.15/Keras 2.

In the OP, for a standard classification task one would normally use sparse categorical cross-entropy (SparseCE) or categorical cross-entropy (CE), but I assume the OP knows this and MSE is used for a reason.

This is completely intentional, thank you for the reminder.

It's easier to help if one includes a minimal, self-contained code snippet for the issue. Datasets are very easy to load, e.g. with keras.datasets.cifar10.load_data().

Understood. I will try to provide the simplest possible code next time, but I hesitated here because of the particularities of the training.

I am also encountering another problem with the very same script on a cluster, where the code stops within the first 30 minutes due to an OOM error on an A100, but runs for 7 hours straight on a V100, which has 8 GB less memory than the A100. So I am beginning to suspect a memory leak that could be due to the CUDA libs.
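
One way to check whether GPU memory actually grows across runs (a sketch only; it assumes a single GPU visible as 'GPU:0' and that TF's experimental memory-info API is available on the cluster build) would be to log memory after every iteration:

import tensorflow as tf

def log_gpu_memory(tag):
    # Reports current and peak device memory as seen by the TF allocator.
    info = tf.config.experimental.get_memory_info('GPU:0')
    print(f"{tag}: current={info['current'] / 1e6:.1f} MB, peak={info['peak'] / 1e6:.1f} MB")

# Called inside the training loop, after clear_session() and gc.collect(), e.g.:
# log_gpu_memory(f"nb_cc={nb_cc_value}, l1={l1_value}, run={run}")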

Thank you for the time spent on this.

mehtamansi29 commented 2 months ago

Hi @Senantq -

I am very sorry for the delayed response. The link is now accessible, it contains my whole script, the dataset, and my conda environment yaml.

Thanks for the code. This is the output I am getting after running it:

Ethnie: Caucasians - Subfolders kept in the training dataset: 0, test dataset: 0
Ethnie: Afro_Americans - Subfolders kept in the training dataset: 20, test dataset: 20

Code:

# Excerpt from the reporter's script; load_images_and_vectors and the loop
# variables (ethnies, prop, ethnie_exclue, base_directory) are defined earlier in that script.
for nb_cc_value in nb_cc:
    for ethnie in ethnies:
        for proportion in prop:
            proportion = proportion / 100.
            for l1_value in l1_values:
                for run in range(1, 3):  # (1, 11)
                    X_train, y_train, X_test, y_test, vectors_dict, ethnie_dict, test_image_counts = load_images_and_vectors(
                        target_folder=ethnie, base_dir=base_directory, proportion=proportion,
                        ethnie_exclue=ethnie_exclue, target_size=(150, 150), test_proportion=0.15)
                    print(X_train.shape)
It means that no images are coming through for training. Because of the loop, the model is initialized and trained for a few epochs, and once an iteration gets zero training images, the iteration stops.

Senantq commented 2 months ago

The fact that one of the main folders (here Caucasians) has no training images at the beginning of the 'proportion in prop' loop is expected; this is for research purposes related to my PhD in psychology. But the model should still receive plenty of training images from the other main folder (Afro_Americans, something like 20*130 images). I don't think this should stop the training, however.