Senantq opened this issue 2 months ago
Hi @Senantq -
Can you help me with the dataset to reproduce the issue?
Hi @mehtamansi29 Sure! Here is the link to a Google Drive where you can find the full code as well as the folder containing the dataset: https://drive.google.com/drive/folders/1W6y-X_UlUNDoHHV8gG4CT5K30LwJWZvc?usp=drive_link
Hi @Senantq -
Thanks, but the drive link is not accessible for me. Can you provide an accessible link?
@Senantq Some possible causes that wouldn't be a bug:

- If one changes patience=2 to patience=5 but does not run the cell (does not explain the variation, though).
- early_stopping not being within the mentioned loop. Because it's outside the loop, the first loop iteration needs an extra epoch. I do not see further deltas, but it may depend on the actual code, if it's different from the code included.

In the OP, for a standard classification one should use SparseCE or CE, but I assume the OP knows and it's used for a reason.

It's easier to help if one includes a minimal, self-contained code snippet for the issue. Datasets are very easy to load: from keras.api.datasets.cifar10 import load_data.
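To make the early_stopping point concrete, here is a minimal sketch of the difference between reusing one callback instance across runs and creating a fresh one per run (build_model and the synthetic data are placeholders, not the code from this issue):

```python
import numpy as np
import keras
from keras import layers

def build_model():
    # Placeholder model; the real architecture from the issue is unknown.
    model = keras.Sequential([layers.Dense(32, activation="relu"), layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse", metrics=["mse"])
    return model

x = np.random.rand(256, 8).astype("float32")
y = np.random.rand(256, 1).astype("float32")

# Variant 1 - one shared EarlyStopping instance: any internal state the
# callback keeps (wait counter, best monitored value) is shared by every
# fit() call in the loop.
early_stopping = keras.callbacks.EarlyStopping(monitor="val_mse", patience=2)
for run in range(3):
    model = build_model()
    model.fit(x, y, validation_split=0.15, epochs=20,
              callbacks=[early_stopping], verbose=0)

# Variant 2 - a fresh instance per run: each iteration starts from clean
# callback state, which is the safer pattern inside loops.
for run in range(3):
    model = build_model()
    early_stopping = keras.callbacks.EarlyStopping(monitor="val_mse", patience=2)
    model.fit(x, y, validation_split=0.15, epochs=20,
              callbacks=[early_stopping], verbose=0)
```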
Hi @Senantq -
I am unable to run your exact code with your dataset as your drive link is not accessible.
But I ran your model with some of its layers on the MNIST dataset with the same early-stopping callback, and it seems to work fine. With EarlyStopping(monitor='val_mse', min_delta=0.001, patience=5, restore_best_weights=True), patience=5 and monitor='val_mse' mean that if 'val_mse' has not decreased for 5 epochs, training stops.
Attached gist here for your reference.
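A minimal sketch approximating such a test (the layer sizes and epoch count are assumptions, not the reporter's model):

```python
import keras
from keras import layers
from keras.datasets.mnist import load_data

(x_train, y_train), _ = load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
y_train = keras.utils.to_categorical(y_train, 10)

model = keras.Sequential([
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["mse"])

# Stops once 'val_mse' has failed to improve by at least 0.001 for 5 epochs.
early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_mse", min_delta=0.001, patience=5, restore_best_weights=True
)
model.fit(x_train, y_train, validation_split=0.1, epochs=50,
          callbacks=[early_stopping])
```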
Hi everyone, I am very sorry for the delayed response. The link is now accessible; it contains my whole script, the dataset, and my conda environment YAML.
> If one changes patience=2 to patience=5 but does not run the cell (does not explain the variation, though).
The code is run as a .py script, so the problem does not come from there.
> Variation of one unit due to early_stopping not being within the mentioned loop. Because it's outside the loop, the first loop iteration needs an extra epoch.
It could be, maybe, but then I don't see why it works perfectly fine with TF 2.15/Keras 2.
> In the OP, for a standard classification one should use SparseCE or CE, but I assume the OP knows and it's used for a reason.
This is completely deliberate, thank you for the reminder.
> It's easier to help if one includes a minimal, self-contained code snippet for the issue. Datasets are very easy to load: from keras.api.datasets.cifar10 import load_data.
Understood. I will try to provide the simplest possible code next time, but I was hesitant to cut it down because of the particularities of the training here.
I am also encountering another problem with the very same script on a cluster, where the code stops within the first 30 minutes due to an OOM error on an A100, but runs for 7 hours straight on a V100, which has 8 GB less memory than the A100. So I am beginning to suspect a memory leak that could be due to the CUDA libs.
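Not a diagnosis, but a common mitigation for memory growth across many fit() calls in a loop is to drop the finished model and clear backend state between iterations. A sketch, with build_model and the data standing in for the real script:

```python
import gc
import keras

for run in range(10):  # stands in for the nested loops in the script
    model = build_model()  # placeholder for the real model construction
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50)
    del model
    keras.backend.clear_session()  # release backend/graph state from this run
    gc.collect()                   # release lingering Python references
```

If this removes the growth, it points to accumulated state across iterations rather than the CUDA libraries themselves.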
Thank you for the time spent.
Hi @Senantq -
> I am very sorry for the delayed response. The link is now accessible; it contains my whole script, the dataset, and my conda environment YAML.
Thanks for the code. I am getting this after running your code:

```
Ethnicity: Caucasians - Subfolders kept in the training dataset: 0, in the test dataset: 0
Ethnicity: Afro_Americans - Subfolders kept in the training dataset: 20, in the test dataset: 20
```
Code:

```python
for nb_cc_value in nb_cc:
    for ethnie in ethnies:
        for proportion in prop:
            proportion = proportion / 100.
            for l1_value in l1_values:
                for run in range(1, 3):  # (1, 11)
                    X_train, y_train, X_test, y_test, vectors_dict, ethnie_dict, test_image_counts = load_images_and_vectors(
                        target_folder=ethnie,
                        base_dir=base_directory,
                        proportion=proportion,
                        ethnie_exclue=ethnie_exclue,
                        target_size=(150, 150),
                        test_proportion=0.15,
                    )
                    print(X_train.shape)
```
It means there are no images coming through for training. Due to the loop, the model is initialized and trained for a few epochs, and once an iteration gets zero training images, the iteration stops.
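A guard at the top of the inner loop (reusing the variable names from the snippet above) would make that failure mode explicit instead of silent:

```python
if X_train.shape[0] == 0:
    print(f"No training images for ethnie={ethnie}, proportion={proportion}; skipping run.")
    continue
```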
The fact that one of the main folders (here Caucasians) has no training images at the beginning of the 'proportion in prop' loop is expected; this is for research purposes in my PhD in psychology. But the model should still receive plenty of training images from the other main folder (Afro_Americans, something like 20*130 images). I don't think this should stop the training, however.
Hello, I am using Keras 3.5 with TF 2.17. My code is more or less the following (but it is not a grid search, as in the real code I also increment some other variables that are not directly linked to the network):
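A rough sketch of the structure being described, with build_model, the data, and the hyperparameter lists as placeholders rather than the actual script:

```python
import keras

# EarlyStopping created once, outside the loop, as discussed in the
# comments above.
early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_mse", min_delta=0.001, patience=2, restore_best_weights=True
)

for l1_value in l1_values:             # placeholder hyperparameter sweep
    for run in range(1, 11):
        model = build_model(l1_value)  # placeholder: compiled with metrics=["mse"]
        model.fit(
            X_train, y_train,
            validation_data=(X_test, y_test),
            epochs=100,
            callbacks=[early_stopping],
        )
```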
However, when I run it, only the very first run in the whole code works fine. The others all stop at something like 1 or 2 epochs, even though the 'val_mse' metric is decreasing. I ran it using Keras 2.15.0 (tensorflow 2.15.0.post1) and it worked fine then.
Any help is much appreciated, thank you