leondgarse / Keras_insightface

Insightface Keras implementation
MIT License

Error in the training #105

Closed HamDan1999 closed 1 year ago

HamDan1999 commented 1 year ago

Hi, I built different backbones (transformer-based) and tried to train them using your code, but I got an error. I thought it was caused by my backbones, but I also tried running MobileNetV3 and got the same error:

Init type by loss function name...
Train arcface...
Init softmax dataset...
reloaded from dataset backup: faces_emore_112x112_folders_shuffle.npz
Image length: 5822653, Image class length: 5822653, classes: 85742
Use specified optimizer: <keras.optimizers.optimizer_v2.gradient_descent.SGD object at 0x7f60c545bf10>
Add L2 regularizer to model output layer, output_weight_decay = 0.000500
Add arcface layer, arc_kwargs={'loss_top_k': 1, 'append_norm': False, 'partial_fc_split': 0, 'name': 'arcface'}, vpl_kwargs={'vpl_lambda': 0.15, 'start_iters': -45489, 'allowed_delta': 200}...
loss_weights: {'arcface': 1}

Learning rate for iter 1 is 0.10000000149011612, global_iterNum is 0
WARNING:tensorflow:5 out of the last 6 calls to <function Model.make_train_function..train_function at 0x7f610c2ef160> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
10/45489 [..............................] - ETA: 4:55:10 - loss: 30.7130 - accuracy: 0.0000e+00
Error: Invalid loss, terminating training
An exception has occurred, use %tb to see the full traceback.

SystemExit
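
For reference, the "Invalid loss, terminating training" guard behaves like Keras' built-in keras.callbacks.TerminateOnNaN callback, extended to also catch an infinite loss. A minimal sketch of such a check; InvalidLossGuard is a hypothetical stand-in, not the repo's actual callback (which exits via SystemExit, as shown above):

import numpy as np
from tensorflow import keras

class InvalidLossGuard(keras.callbacks.Callback):
    # Stop training as soon as the batch loss is NaN or infinite.
    def on_batch_end(self, batch, logs=None):
        loss = (logs or {}).get("loss")
        if loss is not None and not np.isfinite(loss):
            print("Error: Invalid loss, terminating training")
            self.model.stop_training = True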

Here is my script:

import losses, train, models
import tensorflow_addons as tfa
import os
from tensorflow import keras
import tensorflow as tf


print(tf.config.list_physical_devices('GPU'))

keras.mixed_precision.set_global_policy("mixed_float16")

gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

data_path = 'datasets/faces_emore_112x112_folders'
eval_paths = ['datasets/faces_emore/lfw.bin', 'datasets/faces_emore/cfp_fp.bin', 'datasets/faces_emore/agedb_30.bin']

basic_model = models.buildin_models("mobilenetv3", dropout=0, emb_shape=512, output_layer='GDC', bn_momentum=0.9, bn_epsilon=1e-5)
basic_model = models.add_l2_regularizer_2_model(basic_model, weight_decay=5e-4, apply_to_batch_normal=False)
basic_model = models.replace_ReLU_with_PReLU(basic_model)

tt = train.Train(data_path, eval_paths=eval_paths,
    save_path='mobilenet.h5', basic_model=basic_model, model=None,
    lr_base=0.1, lr_decay=0.5, lr_decay_steps=16, lr_min=1e-5,
    batch_size=128, random_status=0, eval_freq=1, output_weight_decay=1)

optimizer = keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
sch = [
    {"loss": losses.ArcfaceLoss(scale=32), "epoch": 1, "optimizer": optimizer},
    {"loss": losses.ArcfaceLoss(scale=64), "epoch": 100},
]
tt.train(sch, 0)

leondgarse commented 1 year ago

That "terminating training" error is thrown from myCallbacks.py#L30 because the loss is infinite. For MobileNetV3, in my own training runs, disabling replace_ReLU_with_PReLU works better; the problem may be caused by the hard_swish activation using relu inside. You may also try a script like Mobilenet using Adamw + SGDW training on Emore dataset, which uses AdamW / SGDW instead of an L2 regularizer. Other things may also help, such as a smaller lr, or using {"loss": losses.ArcfaceLoss(scale=16), "epoch": 1, "optimizer": optimizer} as the first scheduler entry.

# Skip the L2 regularizer and the ReLU -> PReLU swap:
# basic_model = models.add_l2_regularizer_2_model(basic_model, weight_decay=5e-4, apply_to_batch_normal=False)
# basic_model = models.replace_ReLU_with_PReLU(basic_model)
...
# Use a smaller base learning rate:
tt = train.Train(..., lr_base=0.001, ...)
# Use AdamW with decoupled weight decay instead of SGD + L2:
# optimizer = keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
optimizer = tfa.optimizers.AdamW(learning_rate=0.001, weight_decay=5e-5)
...
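
For reference, hard_swish is built from relu6 (hard_swish(x) = x * relu6(x + 3) / 6), which is the "relu inside" mentioned above. Applying the gentler first scheduler entry to the script from the question would look roughly like the sketch below; keeping a scale=32 stage as an intermediate step is an assumption, not part of the original advice:

# Warm up with a smaller ArcFace scale before ramping it up:
sch = [
    {"loss": losses.ArcfaceLoss(scale=16), "epoch": 1, "optimizer": optimizer},
    {"loss": losses.ArcfaceLoss(scale=32), "epoch": 1},
    {"loss": losses.ArcfaceLoss(scale=64), "epoch": 100},
]
tt.train(sch, 0)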
HamDan1999 commented 1 year ago

Noted, with thanks. I really appreciate your work. I will try it and let you know.