Closed by HamDan1999 1 year ago
That "Invalid loss, terminating training" error is thrown from myCallbacks.py#L30 because the loss is infinite. For MobileNetV3, I've tried some training runs, and disabling replace_ReLU_with_PReLU works better. This may be caused by the hard_swish activation using relu inside. You may also try a script like "Mobilenet using Adamw + SGDW training on Emore dataset", which uses AdamW / SGDW instead of the L2 regularizer. Other things may also help, such as a smaller lr, or using {"loss": losses.ArcfaceLoss(scale=16), "epoch": 1, "optimizer": optimizer} as the first scheduler entry.
# basic_model = models.add_l2_regularizer_2_model(basic_model, weight_decay=5e-4, apply_to_batch_normal=False)
# basic_model = models.replace_ReLU_with_PReLU(basic_model)
...
tt = train.Train(..., lr_base=0.001, ...)
# optimizer = keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
optimizer = tfa.optimizers.AdamW(learning_rate=0.001, weight_decay=5e-5)
...
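To illustrate the hard_swish point, here is a rough scalar sketch. It assumes the replacement turns the clipped ReLU inside hard_swish into an unbounded PReLU; that is a guess about what happens, not something verified against the model code. The function names and the fixed alpha=0.25 are made up for the illustration.

```python
def relu6(x):
    """ReLU clipped to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def prelu(x, alpha=0.25):
    """PReLU with a fixed slope. In a real layer alpha is learned; 0.25 is illustrative."""
    return x if x >= 0 else alpha * x


def hard_swish(x):
    """Standard hard_swish: x * relu6(x + 3) / 6. The gate relu6(x + 3) / 6 stays in [0, 1]."""
    return x * relu6(x + 3.0) / 6.0


def hard_swish_prelu_inside(x, alpha=0.25):
    """Hypothetical variant: the inner clipped ReLU swapped for an unbounded PReLU.

    The gate is no longer limited to [0, 1], so the output grows
    roughly quadratically with |x| instead of linearly.
    """
    return x * prelu(x + 3.0, alpha) / 6.0


# Compare the two variants on a few inputs of increasing magnitude.
for x in (-10.0, -1.0, 0.0, 1.0, 10.0, 100.0):
    print(x, hard_swish(x), hard_swish_prelu_inside(x))
```

If the swap really behaves like this, large activations would be amplified layer after layer, which could be one way to end up with an infinite loss.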
Noted with thanks. I really appreciate your work. I will try and let you know.
Hi, I built several different (transformer-based) backbones and tried to train them using your code, but I got an error. I thought it was because of my backbone, but I also tried running MobileNetV3 and got the same error, as follows:
Learning rate for iter 1 is 0.10000000149011612, global_iterNum is 0
WARNING:tensorflow:5 out of the last 6 calls to <function Model.make_train_function.<locals>.train_function at 0x7f610c2ef160> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
10/45489 [..............................] - ETA: 4:55:10 - loss: 30.7130 - accuracy: 0.0000e+00
Error: Invalid loss, terminating training
An exception has occurred, use %tb to see the full traceback.
SystemExit
Here is my script:
import losses, train, models
import tensorflow_addons as tfa
import os
from tensorflow import keras
import tensorflow as tf

print(tf.config.list_physical_devices('GPU'))
keras.mixed_precision.set_global_policy("mixed_float16")

# Let TensorFlow grow GPU memory on demand instead of reserving it all up front
gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

data_path = 'datasets/faces_emore_112x112_folders'
eval_paths = ['datasets/faces_emore/lfw.bin', 'datasets/faces_emore/cfp_fp.bin', 'datasets/faces_emore/agedb_30.bin']

basic_model = models.buildin_models("mobilenetv3", dropout=0, emb_shape=512, output_layer='GDC', bn_momentum=0.9, bn_epsilon=1e-5)
basic_model = models.add_l2_regularizer_2_model(basic_model, weight_decay=5e-4, apply_to_batch_normal=False)
basic_model = models.replace_ReLU_with_PReLU(basic_model)

tt = train.Train(data_path, eval_paths=eval_paths, save_path='mobilenet.h5',
    basic_model=basic_model, model=None, lr_base=0.1, lr_decay=0.5, lr_decay_steps=16, lr_min=1e-5,
    batch_size=128, random_status=0, eval_freq=1, output_weight_decay=1)

optimizer = keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
sch = [
    {"loss": losses.ArcfaceLoss(scale=32), "epoch": 1, "optimizer": optimizer},
    {"loss": losses.ArcfaceLoss(scale=64), "epoch": 100},
]
tt.train(sch, 0)
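For context on the L2 regularizer versus AdamW / SGDW suggestion in this thread, the difference can be sketched on a single scalar weight. This is a minimal illustration, not the code of any optimizer library: the function names are made up, the momentum form follows the common v = momentum * v - lr * g convention, and the decoupled decay is assumed to be applied directly to the weight without being multiplied by the learning rate.

```python
def sgd_momentum_l2_step(w, v, grad, lr, momentum, l2_coef):
    """One SGD + momentum step with the L2 penalty folded into the gradient.

    l2_coef is the coefficient on w in the penalty gradient (assumed form).
    """
    g = grad + l2_coef * w        # penalty gradient is accumulated by momentum too
    v = momentum * v - lr * g
    return w + v, v


def sgdw_step(w, v, grad, lr, momentum, weight_decay):
    """One SGD + momentum step with decoupled weight decay (SGDW-style sketch)."""
    v = momentum * v - lr * grad  # momentum only sees the loss gradient
    w = w - weight_decay * w      # decay shrinks the weight directly
    return w + v, v


def run(step_fn, decay, steps, lr=0.1, momentum=0.9):
    """Apply step_fn repeatedly with a zero loss gradient to isolate the decay effect."""
    w, v = 1.0, 0.0
    for _ in range(steps):
        w, v = step_fn(w, v, 0.0, lr, momentum, decay)
    return w


# lr * l2_coef equals weight_decay here, so the first step matches and
# any later difference comes from momentum accumulating the L2 term.
for steps in (1, 2, 10):
    print(steps, run(sgd_momentum_l2_step, 5e-4, steps), run(sgdw_step, 5e-5, steps))
```

With these values the two variants agree after one step and then drift apart, because the L2 term is carried along by the momentum buffer while the decoupled decay is not.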