tensorflow / addons

Useful extra functionality for TensorFlow 2.x maintained by SIG-addons
Apache License 2.0

Strange behaviour: - fitting model with or without `with tf.device('/cpu:0'):` gives completely different losses in training #2777

Closed asapsmc closed 1 year ago

asapsmc commented 2 years ago

System information

Describe the bug

I'm using a 3rd-party pre-trained model (a TCN-based network) that I want to fine-tune on different files. (I therefore can't provide reproducible code, since it depends on the pre-trained model.)

I'm doing this on a Mac M1. Due to several installation errors (see) and problems with some advanced optimizers (e.g. Lookahead), I started wrapping every interaction with the model (load, compile, fit) in `with tf.device('/cpu:0'):`; otherwise I'd get errors related to attempting to use the GPU.

Recently, I realised that I get completely different results depending on whether I use this instruction when fitting, and I can't figure out why. This is the common code:

```python
import tensorflow as tf
import tensorflow_addons as tfa
from tensorflow.keras import backend as K
from tensorflow.keras.models import load_model

with tf.device('/cpu:0'):  # type: ignore
    original_model = load_model(str(MODEL_PATH) + '/onset_TCNv2.h5', compile=False)

learnrate = 0.002
clipnorm = 0.5
num_epochs = 50

ft = get_widened_data_sequence()

radam = tfa.optimizers.RectifiedAdam(learning_rate=learnrate, clipnorm=clipnorm)
ranger = tfa.optimizers.Lookahead(radam, sync_period=6, slow_step_size=0.5)
# One masked binary-crossentropy loss per model output
masked_bce = build_masked_loss(K.binary_crossentropy)
with tf.device('/cpu:0'):  # type: ignore
    original_model.compile(optimizer=ranger,
                           loss=[masked_bce, masked_bce, masked_bce],
                           metrics=['binary_accuracy'])
```
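One variable worth ruling out before blaming device placement is run-to-run nondeterminism (weight initialization, data shuffling, dropout). A minimal sketch, assuming TensorFlow ≥ 2.8, that pins all RNG seeds before two comparison runs:

```python
import tensorflow as tf

# Seed the Python, NumPy and TensorFlow RNGs in one call so that two
# training runs start from identical random state (TF >= 2.7).
tf.keras.utils.set_random_seed(42)

# Optionally force deterministic kernel implementations as well (TF >= 2.8);
# this can slow training down but removes op-level nondeterminism.
tf.config.experimental.enable_op_determinism()
```

If the CPU and non-CPU runs still diverge with identical seeds, the difference is genuinely device-related rather than random variation.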

In the end, if I do:

```python
history = original_model.fit(ft, steps_per_epoch=len(ft), epochs=num_epochs, verbose=1)
```

I get the training loss shown in the first attached plot. By contrast, if I do:

```python
with tf.device('/cpu:0'):
    history = original_model.fit(ft, steps_per_epoch=len(ft), epochs=num_epochs, verbose=1)
```

I get a very different training loss (second attached plot).

Any idea on what may be causing this issue?
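To see which device each op actually lands on in the two runs, TensorFlow can log placements as ops execute. A minimal sketch (this must be enabled before any tensors or models are created):

```python
import tensorflow as tf

# Log the device (CPU/GPU) chosen for every op as it executes.
# Call this before creating any tensors or models.
tf.debugging.set_log_device_placement(True)

# Any subsequent op now logs a line such as
# "Executing op MatMul in device /job:localhost/.../device:CPU:0".
x = tf.matmul([[1.0, 2.0]], [[3.0], [4.0]])
```

Comparing these logs between the wrapped and unwrapped `fit` calls would show whether some ops silently fall back to a different device.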

Note: I've also tried the same model with the plain Adam optimizer, and this behaviour did not occur.
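As an alternative to wrapping individual calls, the GPU can be hidden from TensorFlow entirely at startup, which makes both code paths equivalent and is a quick way to confirm the difference is device-related. A sketch, assuming it runs before any ops execute:

```python
import tensorflow as tf

# Hide all GPUs from this process; must run before any tensors are created.
tf.config.set_visible_devices([], 'GPU')

# Everything below now runs on the CPU without explicit tf.device scopes.
assert tf.config.get_visible_devices('GPU') == []
```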

seanpmorgan commented 1 year ago

TensorFlow Addons is transitioning to a minimal maintenance and release mode. New features will not be added to this repository. For more information, please see our public messaging on this decision: TensorFlow Addons Wind Down

Please consider sending feature requests / contributions to other repositories in the TF community with a similar charter to TFA: Keras, Keras-CV, Keras-NLP.