Strange behaviour: fitting a model with or without `with tf.device('/cpu:0'):` gives completely different training losses

System information

OS: macOS Monterey (12.6)
TensorFlow: tensorflow-macos 2.6.0 + tensorflow-metal 0.2.0 (binary)
TensorFlow Addons: 0.15.0.dev0 (source)
Python version: 3.9
Is GPU used?: No

Describe the bug

I'm using a third-party pre-trained model (a TCN-based network), which I want to fine-tune on different files. (I therefore can't provide reproducible code, since it depends on the pre-trained model.)

I'm doing this on an M1 Mac. Because of several installation errors and problems with some advanced optimizers (e.g. Lookahead), I started wrapping every interaction with the model (e.g. load, compile and fit) in `with tf.device('/cpu:0'):`. Otherwise I'd get errors related to attempts to use the GPU.

Recently, I realised that I get completely different results depending on whether I use this instruction when fitting, and I can't figure out why. This is the common code:

import tensorflow as tf
import tensorflow_addons as tfa
from tensorflow.keras import backend as K
from tensorflow.keras.models import load_model

# MODEL_PATH, get_widened_data_sequence and build_masked_loss are defined elsewhere in the project.
with tf.device('/cpu:0'):  # type: ignore
    original_model = load_model(str(MODEL_PATH)+'/onset_TCNv2.h5', compile=False)

learnrate = 0.002
clipnorm = 0.5
num_epochs = 50

ft = get_widened_data_sequence()

radam = tfa.optimizers.RectifiedAdam(learning_rate=learnrate, clipnorm=clipnorm)
ranger = tfa.optimizers.Lookahead(radam, sync_period=6, slow_step_size=0.5)
with tf.device('/cpu:0'):  # type: ignore
    original_model.compile(
        optimizer=ranger,
        loss=[build_masked_loss(K.binary_crossentropy) for _ in range(3)],  # one masked loss per output
        metrics=['binary_accuracy'])

In the end, if I do:

history = original_model.fit(ft, steps_per_epoch=len(ft), epochs=num_epochs, verbose=1)

I get the following training loss:

On the contrary, if I do:

with tf.device('/cpu:0'):
    history = original_model.fit(ft, steps_per_epoch=len(ft), epochs=num_epochs, verbose=1)

I get a very different training loss:

Any idea on what may be causing this issue?

Note: I've also tried with the same models, but with the simple Adam optimizer, and this behaviour did not occur.
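One way to narrow this down might be to check where ops are actually placed. The sketch below is only an illustration. It assumes the difference comes from ops landing on the Metal GPU when the `tf.device` block is omitted, which is not confirmed anywhere above. It hides all GPUs from TensorFlow at start-up, so everything runs on the CPU whether or not a `tf.device` block is used, and it turns on device-placement logging. The tensors `x` and `y` are placeholders, not part of the model code.

```python
import tensorflow as tf

# Assumption (unconfirmed): the differing losses come from ops being placed
# on the GPU when fit() is called outside a tf.device('/cpu:0') block.

# Hide every GPU from TensorFlow. This must run before any op or model is
# created; once the runtime is initialised it raises a RuntimeError.
tf.config.set_visible_devices([], "GPU")

# Log the device each op is placed on, to confirm nothing lands on a GPU.
tf.debugging.set_log_device_placement(True)

# Placeholder computation standing in for the real model.
x = tf.random.uniform((2, 3))
y = tf.matmul(x, tf.transpose(x))

# The device string of the result shows where the matmul ran.
print(y.device)
```

If the two `fit` calls produce matching losses with the GPU hidden this way, that would point to device placement rather than to the data or the optimizer settings.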
