Notebook 07: Cannot save model checkpoint for FoodVision Big

mrdbourke commented 1 year ago

Getting an error when training FoodVision Big:

# Fit the model with callbacks
history_101_food_classes_feature_extract = model.fit(train_data, 
                                                     epochs=3,
                                                     steps_per_epoch=len(train_data),
                                                     validation_data=test_data,
                                                     validation_steps=int(0.15 * len(test_data)),
                                                     callbacks=[create_tensorboard_callback("training_logs", 
                                                                                            "efficientnetb0_101_classes_all_data_feature_extract"),
                                                                model_checkpoint])

>>>WARNING:tensorflow:Can save best model only with val_acc available, skipping.

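Looks like it's an issue with the model_checkpoint callback: the warning says the metric it monitors (val_acc) isn't available in the training logs, so no checkpoint gets saved.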
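
A rough sketch of why that warning shows up, assuming the callback was created with monitor="val_acc" (the warning text suggests so, but the callback definition isn't shown here). This is a simplified stand-in for the save-best-only check, not the actual Keras code, and the should_save helper and log values are made up for illustration. In TensorFlow 2.x, compiling with metrics=["accuracy"] typically logs the validation metric as val_accuracy rather than val_acc:

```python
def should_save(monitor, logs, best):
    """Simplified save-best-only check (higher is better). Not the Keras source."""
    current = logs.get(monitor)
    if current is None:
        # Metric name not found in the epoch logs: nothing gets saved
        print(f"Can save best model only with {monitor} available, skipping.")
        return False, best
    if current > best:
        return True, current  # new best value, so a checkpoint would be written
    return False, best


# Hypothetical logs for one epoch of a model compiled with metrics=["accuracy"]
epoch_logs = {"loss": 1.21, "accuracy": 0.68, "val_loss": 1.10, "val_accuracy": 0.71}

saved_wrong_name, _ = should_save("val_acc", epoch_logs, best=float("-inf"))
saved_right_name, best = should_save("val_accuracy", epoch_logs, best=float("-inf"))
print(saved_wrong_name, saved_right_name, best)
```

If that is the cause, passing monitor="val_accuracy" to tf.keras.callbacks.ModelCheckpoint should let the checkpoint save during training.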

This causes the assertion for the cloned model later on to fail:

# Evaluate cloned model with loaded weights (should be same score as trained model)
results_cloned_model_with_loaded_weights = cloned_model.evaluate(test_data)

>>> ---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_1443486/1110829135.py in <module>
      1 # Loaded checkpoint weights should return very similar results to checkpoint weights prior to saving
      2 import numpy as np
----> 3 assert np.isclose(results_feature_extract_model, results_cloned_model_with_loaded_weights).all() # check if all elements in array are close

AssertionError:

Need to update the model checkpoint to make sure it can save a model whilst training.
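
The bare assert only raises AssertionError without saying which value is off. A minimal sketch of a more readable comparison, assuming both results are [loss, accuracy] lists as returned by model.evaluate() with a single accuracy metric. The find_mismatches helper and the numbers are hypothetical, and the tolerance check mirrors the np.isclose defaults (rtol=1e-05, atol=1e-08):

```python
def find_mismatches(original, loaded, names=("loss", "accuracy"), rtol=1e-05, atol=1e-08):
    """Return a description of every metric that differs beyond tolerance."""
    mismatches = []
    for name, a, b in zip(names, original, loaded):
        # Same closeness rule np.isclose uses: |a - b| <= atol + rtol * |b|
        if abs(a - b) > atol + rtol * abs(b):
            mismatches.append(f"{name}: original={a}, loaded={b}")
    return mismatches


# Hypothetical evaluation results, for illustration only
results_original = [1.0, 0.75]
results_loaded = [4.5, 0.01]

problems = find_mismatches(results_original, results_loaded)
print(problems or "Results match within tolerance")
```

A large gap in both values, as in the made-up numbers here, would hint that the weights were never saved or loaded, rather than a small numerical difference.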

mrdbourke commented 1 year ago

Also getting this in Google Colab:

[Screenshot: Screen Shot 2022-09-12 at 4 12 56 pm]

mrdbourke commented 1 year ago

After troubleshooting this for a while, it seems there may be something up with the tf.keras.models.clone_model method.

What exactly, I'm not sure.

It could be due to the use of tf.keras.applications.efficientnet models (which are notorious for errors across TensorFlow versions).

In saying that, a fix I've found to demonstrate the "cloning" and loading of weights is to create a copy of the model by using the exact same code to create it:

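import tensorflow as tf
from tensorflow.keras import layers

# Create a function to recreate the original model
# (assumes class_names is already defined, as in the notebook)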
def create_model():
  # Create base model
  input_shape = (224, 224, 3)
  base_model = tf.keras.applications.efficientnet.EfficientNetB0(include_top=False)
  base_model.trainable = False # freeze base model layers

  # Create Functional model 
  inputs = layers.Input(shape=input_shape, name="input_layer")
  # Note: EfficientNetBX models have rescaling built-in but if your model didn't you could have a layer like below
  # x = layers.Rescaling(1./255)(x)
  x = base_model(inputs, training=False) # set base_model to inference mode only
  x = layers.GlobalAveragePooling2D(name="pooling_layer")(x)
  x = layers.Dense(len(class_names))(x) # want one output neuron per class 
  # Separate activation of output layer so we can output float32 activations
  outputs = layers.Activation("softmax", dtype=tf.float32, name="softmax_float32")(x) 
  model = tf.keras.Model(inputs, outputs)

  return model

# Create and compile a new version of the original model (new weights)
created_model = create_model()
created_model.compile(loss="sparse_categorical_crossentropy",
                      optimizer=tf.keras.optimizers.Adam(),
                      metrics=["accuracy"])

# Load the saved weights
created_model.load_weights(checkpoint_path)

# Evaluate the model with loaded weights
results_created_model_with_loaded_weights = created_model.evaluate(test_data)

# Compare results with original model
import numpy as np
assert np.isclose(results_feature_extract_model, results_created_model_with_loaded_weights).all(), "Loaded weights results are not close to original model."  # check if all elements in array are close

In short, instead of using tf.keras.models.clone_model to compare weights, recreate a new instance of the same model and load the saved weights into it.

mrdbourke commented 1 year ago

Continuing this here: https://github.com/mrdbourke/tensorflow-deep-learning/discussions/550

In short, it looks like TensorFlow 2.13+ (available via tf-nightly as of May 2023) fixes most of the issues discussed.
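
A small sketch for checking whether an environment is on 2.13 or later before re-running the notebook. The version_at_least helper is hypothetical (not part of TensorFlow) and only looks at the major and minor numbers, so nightly-style suffixes are ignored:

```python
def version_at_least(version, minimum=(2, 13)):
    """Check whether a TensorFlow-style version string meets a (major, minor) minimum."""
    major, minor = version.split(".")[:2]
    return (int(major), int(minor)) >= minimum


# Example version strings (hypothetical); in a notebook you'd pass tf.__version__
print(version_at_least("2.12.0"))
print(version_at_least("2.14.0-dev20230518"))
```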

Notebook 07: Cannot save model checkpoint for FoodVision Big #449