tensorflow / models

Models and examples built with TensorFlow

Several errors (4) in the movinet streaming_model_training_and_inference notebook when simply run through on Kaggle and Colab #11183

Closed EmreSafaCelik closed 2 months ago

EmreSafaCelik commented 3 months ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/official/projects/movinet/movinet_streaming_model_training_and_inference.ipynb

2. Describe the bug

I noticed 4 errors in the movinet streaming_model_training_and_inference notebook; as it stands, it is impossible to train and export the streaming model using it as a guide.

3. Steps to reproduce

First off, if running on Kaggle you will get the following error when installing tf-models-official, but you can (safely?) ignore it and continue:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
keras-nlp 0.8.1 requires keras-core, which is not installed.
tensorflow-decision-forests 1.8.1 requires wurlitzer, which is not installed.
tensorflow-decision-forests 1.8.1 requires tensorflow~=2.15.0, but you have tensorflow 2.16.1 which is incompatible.

Error 1-) When simply running through the cells, we get to this cell:

def build_classifier(batch_size, num_frames, resolution, backbone, num_classes):
  """Builds a classifier on top of a backbone model."""
  model = movinet_model.MovinetClassifier(
      backbone=backbone,
      num_classes=num_classes)
  model.build([batch_size, num_frames, resolution, resolution, 3])

  return model

# Construct loss, optimizer and compile the model
with distribution_strategy.scope():
  model = build_classifier(batch_size, num_frames, resolution, backbone, 10)
  loss_obj = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
  optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001)
  model.compile(loss=loss_obj, optimizer=optimizer, metrics=['accuracy'])

Which gives the error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[12], line 15
     13 loss_obj = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
     14 optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001)
---> 15 model.compile(loss=loss_obj, optimizer=optimizer, metrics=['accuracy'])

File /opt/conda/lib/python3.10/site-packages/tf_keras/src/utils/traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)
     67     filtered_tb = _process_traceback_frames(e.__traceback__)
     68     # To get the full stack trace, call:
     69     # `tf.debugging.disable_traceback_filtering()`
---> 70     raise e.with_traceback(filtered_tb) from None
     71 finally:
     72     del filtered_tb

File /opt/conda/lib/python3.10/site-packages/tf_keras/src/optimizers/__init__.py:335, in get(identifier, **kwargs)
    330     return get(
    331         config,
    332         use_legacy_optimizer=use_legacy_optimizer,
    333     )
    334 else:
--> 335     raise ValueError(
    336         f"Could not interpret optimizer identifier: {identifier}"
    337     )

ValueError: Could not interpret optimizer identifier: <keras.src.optimizers.adam.Adam object at 0x7b4b686491e0>

A workaround to this is to change the compiling from:

model.compile(loss=loss_obj, optimizer=optimizer, metrics=['accuracy'])

to:

model.compile(loss=loss_obj, optimizer="adam", metrics=['accuracy'])

This means we can no longer specify a learning rate, but at least we get past this part and can see the other errors.
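If you want to keep the explicit learning rate, a possible alternative (untested here, and assuming the root cause is the Keras 3 / tf_keras mismatch described later in this thread) is to create the loss and optimizer from tf_keras, which is what the model appears to be built on judging by the traceback, instead of from tf.keras:

import tf_keras

with distribution_strategy.scope():
  model = build_classifier(batch_size, num_frames, resolution, backbone, 10)
  # tf_keras objects match the tf_keras-based model, so the optimizer
  # identifier resolves correctly and the learning rate is kept
  loss_obj = tf_keras.losses.SparseCategoricalCrossentropy(from_logits=True)
  optimizer = tf_keras.optimizers.Adam(learning_rate=0.001)
  model.compile(loss=loss_obj, optimizer=optimizer, metrics=['accuracy'])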

Error 2-) After continuing we get to this part:

checkpoint_path = "trained_model/cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create a callback that saves the model's weights
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)

which gives the easily solvable error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[14], line 5
      2 checkpoint_dir = os.path.dirname(checkpoint_path)
      4 # Create a callback that saves the model's weights
----> 5 cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
      6                                                  save_weights_only=True,
      7                                                  verbose=1)

File /opt/conda/lib/python3.10/site-packages/keras/src/callbacks/model_checkpoint.py:183, in ModelCheckpoint.__init__(self, filepath, monitor, verbose, save_best_only, save_weights_only, mode, save_freq, initial_value_threshold)
    181 if save_weights_only:
    182     if not self.filepath.endswith(".weights.h5"):
--> 183         raise ValueError(
    184             "When using `save_weights_only=True` in `ModelCheckpoint`"
    185             ", the filepath provided must end in `.weights.h5` "
    186             "(Keras weights format). Received: "
    187             f"filepath={self.filepath}"
    188         )
    189 else:
    190     if not self.filepath.endswith(".keras"):

ValueError: When using `save_weights_only=True` in `ModelCheckpoint`, the filepath provided must end in `.weights.h5` (Keras weights format). Received: filepath=trained_model/cp.ckpt

We get past this one by changing the extension from .ckpt to .weights.h5:

checkpoint_path = "trained_model/cp.weights.h5"

The change in behaviour does not matter much, since we won't even be able to use the callback because of the next error.
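For reference, the extension check comes from the Keras 3 version of ModelCheckpoint. Another possible workaround, again assuming the tf_keras mismatch mentioned later in this thread, is to create the callback from tf_keras instead; it accepts the original .ckpt path and writes TF-format checkpoints:

import tf_keras

checkpoint_path = "trained_model/cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

# tf_keras' ModelCheckpoint does not require the .weights.h5 extension
cp_callback = tf_keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)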

Error 3-) We continue to the next cell:

results = model.fit(train_ds,
                    validation_data=val_ds,
                    epochs=2,
                    validation_freq=1,
                    verbose=1,
                    callbacks=[cp_callback])

Which gives the error I couldn't manage to solve:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[16], line 1
----> 1 results = model.fit(train_ds,
      2                     validation_data=val_ds,
      3                     epochs=2,
      4                     validation_freq=1,
      5                     verbose=1,
      6                     callbacks=[cp_callback])

File /opt/conda/lib/python3.10/site-packages/tf_keras/src/utils/traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)
     67     filtered_tb = _process_traceback_frames(e.__traceback__)
     68     # To get the full stack trace, call:
     69     # `tf.debugging.disable_traceback_filtering()`
---> 70     raise e.with_traceback(filtered_tb) from None
     71 finally:
     72     del filtered_tb

File /opt/conda/lib/python3.10/site-packages/tf_keras/src/callbacks.py:245, in <genexpr>(.0)
    237 # Performance optimization: determines if batch hooks need to be called.
    239 self._supports_tf_logs = all(
    240     getattr(cb, "_supports_tf_logs", False) for cb in self.callbacks
    241 )
    242 self._batch_hooks_support_tf_logs = all(
    243     getattr(cb, "_supports_tf_logs", False)
    244     for cb in self.callbacks
--> 245     if cb._implements_train_batch_hooks()
    246     or cb._implements_test_batch_hooks()
    247     or cb._implements_predict_batch_hooks()
    248 )
    250 self._should_call_train_batch_hooks = any(
    251     cb._implements_train_batch_hooks() for cb in self.callbacks
    252 )
    253 self._should_call_test_batch_hooks = any(
    254     cb._implements_test_batch_hooks() for cb in self.callbacks
    255 )

AttributeError: 'ModelCheckpoint' object has no attribute '_implements_train_batch_hooks'

What we can do is remove the callback altogether to see the errors that appear later in the code:

results = model.fit(train_ds,
                    validation_data=val_ds,
                    epochs=2,
                    validation_freq=1,
                    verbose=1,
                    callbacks=[cp_callback]  #remove this argument
)
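The underlying problem seems to be that a callback created via tf.keras is a Keras 3 object, which lacks the internal batch-hook methods that tf_keras's fit() checks for. If the callback is instead built from tf_keras as sketched under Error 2, it should be possible to keep it; a minimal, untested sketch:

# cp_callback created with tf_keras.callbacks.ModelCheckpoint (see Error 2)
results = model.fit(train_ds,
                    validation_data=val_ds,
                    epochs=2,
                    validation_freq=1,
                    verbose=1,
                    callbacks=[cp_callback])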

Error 4-) The next error is in the first cell of the "Reconstruct the whole model with use_external_states=True to make the inference using states." section. It should go away on its own once errors 2 and 3 are fixed, since that cell needs the checkpoint saved by ModelCheckpoint to run (as far as I understand):

model_id = 'a0'
use_positional_encoding = model_id in {'a3', 'a4', 'a5'}
resolution = 172

# Create backbone and model.
backbone = movinet.Movinet(
    model_id=model_id,
    causal=True,
    conv_type='2plus1d',
    se_type='2plus3d',
    activation='hard_swish',
    gating_activation='hard_sigmoid',
    use_positional_encoding=use_positional_encoding,
    use_external_states=True,
)

model = movinet_model.MovinetClassifier(
    backbone,
    num_classes=10,
    output_states=True)

# Create your example input here.
# Refer to the paper for recommended input shapes.
inputs = tf.ones([1, 13, 172, 172, 3])

# [Optional] Build the model and load a pretrained checkpoint.
model.build(inputs.shape)

# Load weights from the checkpoint to the rebuilt model
checkpoint_dir = 'trained_model'
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
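One extra detail, assuming the .weights.h5 workaround from Error 2 is used: tf.train.latest_checkpoint() only looks for TF-format checkpoint index files, so it would return None here and the weights file would have to be loaded directly. A minimal sketch (untested, and it is not guaranteed that the tf_keras-based model can read a weights file written by the Keras 3 callback):

# Point load_weights at the Keras weights file directly instead of using
# tf.train.latest_checkpoint, which only finds TF-format checkpoints.
checkpoint_path = 'trained_model/cp.weights.h5'
model.load_weights(checkpoint_path)

If the callback is created from tf_keras with the original .ckpt path instead, the tf.train.latest_checkpoint() call above works unchanged.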

4. System information

The errors occur on both Kaggle and Colab.

laxmareddyp commented 3 months ago

Hi @EmreSafaCelik ,

Thanks for letting us know about this problem. We're aware it's due to recent changes in Keras 3. Specifically, there's an issue with how the code references tf.keras which now points to Keras 3. We'll fix the notebook to address this and provide an update as soon as possible.

Thanks.

EmreSafaCelik commented 3 months ago

Alright, thank you!

engares commented 3 months ago

Having the same problem here! Thanks!

engares commented 3 months ago

> Hi @EmreSafaCelik ,
>
> Thanks for letting us know about this problem. We're aware it's due to recent changes in Keras 3. Specifically, there's an issue with how the code references tf.keras which now points to Keras 3. We'll fix the notebook to address this and provide an update as soon as possible.
>
> Thanks.

Hi,

Is there any update on this?

Thanks!

laxmareddyp commented 3 months ago

Hi @engares,

We are still working on it; you will see a comment here from our side once we have made the changes.

Thanks

EmreSafaCelik commented 3 months ago

Hi @engares,

You can just change the line "import tensorflow as tf" to "import tensorflow as tf, tf_keras", and then change all tf.keras. calls to start with tf_keras. instead. Hope this helps.
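For anyone else hitting this, a compact sketch of that suggestion applied to the failing cells (my reading of the workaround, not the official notebook fix):

import tensorflow as tf, tf_keras

# Error 1: build the loss/optimizer from tf_keras instead of tf.keras (Keras 3)
loss_obj = tf_keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf_keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss=loss_obj, optimizer=optimizer, metrics=['accuracy'])

# Errors 2 and 3: create the checkpoint callback from tf_keras as well, so the
# original .ckpt path and model.fit(..., callbacks=[cp_callback]) both work
cp_callback = tf_keras.callbacks.ModelCheckpoint(filepath="trained_model/cp.ckpt",
                                                 save_weights_only=True,
                                                 verbose=1)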

google-ml-butler[bot] commented 2 months ago

Are you satisfied with the resolution of your issue?