tensorflow / recommenders-addons

Additional utils and helpers to extend TensorFlow when building recommendation systems, contributed and maintained by SIG Recommenders.
Apache License 2.0

How to save and restore a model with dynamic embeddings? #451

Open alykhantejani opened 3 months ago

alykhantejani commented 3 months ago

Hi,

I am training a model with dynamic embeddings (specifically HvdAllToAllEmbeddings). I am saving the model to disk with de.keras.models.de_save_model, and my dynamic embedding variables appear to be written to disk.

However, when restoring from this directory it appears only the dense weights get restored. I am restoring with model.load_weights(FLAGS.model_dir) as shown here

Am I supposed to restore a KVCreator too?

ZunwenYou commented 3 months ago

Same here!

When I load the trained model from disk for incremental training, it fails when calling fit(train_dataset).

I load the model with model = tf.keras.models.load_model(FLAGS.model_dir)

The error log is:

Traceback (most recent call last):
  File "/apdcephfs/dd_model/recommenders-addons-0.7.2/demo/dynamic_embedding/movielens-1m-keras/movielens-1m-keras.py", line 247, in <module>
    app.run(main)
  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/apdcephfs/dd_model/recommenders-addons-0.7.2/demo/dynamic_embedding/movielens-1m-keras/movielens-1m-keras.py", line 237, in main
    train()
  File "/apdcephfs/dd_model/recommenders-addons-0.7.2/demo/dynamic_embedding/movielens-1m-keras/movielens-1m-keras.py", line 147, in train
    model.fit(dataset, epochs=FLAGS.epochs, steps_per_epoch=FLAGS.steps_per_epoch)
  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:

Detected at node Adam/ResourceScatterAdd_3 defined at (most recent call last):
  File "/apdcephfs/dd_model/recommenders-addons-0.7.2/demo/dynamic_embedding/movielens-1m-keras/movielens-1m-keras.py", line 247, in <module>

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/absl/app.py", line 308, in run

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main

  File "/apdcephfs/dd_model/recommenders-addons-0.7.2/demo/dynamic_embedding/movielens-1m-keras/movielens-1m-keras.py", line 237, in main

  File "/apdcephfs/dd_model/recommenders-addons-0.7.2/demo/dynamic_embedding/movielens-1m-keras/movielens-1m-keras.py", line 147, in train

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/engine/training.py", line 1807, in fit

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/engine/training.py", line 1401, in train_function

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/engine/training.py", line 1384, in step_function

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/engine/training.py", line 1373, in run_step

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/engine/training.py", line 1154, in train_step

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py", line 544, in minimize

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py", line 1223, in apply_gradients

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py", line 652, in apply_gradients

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py", line 1253, in _internal_apply_gradients

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py", line 1345, in _distributed_apply_gradients_fn

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py", line 1342, in apply_grad_to_update_var

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py", line 241, in _update_step

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/optimizers/adam.py", line 185, in update_step

indices[0] = 0 is not in [0, 0)
         [[{{node Adam/ResourceScatterAdd_3}}]] [Op:__inference_train_function_3810]
2024-08-08 16:21:52.222686: W tensorflow/core/kernels/data/cache_dataset_ops.cc:858] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
2024-08-08 16:21:52.232673: W tensorflow/core/kernels/data/cache_dataset_ops.cc:858] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.

MoFHeka commented 3 months ago

Sorry, it is hard for TFRA to support the tf.keras.models.load_model API. load_model creates trainable variable objects using TensorFlow's own classes, but the TFRA trainable wrapper is not part of the TensorFlow codebase.
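
For anyone reading the log above: the message "indices[0] = 0 is not in [0, 0)" is a bounds-check failure. The half-open range [0, 0) is empty, which means the variable targeted by the scatter-add has zero rows in its first dimension, so even index 0 is rejected. Why the variable ends up empty after load_model is not confirmed in this thread; it would be consistent with the explanation above, but that is an assumption. The sketch below only mimics the bounds check in plain Python. The function name scatter_add is hypothetical and this is not TensorFlow's implementation.

```python
def scatter_add(target, indices, updates):
    """Add updates[i] to the row target[indices[i]], checking bounds first.

    Hypothetical pure-Python stand-in for a scatter-add op, used only to
    show where the "[0, N)" range in the error message comes from.
    """
    limit = len(target)  # N = number of rows in the target
    for pos, idx in enumerate(indices):
        if not 0 <= idx < limit:
            raise IndexError(f"indices[{pos}] = {idx} is not in [0, {limit})")
        target[idx] = [a + b for a, b in zip(target[idx], updates[pos])]
    return target


# A target with two rows accepts index 0.
print(scatter_add([[0.0, 0.0], [0.0, 0.0]], [0], [[1.0, 2.0]]))

# A target with zero rows rejects every index, including 0.
try:
    scatter_add([], [0], [[1.0, 2.0]])
except IndexError as err:
    print(err)
```

If the same message shows up in your own run, checking the shapes of the restored variables and optimizer slots before calling fit may help narrow down which one came back empty.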