tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

When following Magenta's Score2perf's README, checkpoint doesn't have some keys #1869

Open heyzude opened 3 years ago

heyzude commented 3 years ago

Ubuntu 18.04, Python 3.7.9, TensorFlow 2.3.1

I am following https://github.com/magenta/magenta/blob/master/magenta/models/score2perf/README.md. The problem occurs in the "Training" and "Sampling from the model" parts.

The Training command in the README is:

DATA_DIR=/generated/tfrecords/dir
HPARAMS_SET=score2perf_transformer_base
MODEL=transformer
PROBLEM=score2perf_maestro_language_uncropped_aug
TRAIN_DIR=/training/dir

HPARAMS=\
"label_smoothing=0.0,"\
"max_length=0,"\
"max_target_seq_length=2048"

t2t_trainer \
  --data_dir="${DATA_DIR}" \
  --hparams=${HPARAMS} \
  --hparams_set=${HPARAMS_SET} \
  --model=${MODEL} \
  --output_dir=${TRAIN_DIR} \
  --problem=${PROBLEM} \
  --train_steps=1000000

When I run the Training command as the README describes, I get the error below. It appears after 1000 training steps, when the script tries to load the step-1000 checkpoint and run evaluation.

Not found: Key transformer/parallel_0_3/transformer/transformer/body/decoder/layer_0/self_attention/multihead_attention/k/kernel not found in checkpoint
[[node save/RestoreV2_1 (defined at /.pyenv/versions/3.7.9/envs/tensor2tensor/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py:629) ]]
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

However, when I run the Training command again, it surprisingly loads the checkpoint, resumes training from step 1000, and saves the step-2000 weights. But when it then loads the step-2000 checkpoint and tries to run evaluation, it fails again.

Inference (Sampling from the model) simply fails.
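One way to narrow this down (a diagnostic sketch, not a confirmed fix) is to list the variable names actually stored in the checkpoint and compare them with the key named in the NotFoundError. The helper names `list_checkpoint_keys`, `shared_suffix_len`, and `closest_keys` are hypothetical, and the example keys in the demo are made up for illustration; only `tf.train.latest_checkpoint` and `tf.train.list_variables` are real TensorFlow calls. If the closest stored key differs only in its leading scope (for example, no `parallel_0_3` prefix), that would suggest a variable-scope mismatch between the training graph and the evaluation graph rather than a corrupt checkpoint.

```python
def list_checkpoint_keys(checkpoint_dir):
    """Return the variable names stored in the latest checkpoint in checkpoint_dir."""
    # Imported lazily so the pure-Python helpers below work without TensorFlow.
    import tensorflow as tf

    path = tf.train.latest_checkpoint(checkpoint_dir)
    if path is None:
        raise FileNotFoundError("no checkpoint found in " + checkpoint_dir)
    # tf.train.list_variables yields (name, shape) pairs.
    return [name for name, _shape in tf.train.list_variables(path)]


def shared_suffix_len(key_a, key_b):
    """Count how many trailing '/'-separated components two keys share."""
    count = 0
    for a, b in zip(reversed(key_a.split("/")), reversed(key_b.split("/"))):
        if a != b:
            break
        count += 1
    return count


def closest_keys(missing_key, checkpoint_keys, n=5):
    """Return up to n checkpoint keys sharing the longest suffix with missing_key."""
    ranked = sorted(
        checkpoint_keys,
        key=lambda k: shared_suffix_len(missing_key, k),
        reverse=True,
    )
    return [k for k in ranked[:n] if shared_suffix_len(missing_key, k) > 0]


if __name__ == "__main__":
    # The key reported in the error message above.
    missing = (
        "transformer/parallel_0_3/transformer/transformer/body/decoder/"
        "layer_0/self_attention/multihead_attention/k/kernel"
    )
    # Hypothetical checkpoint contents, for illustration only.
    example_keys = [
        "transformer/body/decoder/layer_0/self_attention/multihead_attention/k/kernel",
        "transformer/body/decoder/layer_0/self_attention/multihead_attention/q/kernel",
        "global_step",
    ]
    for key in closest_keys(missing, example_keys):
        print(shared_suffix_len(missing, key), key)
```

Against a real run, `example_keys` would be replaced by `list_checkpoint_keys("/training/dir")`, using the TRAIN_DIR from the Training command.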

Could anyone help me? Thanks in advance.

dongmingli-Ben commented 3 years ago

I hit the same problem with TensorFlow 2.4.0. Loading a checkpoint downloaded from Magenta, as in the Colab, fails. When I run the training commands, the checkpoints saved at 1000 steps cannot be loaded.

almostimplemented commented 2 years ago

Also hitting this issue right now, training fresh.