Changl14 opened this issue 5 years ago (status: Open)
### Description

Hi, I'm hitting an error when decoding with the `translate_enzh_wmt32k_rev` problem. Training with the reversed problem name runs fine, but decoding is interrupted by the checkpoint-restore error below. ...
### Environment information

```
OS: Ubuntu 16.04

$ pip freeze | grep tensor
tensorflow-gpu==1.12.0
tensor2tensor==1.12.0

$ python -V
Python 3.6
```
### For bugs: reproduction and error logs

```bash
# Steps to reproduce:
#!/bin/bash
GPU=0
PROBLEM=translate_enzh_wmt32k
MODEL=transformer
HPARAMS=transformer_base
REV_PROBLEM=${PROBLEM}_rev

root_path=~/back_translate
DATA_DIR=${root_path}/t2t_data/$PROBLEM
TMP_DIR=${root_path}/t2t_datagen/$PROBLEM
TRAIN_DIR=${root_path}/t2t_train/$PROBLEM/$MODEL-$HPARAMS
REV_TRAIN_DIR=${root_path}/t2t_train/$REV_PROBLEM/$MODEL-$HPARAMS
USR_DIR=${root_path}/usr_dir
AVG_DIR=$root_path/t2t_avg/$REV_PROBLEM/$MODEL-$HPARAMS

mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR $REV_TRAIN_DIR
mkdir -p $USR_DIR $AVG_DIR

## Generate data
#t2t-datagen \
#  --data_dir=$DATA_DIR \
#  --tmp_dir=$TMP_DIR \
#  --problem=$PROBLEM \
#  --t2t_usr_dir=${USR_DIR}
#
## Train
## * If you run out of memory, add --hparams='batch_size=1024'.
#CUDA_VISIBLE_DEVICES=${GPU} t2t-trainer \
#  --data_dir=$DATA_DIR \
#  --problem=$REV_PROBLEM \
#  --model=$MODEL \
#  --hparams_set=$HPARAMS \
#  --output_dir=$REV_TRAIN_DIR \
#  --t2t_usr_dir=${USR_DIR} \
#  --train_steps=300000 \
#  --random_seed=65 \
#  --worker_gpu=2 \
#  --schedule=train

#CUDA_VISIBLE_DEVICES=0 t2t-avg-all \
#  --model_dir=$REV_TRAIN_DIR \
#  --output_dir=$AVG_DIR \
#  --n=20

DECODE_FILE=$root_path/decoder/monoling_0008.token
##
BEAM_SIZE=4
ALPHA=0.6

CUDA_VISIBLE_DEVICES=${GPU} t2t-decoder \
  --data_dir=$DATA_DIR \
  --problem=$REV_PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$AVG_DIR \
  --decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" \
  --decode_from_file=$DECODE_FILE \
  --decode_to_file=$DECODE_FILE.en
```
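Not part of the original report, but one way to narrow this down: since the decoder restores from `$AVG_DIR`, it may help to check whether `t2t-avg-all` dropped variables (such as the Adam optimizer slots) that the restore op still expects. Below is a minimal sketch, assuming the `AVG_DIR` and `REV_TRAIN_DIR` paths defined in the script above; adjust the paths if yours differ.

```python
# Sketch only: compare variable names in the averaged vs. training checkpoints.
import os
import tensorflow as tf  # TF 1.12

root = os.path.expanduser("~/back_translate")
avg_dir = os.path.join(root, "t2t_avg/translate_enzh_wmt32k_rev/transformer-transformer_base")
train_dir = os.path.join(root, "t2t_train/translate_enzh_wmt32k_rev/transformer-transformer_base")

def checkpoint_var_names(ckpt_dir):
    ckpt = tf.train.latest_checkpoint(ckpt_dir)
    return ckpt, {name for name, _ in tf.train.list_variables(ckpt)}

avg_ckpt, avg_vars = checkpoint_var_names(avg_dir)
train_ckpt, train_vars = checkpoint_var_names(train_dir)

print("averaged checkpoint:", avg_ckpt)
print("training checkpoint:", train_ckpt)
# Names present in the training checkpoint but absent from the averaged one,
# typically the .../Adam and .../Adam_1 optimizer slots.
for name in sorted(train_vars - avg_vars):
    print("missing from averaged checkpoint:", name)
```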
```
# Error logs:
Key training/transformer/symbol_modality_32788_512/softmax/weights_0/Adam not found in checkpoint
	 [[node save/RestoreV2_1 (defined at ~/anaconda2/envs/t2t_upgrading/lib/python3.5/site-packages/tensor2tensor/utils/trainer_lib.py:454) = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_1/tensor_names, save/RestoreV2_1/shape_and_slices)]]
```
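The missing key has both a `training/` scope prefix and an `/Adam` optimizer-slot suffix. As a quick check (not from the original report), the checkpoint reader can show which of those the checkpoint being restored actually contains; the path below assumes the `AVG_DIR` passed as `--output_dir` above.

```python
# Probe the exact key from the error message in the checkpoint being restored.
import os
import tensorflow as tf  # TF 1.12

ckpt = tf.train.latest_checkpoint(os.path.expanduser(
    "~/back_translate/t2t_avg/translate_enzh_wmt32k_rev/transformer-transformer_base"))
reader = tf.train.load_checkpoint(ckpt)

missing_key = "training/transformer/symbol_modality_32788_512/softmax/weights_0/Adam"
candidates = [
    missing_key,                                                # key exactly as reported
    missing_key[len("training/"):],                             # same key without the scope prefix
    "transformer/symbol_modality_32788_512/softmax/weights_0",  # the underlying weight itself
]
for name in candidates:
    print(reader.has_tensor(name), name)
```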
Unfortunately, the error can also occur during training: when I interrupt training and then resume it, the same error sometimes appears.
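For what it's worth, when a restore op insists on optimizer slot variables that a checkpoint (often an averaged one) no longer contains, one generic workaround is to write a padded copy of the checkpoint with zero-initialized Adam slots and point `--output_dir` at that copy. The sketch below is only that generic idea, not an official tensor2tensor fix; the paths are placeholders, and it only handles the `/Adam` and `/Adam_1` suffixes (Adam's `beta1_power`/`beta2_power` variables, or a `training/` scope mismatch, would still need separate handling).

```python
# Sketch: copy a checkpoint and add zero-filled Adam slot variables for every
# float weight that lacks them, so a training-mode Saver can restore it.
import os
import numpy as np
import tensorflow as tf  # TF 1.12

src_dir = os.path.expanduser(
    "~/back_translate/t2t_avg/translate_enzh_wmt32k_rev/transformer-transformer_base")
dst_dir = os.path.join(src_dir, "padded")  # hypothetical output location
os.makedirs(dst_dir, exist_ok=True)

reader = tf.train.load_checkpoint(tf.train.latest_checkpoint(src_dir))
shapes = reader.get_variable_to_shape_map()
dtypes = reader.get_variable_to_dtype_map()

with tf.Graph().as_default():
    new_vars = []
    for name in shapes:
        value = reader.get_tensor(name)
        new_vars.append(tf.Variable(value, name=name, dtype=dtypes[name]))
        # Add missing optimizer slots only for float weights that are not slots themselves.
        if dtypes[name] == tf.float32 and not name.endswith(("/Adam", "/Adam_1")):
            for slot in ("/Adam", "/Adam_1"):
                if name + slot not in shapes:
                    new_vars.append(tf.Variable(np.zeros_like(value), name=name + slot))
    saver = tf.train.Saver(var_list=new_vars)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver.save(sess, os.path.join(dst_dir, "model.ckpt"))
```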
I have encountered the same problem.