Changl14 opened this issue 5 years ago (status: Open)
### Description

Hi, I'm hitting an error when decoding with the `translate_enzh_wmt32k_rev` problem. Training with the reversed problem name runs fine, but decoding is interrupted by the checkpoint-restore error below. ...
### Environment information

```
OS: Ubuntu 16.04

$ pip freeze | grep tensor
tensorflow-gpu==1.12.0
tensor2tensor==1.12.0

$ python -V
Python 3.6
```
### For bugs: reproduction and error logs

```bash
# Steps to reproduce:
#!/bin/bash
GPU=0
PROBLEM=translate_enzh_wmt32k
MODEL=transformer
HPARAMS=transformer_base
REV_PROBLEM=${PROBLEM}_rev

root_path=~/back_translate
DATA_DIR=${root_path}/t2t_data/$PROBLEM
TMP_DIR=${root_path}/t2t_datagen/$PROBLEM
TRAIN_DIR=${root_path}/t2t_train/$PROBLEM/$MODEL-$HPARAMS
REV_TRAIN_DIR=${root_path}/t2t_train/$REV_PROBLEM/$MODEL-$HPARAMS
USR_DIR=${root_path}/usr_dir
AVG_DIR=$root_path/t2t_avg/$REV_PROBLEM/$MODEL-$HPARAMS

mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR $REV_TRAIN_DIR
mkdir -p $USR_DIR $AVG_DIR

## Generate data
#t2t-datagen \
#  --data_dir=$DATA_DIR \
#  --tmp_dir=$TMP_DIR \
#  --problem=$PROBLEM \
#  --t2t_usr_dir=${USR_DIR}
#
## Train
## * If you run out of memory, add --hparams='batch_size=1024'.
#CUDA_VISIBLE_DEVICES=${GPU} t2t-trainer \
#  --data_dir=$DATA_DIR \
#  --problem=$REV_PROBLEM \
#  --model=$MODEL \
#  --hparams_set=$HPARAMS \
#  --output_dir=$REV_TRAIN_DIR \
#  --t2t_usr_dir=${USR_DIR} \
#  --train_steps=300000 \
#  --random_seed=65 \
#  --worker_gpu=2 \
#  --schedule=train

#CUDA_VISIBLE_DEVICES=0 t2t-avg-all \
#  --model_dir=$REV_TRAIN_DIR \
#  --output_dir=$AVG_DIR \
#  --n=20

DECODE_FILE=$root_path/decoder/monoling_0008.token
##
BEAM_SIZE=4
ALPHA=0.6

CUDA_VISIBLE_DEVICES=${GPU} t2t-decoder \
  --data_dir=$DATA_DIR \
  --problem=$REV_PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$AVG_DIR \
  --decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" \
  --decode_from_file=$DECODE_FILE \
  --decode_to_file=$DECODE_FILE.en
```
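Not part of the original report, but one way to narrow this down: since the decoder restores from `$AVG_DIR`, it may help to check whether `t2t-avg-all` dropped variables (such as the Adam optimizer slots) that the restore op still expects. Below is a minimal sketch, assuming the `AVG_DIR` and `REV_TRAIN_DIR` paths defined in the script above; adjust the paths if yours differ.

```python
# Sketch only: compare variable names in the averaged vs. training checkpoints.
import os
import tensorflow as tf  # TF 1.12

root = os.path.expanduser("~/back_translate")
avg_dir = os.path.join(root, "t2t_avg/translate_enzh_wmt32k_rev/transformer-transformer_base")
train_dir = os.path.join(root, "t2t_train/translate_enzh_wmt32k_rev/transformer-transformer_base")

def checkpoint_var_names(ckpt_dir):
    ckpt = tf.train.latest_checkpoint(ckpt_dir)
    return ckpt, {name for name, _ in tf.train.list_variables(ckpt)}

avg_ckpt, avg_vars = checkpoint_var_names(avg_dir)
train_ckpt, train_vars = checkpoint_var_names(train_dir)

print("averaged checkpoint:", avg_ckpt)
print("training checkpoint:", train_ckpt)
# Names present in the training checkpoint but absent from the averaged one,
# typically the .../Adam and .../Adam_1 optimizer slots.
for name in sorted(train_vars - avg_vars):
    print("missing from averaged checkpoint:", name)
```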
```
# Error logs:
Key training/transformer/symbol_modality_32788_512/softmax/weights_0/Adam not found in checkpoint
	 [[node save/RestoreV2_1 (defined at ~/anaconda2/envs/t2t_upgrading/lib/python3.5/site-packages/tensor2tensor/utils/trainer_lib.py:454) = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_1/tensor_names, save/RestoreV2_1/shape_and_slices)]]
```
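The missing key has both a `training/` scope prefix and an `/Adam` optimizer-slot suffix. As a quick check (not from the original report), the checkpoint reader can show which of those the checkpoint being restored actually contains; the path below assumes the `AVG_DIR` passed as `--output_dir` above.

```python
# Probe the exact key from the error message in the checkpoint being restored.
import os
import tensorflow as tf  # TF 1.12

ckpt = tf.train.latest_checkpoint(os.path.expanduser(
    "~/back_translate/t2t_avg/translate_enzh_wmt32k_rev/transformer-transformer_base"))
reader = tf.train.load_checkpoint(ckpt)

missing_key = "training/transformer/symbol_modality_32788_512/softmax/weights_0/Adam"
candidates = [
    missing_key,                                                # key exactly as reported
    missing_key[len("training/"):],                             # same key without the scope prefix
    "transformer/symbol_modality_32788_512/softmax/weights_0",  # the underlying weight itself
]
for name in candidates:
    print(reader.has_tensor(name), name)
```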
Unfortunately, the error can also occur during training: when I interrupt training and then resume it, the same error sometimes appears.
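For what it's worth, when a restore op insists on optimizer slot variables that a checkpoint (often an averaged one) no longer contains, one generic workaround is to write a padded copy of the checkpoint with zero-initialized Adam slots and point `--output_dir` at that copy. The sketch below is only that generic idea, not an official tensor2tensor fix; the paths are placeholders, and it only handles the `/Adam` and `/Adam_1` suffixes (Adam's `beta1_power`/`beta2_power` variables, or a `training/` scope mismatch, would still need separate handling).

```python
# Sketch: copy a checkpoint and add zero-filled Adam slot variables for every
# float weight that lacks them, so a training-mode Saver can restore it.
import os
import numpy as np
import tensorflow as tf  # TF 1.12

src_dir = os.path.expanduser(
    "~/back_translate/t2t_avg/translate_enzh_wmt32k_rev/transformer-transformer_base")
dst_dir = os.path.join(src_dir, "padded")  # hypothetical output location
os.makedirs(dst_dir, exist_ok=True)

reader = tf.train.load_checkpoint(tf.train.latest_checkpoint(src_dir))
shapes = reader.get_variable_to_shape_map()
dtypes = reader.get_variable_to_dtype_map()

with tf.Graph().as_default():
    new_vars = []
    for name in shapes:
        value = reader.get_tensor(name)
        new_vars.append(tf.Variable(value, name=name, dtype=dtypes[name]))
        # Add missing optimizer slots only for float weights that are not slots themselves.
        if dtypes[name] == tf.float32 and not name.endswith(("/Adam", "/Adam_1")):
            for slot in ("/Adam", "/Adam_1"):
                if name + slot not in shapes:
                    new_vars.append(tf.Variable(np.zeros_like(value), name=name + slot))
    saver = tf.train.Saver(var_list=new_vars)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver.save(sess, os.path.join(dst_dir, "model.ckpt"))
```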
I have encountered the same problem.