tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

Language model perplexity #861

Open fciannella opened 6 years ago

fciannella commented 6 years ago

Description

This is a question, not an issue.

I want to set up a language model using the One Billion Word benchmark (lm1b). These are my training parameters:

PROBLEM=languagemodel_lm1b32k
USR_DIR=/home/.../tensor2tensor/data_generators
REMOTE_DATA_DIR=lm1b32k_data
TMP_DIR=/tmp/t2t_tmp_lm1b32k
BUCKET=...
MODEL=transformer
HPARAMS=transformer_base
OUTDIR=lm1b32k_out
mkdir -p $TMP_DIR
REMOTE_DECODE_DIR=lm1b32k_decode
DECODE_FILE_NAME=decode_this.txt

And this is how I train and decode:

t2t-trainer \
 --data_dir=gs://${BUCKET}/${REMOTE_DATA_DIR} \
 --t2t_usr_dir=$USR_DIR \
 --problem=$PROBLEM \
 --model=$MODEL \
 --hparams_set=$HPARAMS \
 --output_dir=gs://${BUCKET}/${OUTDIR} \
 --cloud_mlengine --worker_gpu=4
DECODE_FILE=${TMP_DIR}/${DECODE_FILE_NAME}
echo "Hello world, this is just for testing" >> $DECODE_FILE
gsutil cp -r ${DECODE_FILE} gs://${BUCKET}/${REMOTE_DECODE_DIR}/${DECODE_FILE_NAME}
BEAM_SIZE=4
ALPHA=0.6

t2t-decoder \
  --data_dir=gs://${BUCKET}/${REMOTE_DATA_DIR} \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=gs://${BUCKET}/${OUTDIR} \
  --decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" \
  --decode_from_file=gs://${BUCKET}/${REMOTE_DECODE_DIR}/${DECODE_FILE_NAME}

When I decode, I only get a newly generated sentence, but I would like to get the perplexity of the input sentence. Is there a way to do that?
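
For reference, by perplexity I mean the usual definition for a sentence of N tokens w_1, ..., w_N under a model p (this is the standard formula, not something specific to tensor2tensor; the symbols are just my notation):

```latex
\mathrm{PPL}(w_1,\dots,w_N)
  = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\left(w_i \mid w_1,\dots,w_{i-1}\right)\right)
```

So any way of getting the per-token log-probabilities (or their mean) for a given input sentence would be enough.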

Environment information

OS: <your answer here>

$ pip freeze | grep tensor
tensor2tensor==1.6.3
tensorboard==1.8.0
tensorflow==1.8.0
tensorflow-serving-api-python3==1.7.0

$ python -V
Python 3.5.2

I also see that this was already asked before (#212), but the answer there points to a test case that is no longer available.

xu-song commented 6 years ago

+1

hl312 commented 5 years ago

Excuse me, the link given in https://github.com/tensorflow/tensor2tensor/issues/212 no longer works. Could you point to another way, or another URL, that shows how to compute the perplexity of a new sentence? Thanks.

azagsam commented 5 years ago

I have found a score_file flag in the t2t_decoder.py script: https://github.com/tensorflow/tensor2tensor/blob/abbd929558dd29115acc9d0f035f1efddb45566d/tensor2tensor/bin/t2t_decoder.py#L58

It calculates perplexity for each line in a file.
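
In case the number written for each line turns out to be a mean per-token negative log-likelihood in nats rather than a perplexity (I have not verified this, so check the script at the linked commit), converting it is a single exponential. Below is a minimal sketch of that conversion. The function names and the one-float-per-line file layout are my own assumptions for illustration, not part of tensor2tensor.

```python
import math


def perplexity_from_mean_nll(mean_nll):
    """Perplexity from a mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_nll)


def perplexity_from_token_logprobs(logprobs):
    """Perplexity from the natural-log probability of each token in a sentence."""
    if not logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(logprobs) / len(logprobs))


def perplexities_from_score_file(path):
    """Read one score per line (assumed layout) and convert each to a perplexity.

    Assumes every non-empty line holds a single float that is a mean
    per-token negative log-likelihood in nats.
    """
    with open(path) as f:
        return [perplexity_from_mean_nll(float(line)) for line in f if line.strip()]
```

As a sanity check, a model that assigns probability 0.25 to every token should give a perplexity of 4, and a score of 0 corresponds to a perplexity of 1. If the scores are in bits rather than nats, use 2 ** score instead of math.exp(score).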