Closed (mehmedes closed this issue 7 years ago)
Follow the instructions of this README.md to add a data generator for your problem.
Oh ok. Thank you!! I overlooked that. I'll give it a try.
I tried to follow the steps in the README.md but didn't manage to import my own data by creating a new data generator. In the end, I used wmt_ende_bpe32k, replaced all the WMT corpus data with my own files, and named my files according to the file names in wmt16_en_de.tar.gz. This clumsy workaround helped me to create an engine based on my own data. I know this is actually not a tensor2tensor issue but due to my lack of expertise. My request may be redundant to many; nevertheless, I would greatly appreciate it if setting up new training configurations could be made more accessible.
Thank you for your support and outstanding work!
Yes, we realized it's a bit too hard now and we're working on it. Keeping this open to keep track, once the new way is done and documented, we'll ask you to give it a try :).
Hi,
I tested the wmt_ende_bpe32k problem. This problem requires us to download this data first.
My questions are:
1. How is vocab.bpe.32000 organized? I found the code only uses this vocab.
2. Does vocab.bpe.32000 include both source and target vocab?
3. How can I use my own data with a *_bpe_* problem?
Thank you so much!
Hi @cshanbo,
The dataset wmt16_en_de.tar.gz refers to the pre-built dataset as described in this seq2seq tutorial. That documentation also features a script to prepare the training data on your own. Based on the seq2seq tutorial, the script I used for creating data that fulfils the specs of wmt_ende_bpe32k looks like this:
set -e
BASE_DIR="$HOME/seq2seq"
OUTPUT_DIR=${OUTPUT_DIR:-$HOME/tensortest/output}
echo "Writing to ${OUTPUT_DIR}. To change this, set the OUTPUT_DIR environment variable."
# Tokenize data
for f in ${OUTPUT_DIR}/*.de; do
echo "Tokenizing $f..."
${OUTPUT_DIR}/mosesdecoder/scripts/tokenizer/tokenizer.perl -q -l de -threads 8 < $f > ${f%.*}.tok.de
done
for f in ${OUTPUT_DIR}/*.en; do
echo "Tokenizing $f..."
${OUTPUT_DIR}/mosesdecoder/scripts/tokenizer/tokenizer.perl -q -l en -threads 8 < $f > ${f%.*}.tok.en
done
# Clean all corpora
for f in ${OUTPUT_DIR}/*.en; do
fbase=${f%.*}
echo "Cleaning ${fbase}..."
${OUTPUT_DIR}/mosesdecoder/scripts/training/clean-corpus-n.perl $fbase de en "${fbase}.clean" 1 80
done
# Create vocabulary for EN data
$BASE_DIR/bin/tools/generate_vocab.py \
--max_vocab_size 50000 \
< ${OUTPUT_DIR}/train.tok.clean.en \
> ${OUTPUT_DIR}/vocab.50k.en
# Create vocabulary for DE data
$BASE_DIR/bin/tools/generate_vocab.py \
--max_vocab_size 50000 \
< ${OUTPUT_DIR}/train.tok.clean.de \
> ${OUTPUT_DIR}/vocab.50k.de
# Generate Subword Units (BPE)
# Clone Subword NMT
if [ ! -d "${OUTPUT_DIR}/subword-nmt" ]; then
git clone https://github.com/rsennrich/subword-nmt.git "${OUTPUT_DIR}/subword-nmt"
fi
# Learn Shared BPE
for merge_ops in 32000; do
echo "Learning BPE with merge_ops=${merge_ops}. This may take a while..."
cat "${OUTPUT_DIR}/train.tok.clean.de" "${OUTPUT_DIR}/train.tok.clean.en" | \
${OUTPUT_DIR}/subword-nmt/learn_bpe.py -s $merge_ops > "${OUTPUT_DIR}/bpe.${merge_ops}"
echo "Apply BPE with merge_ops=${merge_ops} to tokenized files..."
for lang in en de; do
for f in ${OUTPUT_DIR}/*.tok.${lang} ${OUTPUT_DIR}/*.tok.clean.${lang}; do
outfile="${f%.*}.bpe.${merge_ops}.${lang}"
${OUTPUT_DIR}/subword-nmt/apply_bpe.py -c "${OUTPUT_DIR}/bpe.${merge_ops}" < $f > "${outfile}"
echo ${outfile}
done
done
# Create vocabulary file for BPE
cat "${OUTPUT_DIR}/train.tok.clean.bpe.${merge_ops}.en" "${OUTPUT_DIR}/train.tok.clean.bpe.${merge_ops}.de" | \
${OUTPUT_DIR}/subword-nmt/get_vocab.py | cut -f1 -d ' ' > "${OUTPUT_DIR}/vocab.bpe.${merge_ops}"
done
This will create the necessary training, test, vocab and BPE files. Please note that the actual wmt16_en_de.tar.gz contains multiple test sets: newstest2009, newstest2011, newstest2012, newstest2013, newstest2014, newstest2015, newstest2016. You need to make sure you have as many files as the actual wmt16_en_de.tar.gz, and the filenames should not deviate.
I then tarred the files as wmt16_en_de.tar.gz, placed the archive in the /tmp/t2t_datagen/ folder, and ran:
PROBLEM=wmt_ende_bpe32k
MODEL=transformer
HPARAMS=transformer_base
DATA_DIR=$HOME/t2t_data
TMP_DIR=/tmp/t2t_datagen
TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS
mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR
# Generate data
t2t-datagen \
--data_dir=$DATA_DIR \
--tmp_dir=$TMP_DIR \
--num_shards=100 \
--problem=$PROBLEM
mv $TMP_DIR/vocab.bpe.32000 $DATA_DIR
# Train
# * If you run out of memory, add --hparams='batch_size=2048' or even 1024.
t2t-trainer \
--data_dir=$DATA_DIR \
--problems=$PROBLEM \
--model=$MODEL \
--hparams_set=$HPARAMS \
--output_dir=$TRAIN_DIR \
--worker_gpu=2
# Decode
DECODE_FILE=$DATA_DIR/decode_this.txt
echo "Hello world" >> $DECODE_FILE
echo "Goodbye world" >> $DECODE_FILE
BEAM_SIZE=4
ALPHA=0.6
t2t-trainer \
--data_dir=$DATA_DIR \
--problems=$PROBLEM \
--model=$MODEL \
--hparams_set=$HPARAMS \
--output_dir=$TRAIN_DIR \
--train_steps=0 \
--eval_steps=0 \
--decode_beam_size=$BEAM_SIZE \
--decode_alpha=$ALPHA \
--decode_from_file=$DECODE_FILE
cat $DECODE_FILE.$MODEL.$HPARAMS.beam$BEAM_SIZE.alpha$ALPHA.decodes
Hi @mehmedes Thank you very much for your detailed explanation!
I know how and why to create a BPE vocabulary, but I am curious why the source and target vocabs are merged into a single file.
It might be meaningful for some translation directions such as English-German, because these two languages might share some words. Doing this might reduce the size of the whole vocabulary.
In English-Chinese, for example, I'm not sure whether it's a good idea to merge them into a single one, or whether tensor2tensor supports separate BPE vocab settings.
Hi @cshanbo -- tensor2tensor doesn't force you to merge vocabs; we have 2 different ones for parsing, for example (--problems=wsj_parsing_tokens_16k).
But we found that using merged vocabs can have advantages for translation with wordpieces, even for Chinese. The reason is the copying of proper names, which occurs a lot. Say in the sentence "He lives in Sunnyvale." the word "Sunnyvale" gets split into "Sunny@@ vale" (2 wordpieces) in the source vocab. If it gets split into 3 pieces (e.g., "Sun@@ ny@@ vale") in the target vocab, then the model needs to learn to copy these particular 2 pieces into those 3. With a common vocab, it just needs to learn to copy 1-1. These are not big differences and might depend on vocab sizes, but wanted to let you know.
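To make the copy argument concrete, here is a tiny illustrative Python sketch; the segmentations are invented for illustration and not taken from any real vocab:

# Hypothetical segmentations of the copied name "Sunnyvale" (illustration only).
src_pieces_separate = ["Sunny@@", "vale"]        # source-side vocab: 2 pieces
tgt_pieces_separate = ["Sun@@", "ny@@", "vale"]  # target-side vocab: 3 pieces
shared_pieces = ["Sunny@@", "vale"]              # shared vocab: identical on both sides

# With separate vocabs the model must learn a 2-piece -> 3-piece rewrite just to copy the name.
print(src_pieces_separate, "->", tgt_pieces_separate)

# With a shared vocab the copy is piece-for-piece (1-1).
print(shared_pieces, "->", shared_pieces)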
Feel free to reopen if you have more questions!
Hi @lukaszkaiser, thanks a lot for the detailed explanation above. I tried to use the code in tensor2tensor with some custom data, and I started by creating a problem similar to wmt_ende_tokens_8k: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/wmt.py#L387
I do have some questions:
1) it seems that the function https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/generator_utils.py#L302 is used to generate the vocabulary - and it does that on both languages (source/target). This confirms - as you said before - that we can mix the two vocabularies into a single file. My question though: by mistake, I generated a vocabulary only based on the source language - still, the model trained/decoded (with not stellar performance, but still producing meaningful output). So I was wondering how, in this case, the model could produce the output tokens (am I missing some magic somewhere?).
2) from your explanation above, it seems that there is no code in tensor2tensor to generate a BPE based vocabulary (indeed, you used a mix of shell/Perl scripts to do so - and then TokenTextEncoder to load the vocabulary as it is). Is that correct?
3) assuming I am right on 2, I can see that the usual way (in tensor2tensor) to generate a vocab file is using SubwordTextEncoder - which seems to generate a vocab with a size close to the target one. I was wondering what is the difference between this approach and BPE - given they seem to do something similar here: i.e., SubwordTextEncoder will split the less frequent tokens so that it can keep the vocabulary size small - if requested - while BPE also tries to find subword units to handle rare words.
Thanks a lot again, Mirko
I generated a vocabulary only based on the source language - still, the model trained/decoded
If the two languages share the same alphabet, it is probable that most words in the target language can be broken into subwords (possibly single-character or two-character subwords) trained only from the source language. Of course, this is very suboptimal.
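As a toy sketch of why that works, assume a greedy longest-match segmenter and a subword inventory learned only from the source side (the function and the inventory below are invented for illustration, not tensor2tensor's actual algorithm):

def greedy_segment(word, subwords):
    """Greedily split `word` into the longest pieces found in `subwords`,
    falling back to single characters when nothing longer matches."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in subwords or j == i + 1:  # a single character is always allowed
                pieces.append(piece)
                i = j
                break
    return pieces

# Subword inventory learned only from English-side text (hypothetical).
source_subwords = {"hand", "hands", "shoe", "sch", "uh", "er", "ing"}
# A German word unseen at vocab-building time can still be encoded, just less naturally.
print(greedy_segment("handschuh", source_subwords))  # -> ['hands', 'c', 'h', 'uh'] with this toy inventory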
there is no code in tensor2tensor to generate a BPE based vocabulary
what is the difference between this approach and BPE
SubwordTextEncoder generates wordpieces, which are almost the same as BPE: BPE uses a special end-of-word character, while wordpieces use a special start-of-word character.
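For illustration only (the exact markers differ between implementations, and tensor2tensor's internal escaping has its own details), the two conventions on the same split of "lower" might look like this:

# BPE (Sennrich et al.): a marker on the END of each word.
bpe_style = ["low", "er</w>"]

# Wordpiece-style: a marker on the START of each word (GNMT-style underscore).
wordpiece_style = ["_low", "er"]

# Both conventions are trivially invertible back to the original text.
print("".join(bpe_style).replace("</w>", " ").strip())     # -> lower
print("".join(wordpiece_style).replace("_", " ").strip())  # -> lower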
Let's answer your points one-by-one.
(1) As Martin already mentioned, our SubwordTextEncoder will split anything into pieces. In fact we make sure that any vocabulary can encode any string: if it's not among the words, it'll be split into pieces, and if that doesn't work, then into Unicode characters. If even that doesn't work (e.g., your vocab was English but you're doing Chinese), then we have escape sequences on the byte level -- your characters will be split into bytes. If your model is strong enough, it'll train on that and output reasonable things -- but it'll be better if the vocabulary matches at least mostly.
(2) Yes. Our SubwordTextEncoder is very similar to BPE, but not exactly the same -- we make sure it's invertible and that all things can be encoded with escape sequences (see above). We don't have an exact copy of BPE, but SubwordTextEncoder was sufficient for most things so far.
(3) As you say, SubwordTextEncoder and BPE are very similar. There are slight differences in many parts though: how the words are split at first, which subword tokens are chosen, how to escape outside characters, things like that. But the principle is very, very similar.
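A rough sketch of how this is typically used in code; the method names below are from tensor2tensor's text_encoder module as I understand them at the time of writing, the file names are hypothetical, so double-check against your installed version:

from collections import Counter
from tensor2tensor.data_generators import text_encoder

# Count tokens in your own training corpus (the real pipeline uses
# tensor2tensor's tokenizer; a whitespace Counter is enough to show the idea).
token_counts = Counter()
with open("train.combined.txt") as f:          # hypothetical combined source+target file
    for line in f:
        token_counts.update(line.strip().split())

# Build a subword vocabulary of roughly 8k pieces by binary-searching the
# minimum subtoken count between 1 and 1000, then save it for later use.
encoder = text_encoder.SubwordTextEncoder.build_to_target_size(
    8192, token_counts, 1, 1000)
encoder.store_to_file("vocab.my_problem.8192.subwords")   # hypothetical file name

# Any string is encodable (unknown words fall back to smaller pieces, characters,
# and finally byte-level escapes), and decode() inverts encode().
ids = encoder.encode("Sunnyvale is sunny.")
print(encoder.decode(ids))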
Hope that helps!
Thanks @lukaszkaiser and @martinpopel .
Indeed, I can confirm that the models basically learned to decode one letter at a time. Results were interesting in the sense that they were correct up to a few tokens - and then they were just truncated (which is expected, given how many steps it takes to produce even a few tokens letter by letter).
Hi @lukaszkaiser and @martinpopel , I'd have two follow up questions about the SubwordTextEncoder:
(1) could you confirm that the target vocab size is the main parameter to decide how many subtokens we want? In short, (it seems to me that) with a very big vocab size, I basically end up with all the tokens + letters/digits, while with a small one, I get the most frequent tokens, some (few) subtokens, and letters/digits.
(2) w.r.t. how the subtokenizer works, is this example (of an OOV) correct?
'counter-intuitive' => 'counter', '-', 'intuitive'
'counter', '-', 'intuitive' => 3 embeddings (which go into the encoder)
the decoder will then produce some tokens, e.g., 'contre', '-', 'intuitif'
Assuming this is correct, which part of the code reattaches the last 3 tokens, e.g., 'contre', '-', 'intuitif' => 'contre-intuitif'?
(3) assuming (1) is correct, it would seem that if you specify a big enough target vocabulary size (enough to cover all tokens + the letters/digits), the letters/digits embeddings are basically never trained (given you have all the tokens in the vocabulary - and you never split a token into letters/digits). Is that correct?
(4) in the case where a problem is somehow easier to handle with just tokens and an OOV symbol (no subtokens / letters / digits), is it correct to use TokenTextEncoder instead of SubwordTextEncoder (even if I am not sure it supports OOV - like, a special symbol for OOV)?
Thanks again, Mirko
Dear Mirko,
I don't think assumption 2 is correct; it looks like you are trying to lemmatize the tokens there. BPE, or the subword model as implemented here, is (I believe) more like hierarchical agglomerative clustering: you first break all occurrences in your FreqDist(data) into characters, so initially you have a lot of single characters. At each step you take the most frequent adjacent pair, merge it into a new symbol, and add it to your vocab, and you keep merging one pair at a time until you reach your vocab size. Therefore, if you have a super large vocab, it will just be the same as your max possible vocab, where each word itself is also a token, plus its sub-forms. So when there is an OOV, the subword encoder should just choose the most frequent and longest segments to form it; with a smaller vocab you get something like "count er - in tuit ive". I suspect it can also skip the initial FreqDist(data) step; I did not look in detail into the implementation here, so please correct me if I am wrong.
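For reference, the merge loop described above is essentially the algorithm from Sennrich et al.'s BPE paper; here is a minimal Python sketch of it (not tensor2tensor's SubwordTextEncoder, which differs in the details discussed earlier):

import re
from collections import Counter

def get_stats(vocab):
    """Count adjacent symbol pairs over a {'w o r d </w>': freq} vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge every occurrence of the chosen symbol pair into a single new symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words start out split into characters plus an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):                    # number of merge operations (32000 in the script above)
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # the most frequent adjacent pair becomes a new symbol
    vocab = merge_vocab(best, vocab)
print(vocab)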
Thanks @colmantse - I agree with you - the example was misleading. Indeed, BPE is more of a bottom-up process (like HAC, as you pointed out) than a top-down breaking process.
Apart from this clarification though, I still have doubts about points 1 / 3 / 4 above.
Thanks, Mirko
Ah, sorry, it was silly of me to misread (2).
Best, Colman
Is it possible to run the Walkthrough example from the website with other data than WMT?
I've tried changing the data paths in wmt.py, but when I run the example with the new paths, it still downloads the WMT data...