rsennrich opened this issue 6 years ago
Are you using your regular BPE? With the out-of-the-box subwords script, the vocab.endefr.32768 file is as follows (EDIT: I added spaces to the first two entries because GitHub was not displaying them properly):
'< pad >' '< EOS >' ',' '.' 'the' 'de' '' ''' 'in' 'of'
They seem to refer to regular BPE.
My BPE vocab also starts with , and . like Rico's, but I didn't experience any issues with not including <pad> and <EOS>. I trained my BPE model back on T2T 1.0.11.
I'm using the problem translate_ende_wmt_bpe32k, which out-of-the-box downloads and uses vocab.bpe.32000 (in v1.2.4).
@Rico: Just wondering how you managed to get a translation result in your second example using a BPE model. The input has only been tokenized, not preprocessed with BPE. When using the BPE model, I think T2T requires you to tokenize as well as apply BPE before submitting text for translation, and then to postprocess the inference result by reverting BPE and detokenizing.
Does this problem still exist in 1.2.5? We tried to correct the reserved tokens in the vocabs, but maybe there's still something missing?
@mehmedes: the preprocessed test sets are in /tmp/t2t_datagen (or wherever you downloaded wmt16_en_de.tar.gz), for instance newstest2013.tok.bpe.32000.en. From a quick look at the data, the files were (probably) preprocessed like this:
/path/to/mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l en | \
/path/to/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en | \
/path/to/subword-nmt/apply_bpe.py -c bpe.32000
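For the reverse direction mentioned above (turning inference output back into plain text), here is a minimal sketch. It assumes subword-nmt's "@@ " continuation markers and uses the sacremoses package for detokenization; neither the package nor the example string is something shipped with the tarball, so adjust to your own setup:

import re
from sacremoses import MosesDetokenizer  # assumption: separate package, not part of T2T

detok = MosesDetokenizer(lang="de")

def postprocess(line):
    # Revert BPE: drop the "@@ " joiners (and a trailing "@@") added by apply_bpe.py
    line = re.sub(r"@@ |@@$", "", line)
    # Detokenize the Moses-tokenized text back to plain text
    return detok.detokenize(line.split())

# Illustrative (made-up) BPE-segmented output
print(postprocess("das amerikanische demokratische System wird teilweise unter@@ graben ."))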
@lukaszkaiser: there's been no change to the vocabs or the handling of reserved tokens between 1.2.4 and 1.2.5, so yes, the problem still persists.
I was asking because your input showed a clean sentence:
INFO:tensorflow:Inference results INPUT: In this sense , the measures will partially undermine the American democratic system .
I would have expected the input to be more like
INFO:tensorflow:Inference results INPUT: In th@@ sen@@ etc.
Coincidentally, all words in this sentence happen to be frequent in the training data.
The behavior of TokenTextEncoder with regard to reserved tokens depends on whether it is constructed from a file or from a list. If it is constructed from a file, the file is expected to include the reserved tokens (i.e. the first line is <pad> and the second line is <EOS>) if there are any. If it is constructed from a list, the reserved tokens are added automatically. The reason is that if you initialize from a list and then call store_to_file, the resulting file will include the reserved tokens.
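A minimal sketch of that distinction (the constructor arguments follow tensor2tensor's text_encoder module, but exact signatures may differ between versions, so treat this as illustrative):

from tensor2tensor.data_generators import text_encoder

# From a list: <pad> and <EOS> are prepended automatically, so "the" gets id 2.
enc_from_list = text_encoder.TokenTextEncoder(None, vocab_list=["the", "de", "in"])

# From a file: the file itself is expected to start with <pad> and <EOS>;
# whatever happens to be on the first two lines will occupy ids 0 and 1.
enc_from_file = text_encoder.TokenTextEncoder("vocab.bpe.32000")

# Storing a list-built encoder writes the reserved tokens to the file,
# which is why file-based construction expects them to be present.
enc_from_list.store_to_file("vocab.with_reserved")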
I am using the latest version without any problems, having adopted TokenTextEncoder and external subword units. BTW, has anyone tested which is better for the NMT task, the internal wordpiece or an external subword model?
What do you mean by wordpiece and subword? Do you mean BPE and T2T's default?
@colmantse Yes, that's exactly what I meant. It seems wordpiece has better control over vocabulary size.
BPE or T2T's default? I find T2T's default better on en-zh.
Good to know. Have you processed the Chinese sentences with segmentation?
I didn't preprocess them. Judging by performance, it works just fine.
I get poor results for translate_ende_wmt_bpe32k (this may be related to issues #317 and #309), and I believe this is due to reserved tokens not being respected by TokenTextEncoder and by the provided vocabulary file.
The first 10 lines of vocab.bpe.32000 look like this:
Note that "," and "." get assigned the indices 0 and 1, conflicting with the reserved tokens.
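One hypothetical way to produce a vocab with the reserved tokens prepended (not necessarily the exact modification used for the run below; the file names are illustrative):

# Prepend <pad> and <EOS> so the real BPE tokens start at id 2.
reserved = ["<pad>", "<EOS>"]
with open("vocab.bpe.32000") as f:
    tokens = [line.rstrip("\n") for line in f]
with open("vocab.bpe.32000.fixed", "w") as f:
    f.write("\n".join(reserved + tokens) + "\n")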
I started a new training run with a modified vocab:
and this gives better results (model hasn't converged yet):
I'd suggest that TokenTextEncoder actually reserve the first two integers for the reserved tokens and start counting from 2, but I realize this would break compatibility with trained models. Alternatively, you could fix the vocabularies of the pre-defined problems; I don't know how many are affected.
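A minimal sketch of the suggested behavior, assuming a plain token-to-id mapping (this is not what the current TokenTextEncoder does when reading from a file):

RESERVED = ["<pad>", "<EOS>"]

def build_token_to_id(vocab_lines):
    # Reserved tokens take ids 0 and 1; tokens from the file start at 2.
    token_to_id = {tok: i for i, tok in enumerate(RESERVED)}
    for i, tok in enumerate(vocab_lines):
        token_to_id[tok] = i + len(RESERVED)
    return token_to_id

# With the shipped vocab, "," and "." would map to 2 and 3 instead of
# colliding with <pad>=0 and <EOS>=1.
print(build_token_to_id([",", ".", "the", "de"]))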