tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

TokenTextEncoder does not respect reserved tokens #365

Open rsennrich opened 6 years ago

rsennrich commented 6 years ago

I get poor results for translate_ende_wmt_bpe32k (this may be related to issues #317 and #309), and I believe this is due to a problem with the reserved tokens not being respected by TokenTextEncoder and the provided vocabulary file.

INFO:tensorflow:Inference results INPUT: In this sense , the measures will partially undermine the American democratic system . INFO:tensorflow:Inference results OUTPUT: In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In

{'outputs': array([68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68], dtype=int32), 'problem_choice': 0, 'inputs': array([[ 68], [ 35], [ 2196], [ 0], [ 2], [ 651], [ 55], [18587], [15840], [ 2], [ 1874], [ 1763], [ 260], [ 1], [ 1], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0]], dtype=int32)}

The first 10 lines of vocab.bpe.32000 look like this:

,
.
the
in
of
and
die
der
to
und

Note that "," and "." get assigned the indices 0 and 1, conflicting with the reserved tokens.

I started a new training run with a modified vocab:

<pad>
<EOS>
,
.
the
in
of
and
die
der

and this gives better results (model hasn't converged yet):

INFO:tensorflow:Inference results INPUT: In this sense , the measures will partially undermine the American democratic system . INFO:tensorflow:Inference results OUTPUT: In diesem Sinne wird die Maßnahmen in der Lage sein , das amerikanische System zu bekämpfen .

{'outputs': array([ 54, 11958, 11, 10408, 164, 231, 940, 92, 2802, 9, 12051, 7944, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32), 'problem_choice': 0, 'inputs': array([[24820], [ 22], [ 3467], [ 8753], [ 111], [ 322], [ 48], [ 4], [ 229], [ 10], [ 4995], [14095], [ 7298], [ 3], [ 1], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0]], dtype=int32)}

I'd suggest that TokenTextEncoder actually reserve the first two integers for the reserved tokens and start counting from 2, but I realize this would break compatibility with trained models. Alternatively, you could fix the vocabularies of the pre-defined problems; I don't know how many are affected.
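For reference, a minimal sketch of that workaround, i.e. prepending the two reserved tokens to an existing BPE vocab file before training (the helper and file names are hypothetical, not part of t2t):

# Hypothetical helper: prepend the reserved tokens so that "," and "."
# no longer collide with ids 0 (<pad>) and 1 (<EOS>).
RESERVED = ["<pad>", "<EOS>"]

def prepend_reserved_tokens(src_path, dst_path):
    with open(src_path, encoding="utf-8") as src:
        tokens = [line.rstrip("\n") for line in src]
    if tokens[:2] != RESERVED:  # avoid duplicating them
        tokens = RESERVED + tokens
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(tokens) + "\n")

prepend_reserved_tokens("vocab.bpe.32000", "vocab.bpe.32000.fixed")

With the reserved tokens on the first two lines, the real BPE entries start at id 2, matching what the model expects for padding and EOS.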

vince62s commented 6 years ago

Are you using your regular BPE? With the out-of-the-box subwords script, the vocab.endefr.32768 file looks as follows:

'<pad>' '<EOS>' ',' '.' 'the' 'de' '' ''' 'in' 'of'

mehmedes commented 6 years ago

They seem to be referring to regular BPE. My BPE vocab also starts with "," and "." like Rico's, but I didn't experience any issues with not including <pad> and <EOS>. I trained my BPE model back on T2T 1.0.11.

rsennrich commented 6 years ago

I'm using the problem translate_ende_wmt_bpe32k , which out-of-the-box downloads and uses vocab.bpe.32000 (in v1.2.4).

mehmedes commented 6 years ago

@Rico: Just wondering how you managed to get a translation result in your second example using a BPE model. The input has only been tokenized, not preprocessed with BPE. When using a BPE model, I think T2T requires you to tokenize as well as apply BPE before submitting text for translation, and to postprocess the inference result by reverting BPE and detokenizing.

lukaszkaiser commented 6 years ago

Does this problem still exist in 1.2.5? We tried to correct the reserved tokens in the vocabs, but maybe there's still something missing?

rsennrich commented 6 years ago

@mehmedes: the preprocessed test sets are in /tmp/t2t_datagen (or wherever you downloaded wmt16_en_de.tar.gz), for instance newstest2013.tok.bpe.32000.en. From a quick look at the data, the files were (probably) preprocessed like this:

/path/to/mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l en | \
/path/to/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en | \
/path/to/subword-nmt/apply_bpe.py -c bpe.32000
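The reverse direction (undoing BPE on the decoder output before detokenizing) isn't shown above; a minimal sketch, assuming the standard subword-nmt "@@ " continuation markers:

import re

def revert_bpe(line):
    # Join subword pieces by removing the "@@ " continuation markers
    # (and a trailing "@@" if the line ends mid-word).
    return re.sub(r"(@@ )|(@@ ?$)", "", line)

print(revert_bpe("In th@@ is sen@@ se , ..."))  # -> "In this sense , ..."

Detokenization (e.g. with the Moses detokenizer.perl) would still follow this step.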

@lukaszkaiser : there's been no change to the vocabs or the handling of reserved tokens between 1.2.4 and 1.2.5 - yes, the problem still persists.

mehmedes commented 6 years ago

I was asking because your input showed a clean sentence:

INFO:tensorflow:Inference results INPUT: In this sense , the measures will partially undermine the American democratic system .

I would have expected the input to be more like

INFO:tensorflow:Inference results INPUT: In th@@ sen@@ etc. 
rsennrich commented 6 years ago

Coincidentally, all words in this sentence happen to be frequent in the training data, so BPE leaves them unsegmented and the plain tokenized input happens to be valid BPE input as well.

rsepassi commented 6 years ago

The behavior of TokenTextEncoder with regard to reserved tokens depends on whether it is constructed from a file or from a list. If it is constructed from a file, it's expected that the file already includes the reserved tokens (i.e. the first line is <pad> and the second line is <EOS>), if there are any. If it's constructed from a list, the reserved tokens are added automatically. The reason is that if you initialize from a list and then call store_to_file, the written file will include the reserved tokens.
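A sketch of the two code paths, assuming the constructor signature in text_encoder.py (vocab filename first, optional vocab_list keyword):

from tensor2tensor.data_generators import text_encoder

# From a list: the reserved tokens are prepended automatically,
# so real tokens start at id 2 (0 = <pad>, 1 = <EOS>).
enc = text_encoder.TokenTextEncoder(None, vocab_list=[",", ".", "the", "in"])
print(enc.encode(", ."))  # ids start at 2, e.g. [2, 3]

# store_to_file writes <pad> and <EOS> on the first two lines ...
enc.store_to_file("/tmp/vocab.txt")

# ... so reloading from that file reproduces the same ids. A raw BPE vocab
# whose first two lines are "," and "." would instead map them to 0 and 1,
# colliding with the padding and EOS ids.
enc_from_file = text_encoder.TokenTextEncoder("/tmp/vocab.txt")
print(enc_from_file.encode(", ."))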

lkluo commented 6 years ago

I am using the latest version without the reported problem, having adopted TokenTextEncoder with an external subword vocabulary. BTW, has anyone tested which is better for the NMT task, the internal wordpiece or an external subword vocabulary?

colmantse commented 6 years ago

What do you mean by wordpiece and subword? Do you mean BPE and t2t's default?

lkluo commented 6 years ago

@colmantse Yes, that's exactly what I meant. It seems wordpiece has better control over vocabulary size.
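For context, the internal wordpiece vocabulary is built by t2t's SubwordTextEncoder; a rough sketch of how a target vocabulary size is approached, assuming the build_to_target_size classmethod in text_encoder.py (toy counts stand in for real corpus statistics):

import collections
from tensor2tensor.data_generators import text_encoder

# Toy token counts standing in for counts collected over the training corpus.
token_counts = collections.Counter(
    "in this sense the measures will partially undermine "
    "the american democratic system".split())

# Binary-searches a minimum subtoken count so the generated vocabulary
# gets as close as it can to the requested target size; the reserved
# tokens are handled internally by the encoder.
encoder = text_encoder.SubwordTextEncoder.build_to_target_size(
    100, token_counts, min_val=1, max_val=1000)
print(encoder.vocab_size)

External BPE, by contrast, fixes the number of merge operations rather than the final vocabulary size.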

colmantse commented 6 years ago

bpe or t2t's default? I find t2t's default better on en-zh.

lkluo commented 6 years ago

Good to know. Have you processed the Chinese sentences with word segmentation?

colmantse commented 6 years ago

I didn't preprocess them. Judging by performance, it works just fine.
