tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

TokenTextEncoder does not respect reserved tokens #365

Open rsennrich opened 6 years ago

rsennrich commented 6 years ago

I get poor results for translate_ende_wmt_bpe32k (this may be related to issues #317 and #309), and I believe this is due to a problem with the reserved tokens not being respected by TokenTextEncoder and the provided vocabulary file.

INFO:tensorflow:Inference results INPUT: In this sense , the measures will partially undermine the American democratic system . INFO:tensorflow:Inference results OUTPUT: In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In

{'outputs': array([68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68], dtype=int32), 'problem_choice': 0, 'inputs': array([[ 68], [ 35], [ 2196], [ 0], [ 2], [ 651], [ 55], [18587], [15840], [ 2], [ 1874], [ 1763], [ 260], [ 1], [ 1], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0]], dtype=int32)}

The first 10 lines of vocab.bpe.32000 look like this:

,
.
the
in
of
and
die
der
to
und

Note that "," and "." get assigned the indices 0 and 1, conflicting with the reserved tokens.

I started a new training run with a modified vocab:

<pad>
<EOS>
,
.
the
in
of
and
die
der

and this gives better results (model hasn't converged yet):

INFO:tensorflow:Inference results INPUT: In this sense , the measures will partially undermine the American democratic system . INFO:tensorflow:Inference results OUTPUT: In diesem Sinne wird die Maßnahmen in der Lage sein , das amerikanische System zu bekämpfen .

{'outputs': array([ 54, 11958, 11, 10408, 164, 231, 940, 92, 2802, 9, 12051, 7944, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32), 'problem_choice': 0, 'inputs': array([[24820], [ 22], [ 3467], [ 8753], [ 111], [ 322], [ 48], [ 4], [ 229], [ 10], [ 4995], [14095], [ 7298], [ 3], [ 1], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0], [ 0]], dtype=int32)}

I'd suggest that TokenTextEncoder actually reserve the first two integers for the reserved tokens and start counting from 2, but I realize this would break compatibility with trained models. Alternatively, you could fix the vocabularies of the pre-defined problems; I don't know how many are affected.
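For reference, a minimal sketch of that workaround, i.e. prepending the two reserved tokens to an existing BPE vocab file before training (the helper and file names are hypothetical, not part of t2t):

# Hypothetical helper: prepend the reserved tokens so that "," and "."
# no longer collide with ids 0 (<pad>) and 1 (<EOS>).
RESERVED = ["<pad>", "<EOS>"]

def prepend_reserved_tokens(src_path, dst_path):
    with open(src_path, encoding="utf-8") as src:
        tokens = [line.rstrip("\n") for line in src]
    if tokens[:2] != RESERVED:  # avoid duplicating them
        tokens = RESERVED + tokens
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(tokens) + "\n")

prepend_reserved_tokens("vocab.bpe.32000", "vocab.bpe.32000.fixed")

With the reserved tokens on the first two lines, the real BPE entries start at id 2, matching what the model expects for padding and EOS.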

vince62s commented 6 years ago

Are you using your regular BPE? With the out-of-the-box subwords script, the vocab.endefr.32768 file looks as follows:

'<pad>' '<EOS>' ',' '.' 'the' 'de' '' ''' 'in' 'of'

mehmedes commented 6 years ago

They seem to be referring to regular BPE. My BPE vocab also starts with "," and "." like Rico's, but I didn't experience any issues with not including <pad> and <EOS>. I trained my BPE model back on T2T 1.0.11.

rsennrich commented 6 years ago

I'm using the problem translate_ende_wmt_bpe32k , which out-of-the-box downloads and uses vocab.bpe.32000 (in v1.2.4).

mehmedes commented 6 years ago

@Rico: Just wondering how you managed to get a translation result in your second example using a BPE model. The input has only been tokenized, not preprocessed with BPE. When using a BPE model, I think T2T requires you to tokenize as well as apply BPE before submitting text for translation, and to postprocess the inference result by reverting BPE and detokenizing.

lukaszkaiser commented 6 years ago

Does this problem still exist in 1.2.5? We tried to correct the reserved tokens in the vocabs, but maybe there's still something missing?

rsennrich commented 6 years ago

@mehmedes: the preprocessed test sets are in /tmp/t2t_datagen (or wherever you downloaded wmt16_en_de.tar.gz), for instance newstest2013.tok.bpe.32000.en. From a quick look at the data, the files were (probably) preprocessed like this:

/path/to/mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l en | \
/path/to/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en | \
/path/to/subword-nmt/apply_bpe.py -c bpe.32000
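The reverse direction (undoing BPE on the decoder output before detokenizing) isn't shown above; a minimal sketch, assuming the standard subword-nmt "@@ " continuation markers:

import re

def revert_bpe(line):
    # Join subword pieces by removing the "@@ " continuation markers
    # (and a trailing "@@" if the line ends mid-word).
    return re.sub(r"(@@ )|(@@ ?$)", "", line)

print(revert_bpe("In th@@ is sen@@ se , ..."))  # -> "In this sense , ..."

Detokenization (e.g. with the Moses detokenizer.perl) would still follow this step.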

@lukaszkaiser : there's been no change to the vocabs or the handling of reserved tokens between 1.2.4 and 1.2.5 - yes, the problem still persists.

mehmedes commented 6 years ago

I was asking because your input showed a clean sentence:

INFO:tensorflow:Inference results INPUT: In this sense , the measures will partially undermine the American democratic system .

I would have expected the input to be more like

INFO:tensorflow:Inference results INPUT: In th@@ sen@@ etc. 
rsennrich commented 6 years ago

Coincidentally, all words in this sentence happen to be frequent in the training data, so BPE leaves them unsegmented and the plain tokenized input happens to be valid BPE input as well.

rsepassi commented 6 years ago

The behavior of TokenTextEncoder with regard to reserved tokens depends on whether it is constructed from a file or from a list. If it is constructed from a file, it's expected that the file already includes the reserved tokens (i.e. the first line is <pad> and the second line is <EOS>), if there are any. If it's constructed from a list, the reserved tokens are added automatically. The reason is that if you initialize from a list and then call store_to_file, the written file will include the reserved tokens.
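A sketch of the two code paths, assuming the constructor signature in text_encoder.py (vocab filename first, optional vocab_list keyword):

from tensor2tensor.data_generators import text_encoder

# From a list: the reserved tokens are prepended automatically,
# so real tokens start at id 2 (0 = <pad>, 1 = <EOS>).
enc = text_encoder.TokenTextEncoder(None, vocab_list=[",", ".", "the", "in"])
print(enc.encode(", ."))  # ids start at 2, e.g. [2, 3]

# store_to_file writes <pad> and <EOS> on the first two lines ...
enc.store_to_file("/tmp/vocab.txt")

# ... so reloading from that file reproduces the same ids. A raw BPE vocab
# whose first two lines are "," and "." would instead map them to 0 and 1,
# colliding with the padding and EOS ids.
enc_from_file = text_encoder.TokenTextEncoder("/tmp/vocab.txt")
print(enc_from_file.encode(", ."))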

lkluo commented 6 years ago

I am using the latest version without the reported problem, having adopted TokenTextEncoder with an external subword vocabulary. BTW, has anyone tested which is better for the NMT task, the internal wordpiece or an external subword vocabulary?

colmantse commented 6 years ago

What do you mean by wordpiece and subword? Do you mean BPE and t2t's default?

lkluo commented 6 years ago

@colmantse Yes, that's exactly what I meant. It seems wordpiece has better control over vocabulary size.
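For context, the internal wordpiece vocabulary is built by t2t's SubwordTextEncoder; a rough sketch of how a target vocabulary size is approached, assuming the build_to_target_size classmethod in text_encoder.py (toy counts stand in for real corpus statistics):

import collections
from tensor2tensor.data_generators import text_encoder

# Toy token counts standing in for counts collected over the training corpus.
token_counts = collections.Counter(
    "in this sense the measures will partially undermine "
    "the american democratic system".split())

# Binary-searches a minimum subtoken count so the generated vocabulary
# gets as close as it can to the requested target size; the reserved
# tokens are handled internally by the encoder.
encoder = text_encoder.SubwordTextEncoder.build_to_target_size(
    100, token_counts, min_val=1, max_val=1000)
print(encoder.vocab_size)

External BPE, by contrast, fixes the number of merge operations rather than the final vocabulary size.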

colmantse commented 6 years ago

bpe or t2t's default? I find t2t's default better on en-zh.

lkluo commented 6 years ago

Good to know. Have you processed the Chinese sentences with word segmentation?

colmantse commented 6 years ago

I didn't preprocess them. Judging by performance, it works just fine.
