Open jestjest opened 6 years ago
Do you want a custom subword vocabulary (SubwordTextEncoder) or word vocabulary (TokenTextEncoder)?
See https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/text_encoder.py
Just word vocabulary (and I have my own pad/unk/eos token strings).
So let's use the TokenTextEncoder with the parameter replace_oov pointing to your UNK symbol.
Its source code is simple, so check it to see how RESERVED_TOKENS (PAD and EOS) are handled depending on the parameters and whether you load the vocabulary from a list or from a file.
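To make the idea concrete, here is a minimal pure-Python sketch of a word-level encoder that puts reserved tokens first and maps unknown words to an UNK symbol. It is a stand-in written for illustration, not T2T's TokenTextEncoder: the class name WordEncoder, the token strings "<pad>", "<EOS>" and "<unk>", and the choice to skip reserved tokens already present in the list are all assumptions. Check text_encoder.py for what the real class does with a list versus a file.

```python
# Assumed reserved token strings; substitute your own pad/eos strings.
PAD, EOS = "<pad>", "<EOS>"
RESERVED_TOKENS = [PAD, EOS]


class WordEncoder:
    """Toy whitespace-token encoder (hypothetical, not the T2T class)."""

    def __init__(self, vocab_list, replace_oov=None):
        # Reserved tokens get the first ids; drop them from the user list
        # if they are repeated there, so ids stay unique.
        tokens = RESERVED_TOKENS + [t for t in vocab_list if t not in RESERVED_TOKENS]
        self.token_to_id = {tok: i for i, tok in enumerate(tokens)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}
        # The UNK symbol itself must be in the vocabulary.
        if replace_oov is not None and replace_oov not in self.token_to_id:
            raise ValueError("replace_oov token is missing from the vocabulary")
        self.replace_oov = replace_oov

    def encode(self, sentence):
        tokens = sentence.strip().split()
        if self.replace_oov is not None:
            # Map every out-of-vocabulary word to the UNK symbol.
            tokens = [t if t in self.token_to_id else self.replace_oov for t in tokens]
        return [self.token_to_id[t] for t in tokens]

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)


encoder = WordEncoder(["<unk>", "we", "are", "happy"], replace_oov="<unk>")
ids = encoder.encode("we are very happy")
print(ids)                  # [3, 4, 2, 5]
print(encoder.decode(ids))  # we are <unk> happy
```

Without replace_oov, the word "very" would raise a KeyError in this sketch, which is why the UNK symbol has to be part of the vocabulary itself.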
@martinpopel Hi, I have a similar problem regarding OOV: I have a bilingual word file containing source words and their target translations. When decoding a source sentence, I want to use this file so that words in the sentence that also appear in my word file are translated into the listed target words.
For example, for English-German machine translation:
I want to translate a sentence: "We are not happy with the decision of Commission."
My trained model would give me this, which is still great: "Wir sind mit der Entscheidung der Commission nicht zufrieden."
But my word file has the src-tgt pair (Commission, Kommission), so I want the translation to be like this: "Wir sind mit der Entscheidung der Kommission nicht zufrieden."
Does this problem have something to do with OOV? Thank you in advance.
This issue was about a vocabulary for segmentation into tokens. You want something else - a custom dictionary with forced translation pairs. There is no out-of-the-box solution for this in T2T.
A simple but naive solution is to add the custom translation pairs to the training data. However, this most probably won't help with translation of full sentences.
Another solution is to post-process the translations using word alignments (which are not produced by T2T, and heuristically guessing them from multi-head cross-attention weights is problematic).
In conclusion, a reliable solution for custom dictionaries requires a lot of work. (Imagine that instead of "Kommission" there were a different forced translation with a different morphological gender; you would then also need to change the rest of the translation, including the article "der".)
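As a rough illustration of the post-processing idea, the sketch below replaces target tokens that are aligned to source words found in the custom dictionary. It assumes the word alignment is already available as (source index, target index) pairs from some external aligner. The alignment here is written by hand for the example sentence, and the function name apply_dictionary is hypothetical. It does nothing about agreement, so it has exactly the weakness described above.

```python
def apply_dictionary(src_tokens, tgt_tokens, alignment, dictionary):
    """Overwrite target tokens aligned to source words listed in `dictionary`.

    alignment: iterable of (src_index, tgt_index) pairs, assumed to come
    from an external word aligner (T2T does not produce it).
    """
    out = list(tgt_tokens)
    for src_i, tgt_i in alignment:
        forced = dictionary.get(src_tokens[src_i])
        if forced is not None:
            out[tgt_i] = forced
    return out


src = "We are not happy with the decision of Commission .".split()
tgt = "Wir sind mit der Entscheidung der Commission nicht zufrieden .".split()
# Hand-written alignment for illustration only.
alignment = [(0, 0), (1, 1), (2, 7), (3, 8), (4, 2),
             (5, 3), (6, 4), (7, 5), (8, 6), (9, 9)]
dictionary = {"Commission": "Kommission"}

result = apply_dictionary(src, tgt, alignment, dictionary)
print(" ".join(result))
# Wir sind mit der Entscheidung der Kommission nicht zufrieden .
```

If a source word is aligned to several target tokens, this naive version overwrites each of them, which is one more reason a real solution needs more care.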
Hi @martinpopel, I have a similar question for a speech recognition model. How can I add a custom vocabulary so that an already trained T2T model recognizes specific words?
Thanks
@bharat-patidar I have no experience with speech recognition, sorry.
Hi @martinpopel, I want to use a custom subword vocabulary, so do I need to use SubwordTextEncoder? I'm confused because if I'm not wrong, when we use BPE, we just use TokenTextEncoder and add the BPE vocabulary there.
If I use a custom subword vocabulary, do I also need to apply any pre-tokenization on my dataset?
Thanks!
Regarding the forced-translation question above (getting "Kommission" instead of "Commission" from a bilingual word file): you may consider constrained decoding, which does exactly what you want to do.
Are there any helpful posts or requirements into how to use tensor2tensor with a custom vocabulary? It's for a translation problem.
For example, do we need to include <pad> and <EOS> as the first two lines in the vocabulary file, and UNK at the end?
I'm following the example from https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/translate_ende.py and it seems like generate_samples will create a vocabulary file from a temporary one already? Thank you.