tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.

Apache License 2.0

[question] - using custom vocabulary #850

Open jestjest opened 6 years ago

jestjest commented 6 years ago

Are there any helpful posts or documented requirements on how to use tensor2tensor with a custom vocabulary? It's for a translation problem.

For example, do we need to include <pad> and <EOS> as the first two lines in the vocabulary file, and UNK at the end?

I'm following the example from https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/translate_ende.py and it seems like generate_samples will create a vocabulary file from a temporary one already?

Thank you.

martinpopel commented 6 years ago

Do you want a custom subword vocabulary (SubwordTextEncoder) or word vocabulary (TokenTextEncoder)? See https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/text_encoder.py

jestjest commented 6 years ago

Just word vocabulary (and I have my own pad/unk/eos token strings).

martinpopel commented 6 years ago

So let's use the TokenTextEncoder with parameter replace_oov pointing to your UNK symbol. Its source code is simple, so check it to see how RESERVED_TOKENS (PAD and EOS) are handled depending on the parameters and whether you load the vocabulary from a list or from a file.
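A minimal sketch of the behaviour hinted at above, written as a pure-Python stand-in rather than the real class. It assumes, from a reading of text_encoder.py that should be checked against your installed version, that a vocabulary passed as a list gets the reserved tokens prepended automatically, while a vocabulary file is taken as-is and so must already start with them. The class name ToyTokenEncoder, the vocab_lines argument and the "<unk>" string are hypothetical and exist only for this illustration.

```python
# Illustrative stand-in, NOT tensor2tensor code. It mimics how
# TokenTextEncoder appears to handle reserved tokens and OOV words.
PAD, EOS = "<pad>", "<EOS>"
RESERVED_TOKENS = [PAD, EOS]  # expected to occupy ids 0 and 1


class ToyTokenEncoder:
    def __init__(self, vocab_list=None, vocab_lines=None, replace_oov=None):
        if vocab_list is not None:
            # From a list: reserved tokens are prepended, and any copies
            # of them inside the list are skipped (assumed behaviour).
            tokens = RESERVED_TOKENS + [
                t for t in vocab_list if t not in RESERVED_TOKENS
            ]
        else:
            # From a file: lines are used as-is, so the file itself must
            # begin with the reserved tokens (assumed behaviour).
            tokens = [line.strip() for line in vocab_lines]
        self.replace_oov = replace_oov
        self.token_to_id = {tok: i for i, tok in enumerate(tokens)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}

    def encode(self, sentence):
        tokens = sentence.strip().split()
        if self.replace_oov is not None:
            # Map every out-of-vocabulary word to the UNK symbol.
            tokens = [
                t if t in self.token_to_id else self.replace_oov
                for t in tokens
            ]
        return [self.token_to_id[t] for t in tokens]

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)


# Vocabulary given as a list: <pad>/<EOS> are added for us.
from_list = ToyTokenEncoder(
    vocab_list=["<unk>", "hello", "world"], replace_oov="<unk>"
)

# Vocabulary given as file lines: <pad>/<EOS> must already be on top.
from_file = ToyTokenEncoder(
    vocab_lines=["<pad>\n", "<EOS>\n", "hello\n", "world\n", "<unk>\n"],
    replace_oov="<unk>",
)

print(from_list.encode("hello there world"))  # [3, 2, 4]
print(from_file.encode("hello there world"))  # [2, 4, 3]
```

If this reading is right, the answer to the original question would be: yes for a vocabulary file (put the pad and EOS strings on the first two lines), and the UNK symbol can sit anywhere as long as replace_oov names it. Note that the toy uses T2T's default reserved strings; custom pad/eos strings would need the same positions.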

EthannyDing commented 5 years ago

@martinpopel Hi, I have a similar problem regarding OOV: I have a bilingual word file containing source words and their target translations. When decoding a source sentence, I want to use this file so that words in the sentence that also appear in my word file are translated into the listed target words.

For example, for English-German machine translation:

I want to translate a sentence: "We are not happy with the decision of Commission."

My trained model would give me this, which is still great: "Wir sind mit der Entscheidung der Commission nicht zufrieden."

But my word file has the src-tgt pair (Commission, Kommission), so I want the translation to be like this: "Wir sind mit der Entscheidung der Kommission nicht zufrieden."

Does this problem have something to do with OOV? Thank you in advance.

martinpopel commented 5 years ago

This issue was about a vocabulary for segmentation into tokens. You want something else - a custom dictionary with forced translation pairs. There is no out-of-the-box solution for this in T2T.

A simple but naive solution is to add the custom translation pairs to the training data. However, this most probably won't help with translation of full sentences.

Another solution is to post-process the translations using word alignments. However, T2T does not produce word alignments, and heuristically guessing them from multi-head cross-attention weights is problematic.

In conclusion, a reliable solution for custom dictionaries requires a lot of work. (Imagine that instead of "Kommission" there were a different forced translation with a different grammatical gender, so you would also need to change the rest of the translation, including the article "der".)
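To make the post-processing idea concrete, here is a deliberately naive sketch that needs no word alignments. It only covers the special case from the example, where the model copied the source word unchanged into the output. The function name force_copied_terms and the term_pairs glossary are hypothetical, and the sketch ignores the agreement problem mentioned above (articles, inflection), so treat it as an illustration of the limits rather than a solution.

```python
import re


def force_copied_terms(source, translation, term_pairs):
    """Replace glossary terms that were copied untranslated into the output.

    Only acts when a term occurs in the source AND appears verbatim in the
    translation. Terms the model translated differently are left alone,
    because fixing those would need word alignments.
    """
    fixed = translation
    for src_term, tgt_term in term_pairs.items():
        pattern = r"\b" + re.escape(src_term) + r"\b"
        if re.search(pattern, source) and re.search(pattern, fixed):
            # A function replacement keeps tgt_term literal (no escapes).
            fixed = re.sub(pattern, lambda match: tgt_term, fixed)
    return fixed


source = "We are not happy with the decision of Commission."
translation = "Wir sind mit der Entscheidung der Commission nicht zufrieden."
term_pairs = {"Commission": "Kommission"}  # hypothetical glossary

fixed = force_copied_terms(source, translation, term_pairs)
print(fixed)  # Wir sind mit der Entscheidung der Kommission nicht zufrieden.
```

This works here only because "Commission" survived verbatim and "Kommission" happens to take the same article. A forced term with a different gender would leave "der" wrong.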

bharat-patidar commented 4 years ago

Hi @martinpopel, I have a similar question for a speech recognition model. How can I add a custom vocabulary so that an already trained T2T model recognizes specific words?

Thanks

martinpopel commented 4 years ago

@bharat-patidar I have no experience with speech recognition, sorry.

ghost commented 4 years ago

Hi @martinpopel, I want to use a custom subword vocabulary, so do I need to use SubwordTextEncoder? I'm confused because if I'm not wrong, when we use BPE, we just use TokenTextEncoder and add the BPE vocabulary there.

If I use a custom subword vocabulary, do I also need to apply any pre-tokenization on my dataset?

Thanks!

lkluo commented 3 years ago

@EthannyDing You may consider constrained decoding, which does exactly what you want to do.