tensorflow / nmt

TensorFlow Neural Machine Translation Tutorial
Apache License 2.0

does it support character level embedding? #152

Open zxu7 opened 7 years ago

zxu7 commented 7 years ago

If yes, how do I turn it on during training?

oahziur commented 6 years ago

@zxu7 The code doesn't have a character level embedding option.

However, you may tokenize data at character level, and prepare a character level vocab file to train a character model with the codebase.
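As a rough sketch of what preparing such a vocab file could look like (not code from the nmt repository): the helper names `build_char_vocab` and `write_vocab` are hypothetical, the `~` placeholder for spaces is an arbitrary choice, and the three leading special tokens are an assumption about what the codebase expects at the top of a vocab file. Spaces are mapped to a visible placeholder because tokens in the data files are themselves separated by spaces.

```python
from collections import Counter

# Placeholder standing in for a real space. This is an assumption: pick any
# character that never occurs in your corpus.
SPACE_SYMBOL = "~"
# Special tokens assumed to be expected at the top of the vocab file.
SPECIAL_TOKENS = ["<unk>", "<s>", "</s>"]


def build_char_vocab(sentences, space_symbol=SPACE_SYMBOL):
    """Return the special tokens followed by characters, most frequent first."""
    counts = Counter()
    for sentence in sentences:
        # Counter.update on a string counts each character separately.
        counts.update(sentence.strip().replace(" ", space_symbol))
    # Sort by descending frequency, then by character for a stable order.
    chars = sorted(counts, key=lambda c: (-counts[c], c))
    return SPECIAL_TOKENS + chars


def write_vocab(path, vocab):
    """Write one token per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")
```

Calling `write_vocab("vocab.src", build_char_vocab(lines))` on the source-side sentences would then produce a file with one character per line (the file name here is only an example).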

AlkaSaliss commented 6 years ago

Hi @oahziur, I'm trying to train a character-level NMT model, so I built a vocabulary consisting of characters: [image: char_vocab]. The problem is that the space character is also part of my vocabulary, and if I add it to the vocab file I get an error: [image: vocab_error]. I tried replacing the space with a special character that is not in my vocab, say ~, but during training the model keeps predicting the unknown token <unk>, which suggests a tokenization problem. How should I tokenize the data so that the model works with a character-level vocabulary? For now, my data is in the same format as for word-level models.

thanks!

oahziur commented 6 years ago

You need to make sure your code splits each sentence into characters instead of words. By default, the code splits sentences on spaces, which is what causes the problem for you.

https://www.tensorflow.org/api_docs/python/tf/string_split
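To illustrate why splitting on spaces leads to `<unk>` with a character vocabulary, here is a small pure-Python sketch. It does not use TensorFlow or the nmt code; `lookup` is a hypothetical stand-in for a vocabulary lookup, and the tiny vocabulary is made up for the example.

```python
def lookup(tokens, vocab, unk="<unk>"):
    """Map each token to itself if it is in the vocab, else to the unknown token."""
    return [tok if tok in vocab else unk for tok in tokens]


# A made-up character vocabulary, with ~ standing in for the space.
char_vocab = {"I", "e", "a", "t", "~"}
line = "I~eat"

# Splitting on spaces keeps the whole line as a single token,
# which is not an entry in the character vocabulary.
by_space = lookup(line.split(" "), char_vocab)

# Splitting into individual characters yields tokens the vocabulary can resolve.
by_char = lookup(list(line), char_vocab)

print(by_space)  # ['<unk>']
print(by_char)   # ['I', '~', 'e', 'a', 't']
```

So the data either has to be split per character inside the input pipeline, or be written to disk with a space between every character so that the default whitespace split already produces single characters.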


eswarjal09 commented 6 years ago

Hi @oahziur, to train it as a character-level model, should I change the delimiter from a space to an empty string in TensorFlow's string_split call, or is there a way to do that in the nmt code?

AlkaSaliss commented 6 years ago

Hi @eswarjal09. I ran into the same situation and worked around it with a trick. Since I didn't know which part of the script to change to make it split the data at the character level, I preprocessed my data like this:

  1. Replace every space in the data with a special symbol that I'm sure is not part of my vocabulary (say ~). A sentence like I eat food. becomes I~eat~food.
  2. Separate every character of each sentence with a space, so I~eat~food. becomes I ~ e a t ~ f o o d .

So to summarize, my data goes from this:

[image: gitnmt1]

to this:

[image: gitnmt2]

I then trained the model with the normal training procedure provided by tensorflow-nmt. To get back to word level, I take the inference output, remove the whitespace, and replace the ~ (or whatever special symbol you used) with spaces.

I am sure there is a better way to handle this by modifying the nmt scripts, but this could work as a temporary solution (at least it worked for me).
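The workaround described above can be sketched in a few lines of Python. This is only an illustration of the idea, not code from the nmt repository: the function names `to_char_level` and `from_char_level` are hypothetical, and ~ is just the example placeholder.

```python
# Placeholder for real spaces; it must not occur anywhere in the corpus.
SPACE_SYMBOL = "~"


def to_char_level(sentence, space_symbol=SPACE_SYMBOL):
    """Turn 'I eat food.' into 'I ~ e a t ~ f o o d .'."""
    if space_symbol in sentence:
        raise ValueError("placeholder %r already occurs in the data" % space_symbol)
    # Step 1: mark the real spaces. Step 2: put a space between every character.
    return " ".join(sentence.strip().replace(" ", space_symbol))


def from_char_level(line, space_symbol=SPACE_SYMBOL):
    """Undo to_char_level on a line of model output."""
    # Drop the separator spaces first, then restore the real spaces.
    return line.replace(" ", "").replace(space_symbol, " ")


encoded = to_char_level("I eat food.")
print(encoded)                   # I ~ e a t ~ f o o d .
print(from_char_level(encoded))  # I eat food.
```

Applying `to_char_level` to every line of the source and target files before training, and `from_char_level` to every line of the inference output, reproduces the steps above without touching the nmt scripts.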

shanalikhan commented 5 years ago

@AlkaSaliss were you able to generate character embeddings and run NMT at the character level?

AlkaSaliss commented 5 years ago

@shanalikhan Yes, I managed to get it to work with a character-level vocabulary. See my comment above. I'm not sure it is the best way, but it could work as a workaround.