Open zxu7 opened 7 years ago
@zxu7 The code doesn't have a character-level embedding option.
However, you can tokenize your data at the character level and prepare a character-level vocab file to train a character model with the codebase.
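As an illustration of the suggestion above, a character vocabulary could be collected roughly as follows. This is a minimal sketch and not part of the nmt codebase: build_char_vocab, write_vocab and the ~ space marker are hypothetical names chosen here. It assumes the vocab file holds one token per line with <unk>, <s> and </s> listed first. A literal space is mapped to a visible placeholder because a whitespace-delimited token cannot itself be a space.

```python
from collections import Counter

# Assumption: the training code expects these special tokens at the top of the vocab file.
SPECIAL_TOKENS = ["<unk>", "<s>", "</s>"]
# Hypothetical placeholder standing in for the space character.
SPACE_MARKER = "~"


def build_char_vocab(lines, space_marker=SPACE_MARKER):
    """Return the special tokens followed by every character seen in `lines`."""
    counts = Counter()
    for line in lines:
        for ch in line.rstrip("\n"):
            counts[space_marker if ch == " " else ch] += 1
    # Most frequent characters first; ties are broken alphabetically.
    chars = sorted(counts, key=lambda c: (-counts[c], c))
    return SPECIAL_TOKENS + chars


def write_vocab(vocab, path):
    """Write one token per line."""
    with open(path, "w", encoding="utf-8") as f:
        for token in vocab:
            f.write(token + "\n")
```

Calling build_char_vocab on the lines of the training corpus and passing the result to write_vocab would give a file usable as a character-level vocabulary, provided the placeholder does not already occur in the data.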
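Hi @oahziur ,
I'm trying to train a character-level NMT model, so I basically built a vocabulary consisting of characters: [image: char_vocab] https://user-images.githubusercontent.com/33458274/43587584-cfc9f01c-966a-11e8-8177-3e11b88a3ad8.png
The problem I'm facing is that the space character is also part of my vocabulary, but if I add it to the vocab file I get an error: [image: vocab_error] https://user-images.githubusercontent.com/33458274/43587807-591f0d66-966b-11e8-8d06-6c033e014f63.png
So I tried replacing the space with a special character that is not in my vocab, say ~. But during training the model keeps predicting the unknown token <unk>, implying that there may be a tokenization problem. My question is: how should I tokenize the data so that the model works with the character-level vocabulary?
For now I have left my data in the same format as for word-level models:
source file: [image: src_exple] https://user-images.githubusercontent.com/33458274/43588620-2eff7bea-966d-11e8-9c02-d1fbcc8bd34d.png
target file: [image: tgt_exple] https://user-images.githubusercontent.com/33458274/43588637-3915d0d4-966d-11e8-9fbd-94b558ebdafa.png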
thanks!
You need to make sure your code splits each sentence into characters instead of words. By default, the code splits sentences on spaces, which is what causes the problem you are seeing.
https://www.tensorflow.org/api_docs/python/tf/string_split
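To make the diagnosis concrete, here is a small pure-Python sketch (not the nmt code itself) of why splitting on spaces yields <unk> with a character vocabulary. The lookup helper and the toy vocabulary are hypothetical and only mimic a vocabulary-table lookup.

```python
def lookup(tokens, vocab, unk="<unk>"):
    """Map each token to itself if it is in the vocabulary, else to the unknown token."""
    return [tok if tok in vocab else unk for tok in tokens]


# Toy character vocabulary, with ~ standing in for the space character.
char_vocab = {"<unk>", "<s>", "</s>", "~", ".", "I", "a", "d", "e", "f", "o", "t"}

sentence = "I~eat~food."

# Splitting on spaces leaves the whole sentence as one token,
# and that token is not an entry in the character vocabulary.
word_tokens = sentence.split(" ")
print(lookup(word_tokens, char_vocab))  # ['<unk>']

# Splitting into characters gives tokens that are all in the vocabulary.
char_tokens = list(sentence)
print(lookup(char_tokens, char_vocab))
```

Either the splitting step has to produce single characters, or the data has to be stored with a space between every character so that the default space splitting already yields characters.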
Hi @oahziur , to train a character-level model, should I change the delimiter from a space to an empty string in TensorFlow's string_split, or is there a way to do that in the nmt code?
Hi @eswarjal09 . When I was confronted with a situation like yours, I solved it in a tricky way. Not knowing which part of the script to change in order to make it split the data at the character level, I preprocessed my data instead: I replaced each space with a special symbol (~ for example). Thus, a sentence like I eat food. becomes I~eat~food.
So to summarize, my data goes from this:
to this:
Then I trained the model with the normal training procedure provided by tensorflow-nmt. To revert to word level, I take the results from inference, remove the whitespace, and replace the ~ (or whatever special symbol you used) with spaces.
I am sure there is a better way to handle this by modifying the nmt scripts, but this could work as a temporary solution (at least it worked for me).
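The workaround described in this comment could be sketched as below. This is an assumption-laden illustration rather than the author's actual script: to_char_level and to_word_level are hypothetical names, and putting a space between every character is inferred from the reverting step, which removes whitespace before restoring the ~ symbols.

```python
SPACE_MARKER = "~"  # hypothetical placeholder; it must not occur in the real data


def to_char_level(sentence, marker=SPACE_MARKER):
    """Replace spaces with the marker, then put a space between every character."""
    if marker in sentence:
        raise ValueError("marker %r already occurs in the data" % marker)
    return " ".join(sentence.strip().replace(" ", marker))


def to_word_level(char_output, marker=SPACE_MARKER):
    """Undo to_char_level: drop the separating spaces, then restore real spaces."""
    return char_output.replace(" ", "").replace(marker, " ")


print(to_char_level("I eat food."))            # I ~ e a t ~ f o o d .
print(to_word_level("I ~ e a t ~ f o o d ."))  # I eat food.
```

With the data in this form, the default space splitting already produces one token per character, so no change to the training code should be needed. to_word_level is applied to the inference output afterwards.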
@AlkaSaliss were you able to generate character embeddings and run NMT at the character level?
@shanalikhan Yes, I managed to get it to work with a character-level vocabulary. See my comment above. I'm not sure it is the best way, but it could work as a workaround.
If yes, how do I turn it on during training?