marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

USE_SENTENCEPIECE has a bug. #350

Closed Chen1399 closed 3 years ago

Chen1399 commented 3 years ago

There is a bug with USE_SENTENCEPIECE in the encode step (sentencepiece_vocab.cpp). Encoding from token to id is wrong because the id comes from the vocabulary inside the spm file, not from vocab.yml, so the resulting id is incorrect. The input should be encoded to strings first, and each string should then be mapped to an id by the defaultVocab, which is built from vocab.yml. I'm not good at English; I hope you can understand.
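To make the reported mismatch concrete, here is a minimal sketch. It does not use Marian or the SentencePiece library: `kSpmVocab`, `kYmlVocab`, and `mapPieces` are hypothetical names, and both id tables are invented toy data standing in for the spm model and vocab.yml. It only shows that the same pieces can receive different ids depending on which table does the lookup. It is not a confirmation that Marian behaves this way.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

using Vocab = std::unordered_map<std::string, int>;

// Hypothetical id table stored inside the SentencePiece model (toy data).
const Vocab kSpmVocab = {
    {"<unk>", 0}, {"<s>", 1}, {"</s>", 2}, {"hello", 3}, {"world", 4}};

// Hypothetical id table loaded from vocab.yml (toy data): same pieces,
// different numbering.
const Vocab kYmlVocab = {
    {"</s>", 0}, {"<unk>", 1}, {"hello", 2}, {"world", 3}};

// Map already-segmented pieces (strings) to ids using the given vocabulary.
// Pieces missing from the vocabulary fall back to the id of "<unk>".
std::vector<int> mapPieces(const std::vector<std::string>& pieces,
                           const Vocab& vocab) {
  auto unk = vocab.find("<unk>");
  assert(unk != vocab.end() && "vocabulary must define <unk>");
  std::vector<int> ids;
  ids.reserve(pieces.size());
  for (const auto& piece : pieces) {
    auto it = vocab.find(piece);
    ids.push_back(it != vocab.end() ? it->second : unk->second);
  }
  return ids;
}
```

With these toy tables, the pieces `hello world` map to `{3, 4}` through `kSpmVocab` but to `{2, 3}` through `kYmlVocab`. The proposal above amounts to always doing the second lookup whenever the model was trained with vocab.yml ids.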

snukky commented 3 years ago

I'm not sure I understand what the problem is. Could you provide an example with a clear input and the expected and obtained outputs? Please also check whether the issue still exists in https://github.com/marian-nmt/marian-dev, where SentencePiece has been updated recently.