Closed: loretoparisi closed this issue 4 years ago
[UPDATE]
Considering the new Python API wrapper, subword-nmt is not necessary anymore; it would still be interesting to understand those differences, though!
Thanks a lot!
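For anyone landing here later, a minimal sketch of what using that wrapper looks like (assuming the bindings are installed with pip install fastBPE and the two LASER files linked further down in the thread have been downloaded locally):

# Sketch: apply the LASER codes/vocab through the fastBPE Python bindings.
import fastBPE

bpe = fastBPE.fastBPE("93langs.fcodes", "93langs.fvocab")
print(bpe.apply(["Hello world !"]))  # list of sentences in, list of BPE-encoded strings out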
How large is the dataset on which you learned the BPE codes? I believe the original implementation does not merge two BPE splits if the resulting word appears only once in the original corpus. This will make a difference for small training datasets (typically, if the word "gonna" appears only once in your training set), but otherwise there will be no difference.
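If that is the relevant setting, it should correspond to the min_frequency parameter of subword-nmt's learn_bpe, sketched below (the corpus path and the number of merges are placeholders, and attributing the behaviour to this parameter is only an assumption):

# Sketch: learning codes with subword-nmt's Python API. min_frequency=2 is the
# default, so a symbol pair that occurs only once in the corpus is never merged.
from subword_nmt.learn_bpe import learn_bpe

with open("train.txt", encoding="utf-8") as infile, \
        open("codes.subword", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=50000, min_frequency=2)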
@glample it's FAIR LASER :) So the codes and vocabulary are:
Codes: https://dl.fbaipublicfiles.com/laser/models/93langs.fcodes
Vocab: https://dl.fbaipublicfiles.com/laser/models/93langs.fvocab
root@3f40ea8e2cc4:/tornado_api# head -n10 /root/laser_models/93langs.fvocab
. 87264459
, 78156033
de 19001435
- 13731976
? 13338524
a 13062980
i 8917603
en 8272731
" 8258142
la 7623301
root@3f40ea8e2cc4:/tornado_api# head -n10 /root/laser_models/93langs.fcodes
e n 52708119
e r 51024442
e n</w> 47209692
a n 46619244
i n 44583543
s t 42633672
a r 34974160
o n 31941788
t i 30717853
d e 30509691
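In case it helps anyone reproduce this quickly, a small sketch for fetching the two files above (the target directory is simply the one used in my container):

# Sketch: download the LASER codes and vocabulary listed above.
import urllib.request

base = "https://dl.fbaipublicfiles.com/laser/models/"
for name in ("93langs.fcodes", "93langs.fvocab"):
    urllib.request.urlretrieve(base + name, "/root/laser_models/" + name)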
Why does subword-nmt's vocab have the "@@" label, but fastBPE's does not?
Closing this because my issue is solved. Not sure about @zpppy's question.
I have found that when loading the fastBPE codes and vocabulary with subword-nmt, I get a different result in the BPE codes.

Using fastBPE:

Using subword-nmt:

Both use the same codes and vocabulary, with minimal adaptation in the latter package. My understanding of BPE was that the two implementations should behave almost the same. I have asked the subword-nmt author as well: https://github.com/rsennrich/subword-nmt/issues/76
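For reference, here is a rough sketch of one way to feed the fastBPE codes to subword-nmt; dropping the frequency column is just one possible "minimal adaptation" (not necessarily the one I used), and it does not by itself explain the mismatch:

# Sketch: load the LASER/fastBPE codes into subword-nmt. Its codes format
# expects two symbols per line, so the third (frequency) column is dropped.
import io
from subword_nmt.apply_bpe import BPE

with open("93langs.fcodes", encoding="utf-8") as f:
    pairs = [" ".join(line.split(" ")[:2]) for line in f if line.strip()]

bpe = BPE(io.StringIO("\n".join(pairs)))
print(bpe.process_line("Hello world !"))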