glample / fastBPE

Fast BPE
MIT License

Differences with subword-nmt #13

Closed: loretoparisi closed this 4 years ago

loretoparisi commented 5 years ago

I have found that when I load fastBPE codes and vocabulary into subword-nmt, I get different BPE segmentations:

Using fastBPE

hoy quiero que te qu@@ ede &@@ apo@@ s@@ ; a dormir
this song is gonna make you mad

Using subword-nmt

ho@@ y qui@@ ero que te que@@ de &@@ apo@@ s@@ ; a dor@@ mir
th@@ is son@@ g is gon@@ na make you mad

Both runs use the same codes and vocabulary, with minimal adaptation for the latter package. My understanding of BPE was that the two implementations should behave almost identically. I have asked the subword-nmt author as well: https://github.com/rsennrich/subword-nmt/issues/76
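
For context, the "minimal adaptation" presumably amounts to reshaping the codes file, since subword-nmt expects one merge pair per line without fastBPE's frequency column (plus a "#version:" header in its newer format). A sketch of that conversion, assuming the LASER file names given below:

import io
from subword_nmt.apply_bpe import BPE

# fastBPE codes lines look like "e n 52708119" (pair + frequency);
# subword-nmt wants just the pair, optionally after a version header.
with open("93langs.fcodes", encoding="utf-8") as f:
    pairs = [" ".join(line.split(" ")[:2]) for line in f if line.strip()]

bpe = BPE(io.StringIO("#version: 0.2\n" + "\n".join(pairs)))
print(bpe.process_line("this song is gonna make you mad"))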

loretoparisi commented 5 years ago

[UPDATE] Given the new Python API wrapper, subword-nmt is not necessary anymore. That said, it would still be interesting to understand those differences!
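
For reference, the Python API wrapper can be used roughly like this (a minimal sketch; the paths are the LASER files from the comment below, and apply() expects whitespace-tokenized input):

import fastBPE

# Load the merge codes plus an optional vocabulary with frequency counts.
bpe = fastBPE.fastBPE("93langs.fcodes", "93langs.fvocab")

# apply() takes a list of tokenized sentences and returns the BPE-segmented
# versions, with "@@" marking non-final subword pieces.
print(bpe.apply(["hoy quiero que te quede &apos; a dormir",
                 "this song is gonna make you mad"]))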

Thanks a lot!

glample commented 5 years ago

How large is the dataset on which you learned the BPE codes? I believe the original implementation does not merge two BPE splits if the resulting word appears only once in the original corpus. This will make a difference for small training datasets (typically, if the word "gonna" appears only once in your training set), but otherwise there will be no difference.
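
To make that concrete, here is a toy illustration (not either library's actual code) of how a frequency cutoff changes segmentation, with made-up merges and counts:

def segment(word, merges, vocab, min_count=2):
    # Apply merges greedily in learned priority order, but skip any merge
    # whose resulting symbol falls below the frequency cutoff.
    symbols = list(word)
    for left, right in merges:
        if vocab.get(left + right, 0) < min_count:
            continue  # result is too rare: leave the pieces split
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]
            else:
                i += 1
    return symbols

merges = [("g", "o"), ("go", "n"), ("n", "a"), ("gon", "na")]
vocab = {"go": 120, "gon": 40, "na": 60, "gonna": 1}  # "gonna" seen once
print(segment("gonna", merges, vocab))  # -> ['gon', 'na']

With the cutoff disabled (min_count=1) the same call returns ['gonna'], which is exactly the kind of divergence visible in the two outputs above.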

loretoparisi commented 5 years ago

@glample it's FAIR LASER :) So the codes and vocabulary are:

Codes: https://dl.fbaipublicfiles.com/laser/models/93langs.fcodes
Vocab: https://dl.fbaipublicfiles.com/laser/models/93langs.fvocab

root@3f40ea8e2cc4:/tornado_api# head -n10 /root/laser_models/93langs.fvocab 
. 87264459
, 78156033
de 19001435
- 13731976
? 13338524
a 13062980
i 8917603
en 8272731
" 8258142
la 7623301
root@3f40ea8e2cc4:/tornado_api# head -n10 /root/laser_models/93langs.fcodes 
e n 52708119
e r 51024442
e n</w> 47209692
a n 46619244
i n 44583543
s t 42633672
a r 34974160
o n 31941788
t i 30717853
d e 30509691
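
Judging from the head output, both files are plain text: the fvocab has "word count" per line, and the fcodes has "left right count" per line, with "</w>" marking a word-final symbol. A minimal loader sketch under that assumption:

def load_vocab(path):
    # Each line: "<word> <count>"; the count is the last field.
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, count = line.rstrip("\n").rsplit(" ", 1)
            vocab[word] = int(count)
    return vocab

def load_codes(path):
    # Each line: "<left> <right> <count>"; earlier lines are higher-priority
    # merges, and a trailing "</w>" means the symbol ends a word.
    merges = {}
    with open(path, encoding="utf-8") as f:
        for rank, line in enumerate(f):
            left, right, _count = line.rstrip("\n").split(" ")
            merges[(left, right)] = rank
    return merges

vocab = load_vocab("/root/laser_models/93langs.fvocab")
codes = load_codes("/root/laser_models/93langs.fcodes")
print(len(vocab), len(codes))
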
zpppy commented 5 years ago

(quoting @loretoparisi's codes/vocab comment above)

Why does subword-nmt's vocab contain tokens with the "@@" label, while fastBPE's does not?

loretoparisi commented 4 years ago

Closing this, because it solved my issue. Not sure about @zpppy's question.