glample / fastBPE

Fast BPE
MIT License

Nothing happens after applybpe #24

Closed. kayoyin closed this issue 5 years ago.

kayoyin commented 5 years ago

Hello,

I am trying to use fastBPE for unsupervised NMT. After learning the codes with

./fast learnbpe 60000 ../data/cloze.txt ../data/natural.txt > codes 

when I call applybpe as below, the output I get in clozebpe is identical to ../data/cloze.txt.

This isn't the expected behavior, right? How do I split the words in my input text into subword units?

[Screenshot: the applybpe command and its output]
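For reference, fastBPE's documented applybpe interface is ./fast applybpe <output> <input> <codes> [vocab], so the call in the screenshot was presumably of this form (a reconstruction, not the exact command; the output filename is taken from the description above):

./fast applybpe clozebpe ../data/cloze.txt codes   # filenames assumed, not visible in the screenshot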
glample commented 5 years ago

Hi,

This does not seem surprising to me. It is probably because your vocabulary is very small: you seem to have just a few thousand words, yet you asked for 60000 BPE codes. With that many merge operations, every word in such a small vocabulary ends up merged back into a single token, so applybpe leaves the text unchanged. BPE is interesting when the vocabulary is huge (hundreds of thousands of words, which is too slow for a regular softmax). Try learning 500 instead of 60000 BPE codes and see if you obtain something different?
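A minimal sketch of that suggestion, reusing the paths from the original post (500 is only a starting point, not a tuned value, and the output filename is assumed from the post above):

./fast learnbpe 500 ../data/cloze.txt ../data/natural.txt > codes
./fast applybpe clozebpe ../data/cloze.txt codes   # re-segment the corpus with the smaller code set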

kayoyin commented 5 years ago

Yes, that was exactly the problem, thank you so much! 500 seems to work great. Is there any rule to help choose a good number of BPE codes to learn?

glample commented 5 years ago

I don't think there is a clear rule. Usually it's good to try a few different numbers of codes and see what works best :)
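One way to follow that advice is a small sweep over candidate sizes; this is only a sketch, reusing the paths from the original post with arbitrary candidate values:

for n in 250 500 1000 2000; do
  ./fast learnbpe $n ../data/cloze.txt ../data/natural.txt > codes.$n
  ./fast applybpe cloze.bpe.$n ../data/cloze.txt codes.$n   # one segmented corpus per candidate size
done

The choice between them then comes down to validation performance of the downstream NMT model, which fastBPE itself does not measure.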

songtaoshi commented 5 years ago

@glample Hi glample, the words in my corpus have already been converted from the original words into numeric IDs, so each line looks like this:

12751 3191 2273 4939 20743 2289 19864 20588 4846 13300 20201 4939 11612 21224 4939 20677 20583 21224 7747 2379 9391 12093 15274

Is there any option to solve this problem?

songtaoshi commented 5 years ago

@glample How could we treat each number as a single word? It seems that the code will split the numbers apart.
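As far as I can tell fastBPE has no built-in option to protect such tokens: BPE operates on the characters of each word, so numeric IDs get segmented into digit chunks like any other string. One data-level workaround is to rejoin the pieces of purely numeric tokens after applybpe; a sketch, assuming fastBPE's standard @@ continuation marker and a hypothetical output file named corpus.bpe:

sed -E -e ':a' -e 's/([0-9]+)@@ ([0-9])/\1\2/' -e 'ta' corpus.bpe > corpus.bpe.fixed   # repeatedly glue digit pieces back together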