kayoyin closed this issue 5 years ago
Hi,
This does not seem surprising to me. This is probably because your vocabulary is very small. You seem to have just a few thousand words, and you use 60000 BPE splits. BPE are interesting when the vocabulary size is huge (like hundreds of thousands which is too slow for a regular softmax). Try to learn 500 instead of 60000 BPE codes and see if you obtain something different?
Yes, that was exactly the problem, thank you so much! 500 seems to work great. Is there any rule to help choose a good number of BPE codes to learn?
I don't think there is a clear rule. Usually it's good to try a few different numbers of codes and see what works best :)
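To make the effect of the merge count concrete, here is a toy re-implementation of BPE learning in plain Python (fastBPE's `learnbpe` does the same thing far more efficiently, and real BPE also tracks end-of-word markers, which this sketch omits). With only a few merges, words stay split into subword units; with enough merges, frequent words are reassembled into whole tokens, which is why a very large code count can leave short-vocabulary text essentially unchanged:

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dict (toy sketch).

    Returns the list of learned merges and the final segmentation of
    each word as a tuple of subword symbols.
    """
    # Start with each word split into characters.
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:  # every word is already a single token
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the best merge everywhere.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab

# A tiny made-up corpus for illustration.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}

# Few merges: words remain split into subword units.
_, seg_small = learn_bpe(corpus, 3)

# Many merges relative to the vocabulary: merging exhausts itself and
# every word collapses back into a single whole-word token.
_, seg_large = learn_bpe(corpus, 50)
```

This is the situation in the issue in miniature: 60000 codes against a few-thousand-word vocabulary behaves like `seg_large`, so the BPE output looks identical to the input.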
@glample Hi glample, the words in my corpus have been converted from the original words into numeric IDs, like this:
12751 3191 2273 4939 20743 2289 19864 20588 4846 13300 20201 4939 11612 21224 4939 20677 20583 21224 7747 2379 9391 12093 15274
Is there any option to handle this?
@glample How can we treat a number as a single word? It seems that the code will split numbers apart.
Hello,
I am trying to use fastBPE for unsupervised NMT. After learning the codes, when I call `applybpe`, the output I get in `clozebpe` is identical to `../data/cloze.txt`. This isn't the expected behavior, right? How do I split the words in my input text into subword units?
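For reference, the splitting step itself can be sketched in plain Python (a toy illustration of what applying BPE codes does; the merge table below is made up, whereas the real one comes from the learned codes file):

```python
def apply_bpe(word, merges):
    """Apply a list of learned BPE merges, in priority order, to one word.

    Toy sketch: real implementations handle end-of-word markers and
    whole sentences, not just single words.
    """
    symbols = list(word)
    for a, b in merges:  # merges are applied in the order they were learned
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merge table for illustration only.
merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]
print(apply_bpe("lowest", merges))  # ['low', 'est']
print(apply_bpe("low", merges))     # ['low']
```

If the output is identical to the input, the codes are effectively merging every word back together, which is the symptom discussed above when the number of learned codes is too large for the vocabulary.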