Hello Xufeng,

are you specifically talking about apply_bpe or train_bpe? And have you checked memory utilization? Both scripts use some caching that can make memory requirements increase with text size, but I'd expect you to have higher memory requirements with train_bpe. (Also, you could easily disable the cache for apply_bpe, but train_bpe relies more heavily on caching for speed.)
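To illustrate why such a cache makes memory grow with the corpus, here is a minimal sketch of word-level caching during BPE application. This is not the actual apply_bpe code; `encode_word` is a placeholder for the real merge-rule lookup.

```python
# Illustrative sketch only -- not the real apply_bpe implementation.
# It shows why a word-level cache grows with the number of distinct
# word types seen in the input.

def apply_bpe_line(line, encode_word, cache=None):
    """Segment one whitespace-tokenized line with BPE.

    `encode_word` stands in for the expensive merge-rule application;
    `cache` maps already-seen words to their segmentation. Pass
    cache=None to disable caching (slower, but memory stays flat)."""
    out = []
    for word in line.split():
        if cache is not None and word in cache:
            out.append(cache[word])
            continue
        segmented = encode_word(word)   # expensive: applies the merge rules
        if cache is not None:
            cache[word] = segmented     # memory grows with the corpus vocabulary
        out.append(segmented)
    return ' '.join(out)


if __name__ == '__main__':
    dummy = lambda w: ' '.join(w)       # dummy encoder: split into characters
    shared_cache = {}                   # grows by one entry per new word type
    print(apply_bpe_line("hello world hello", dummy, cache=shared_cache))
    print(len(shared_cache))            # 2 distinct words cached
```

The larger the corpus (and the more distinct word types it contains, which is especially pronounced for Chinese after tokenization), the more entries such a cache holds, so memory use keeps climbing even though each line is processed independently.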
I build the BPE vocabulary separately for the source and target sides, in my case English and Chinese tokenized with THULAC. Applying BPE works for both the English and the Chinese data when they are 1,000,000 lines, but it stops working once the data gets larger. I also tried varying the BPE vocabulary size (32000, 64000, etc.), but that didn't help. Does anyone know a fix? Thanks!
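In case it helps with debugging, a quick way to check memory utilization while applying BPE line by line might look like the sketch below. The `segment` callable and the file names are placeholders, not the actual tool's API, and `ru_maxrss` is reported in kilobytes on Linux.

```python
# Sketch for watching peak memory while streaming a large corpus through BPE.
# `segment` stands in for whatever apply_bpe function you are using.
import resource

def apply_streaming(in_path, out_path, segment):
    with open(in_path, encoding='utf-8') as fin, \
         open(out_path, 'w', encoding='utf-8') as fout:
        for i, line in enumerate(fin, 1):
            fout.write(segment(line.rstrip('\n')) + '\n')
            if i % 1_000_000 == 0:
                peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
                print(f"{i} lines processed, peak RSS ~{peak_kb / 1024:.0f} MB")

# Example (placeholder names):
# apply_streaming('train.zh.tok', 'train.zh.bpe', my_bpe.segment)
```

If peak memory keeps rising with line count, the cache (or the input being held in memory all at once) is the likely culprit rather than the BPE vocabulary size.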