rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

Applying BPE (Byte Pair Encoding) fails for large Chinese data tokenized with THULAC #113

Closed caesardai closed 2 years ago

caesardai commented 2 years ago

I build the BPE vocabulary for the source and target sides, in my case English and Chinese tokenized with THULAC. Applying BPE works for both the English and the Chinese data at 1,000,000 lines, but it stops working once the data grows beyond that. I also tried different BPE vocabulary sizes (32000, 64000, etc.), but that didn't help. Does anyone know a fix? Thanks!
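For reference, the workflow described above looks roughly like the following. This is a minimal, non-authoritative sketch using subword-nmt's Python entry points (learn_bpe and BPE, available in recent releases); the file names and the 32000 merge count are placeholders taken from the report, not recommendations.

```python
# Sketch: learn BPE codes from tokenized training data, then apply them line by line.
import codecs
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# 1) Learn the merge operations (placeholder file names; 32000 merges as in the report).
with codecs.open('train.zh.tok', encoding='utf-8') as infile, \
     codecs.open('codes.zh', 'w', encoding='utf-8') as outfile:
    learn_bpe(infile, outfile, 32000)

# 2) Load the learned codes and segment the data.
with codecs.open('codes.zh', encoding='utf-8') as codes:
    bpe = BPE(codes)

with codecs.open('train.zh.tok', encoding='utf-8') as fin, \
     codecs.open('train.zh.bpe', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(bpe.process_line(line))
```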

rsennrich commented 2 years ago

Hello Xufeng,

Are you specifically talking about apply_bpe or learn_bpe? And have you checked memory utilization? Both scripts use some caching that can make memory requirements grow with text size, but I'd expect the memory requirements to be higher for learn_bpe. (Also, you could easily disable the cache for apply_bpe, whereas learn_bpe relies more heavily on caching for speed.)
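To make the memory suggestion concrete, here is a rough sketch of applying BPE to a large file while tracking peak memory and periodically clearing the segmentation cache. It assumes the BPE object keeps an internal dict-based cache (hence the hasattr guard, in case a given version does not expose it), and all file names are placeholders; treat it as an illustration of the trade-off, not as the library's documented API for cache control.

```python
# Sketch: stream a large file through apply_bpe, watch peak RSS, and bound cache growth.
import codecs
import resource
from subword_nmt.apply_bpe import BPE

with codecs.open('codes.zh', encoding='utf-8') as codes:
    bpe = BPE(codes)

with codecs.open('train.zh.tok', encoding='utf-8') as fin, \
     codecs.open('train.zh.bpe', 'w', encoding='utf-8') as fout:
    for i, line in enumerate(fin, 1):
        fout.write(bpe.process_line(line))
        if i % 1_000_000 == 0:
            # Peak resident set size so far (reported in KB on Linux).
            peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            print(f'{i} lines processed, peak RSS ~{peak_kb // 1024} MB')
            # Dropping the cache trades segmentation speed for bounded memory.
            if hasattr(bpe, 'cache'):
                bpe.cache.clear()
```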