I have a large corpus, around 40GB of text. I installed subword-nmt via pip and tried to build the vocabulary with the subword-nmt command line, but it takes forever to finish. Is there any solution for this situation?
subword-nmt is optimized for use cases where words are separated by whitespace, and ideally already tokenized. If this is not the case for your corpus, consider tokenizing it first (e.g. with the Moses tokenizer, or Jieba for Chinese), or consider SentencePiece, which performs both tokenization and subword segmentation: https://github.com/google/sentencepiece
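If you go the SentencePiece route, something like the following should work (a minimal sketch; `corpus.txt`, the vocabulary size, and the sample size are illustrative placeholders):

```sh
# Train a BPE model directly on raw text; SentencePiece does its own
# tokenization, so no whitespace pre-tokenization is required.
# input_sentence_size + shuffle_input_sentence make training subsample
# the corpus, which keeps runtime manageable on very large inputs.
spm_train \
  --input=corpus.txt \
  --model_prefix=bpe \
  --model_type=bpe \
  --vocab_size=32000 \
  --input_sentence_size=10000000 \
  --shuffle_input_sentence=true
```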
Independent of the implementation, you can also learn the BPE segmentation on a random subset of your original corpus. I doubt that the quality of the segmentation will change much between training it on 1 million or 100 million sentences of data.
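For example, with GNU `shuf` (again a sketch; the file names, subset size, and number of merge operations are placeholders):

```sh
# Sample 10M random lines from the full corpus; with -n, recent GNU
# shuf versions reservoir-sample instead of shuffling the whole file.
shuf -n 10000000 corpus.txt > subset.txt

# Learn the BPE merge operations on the subset only...
subword-nmt learn-bpe -s 32000 < subset.txt > codes.bpe

# ...then apply them to the full corpus as usual.
subword-nmt apply-bpe -c codes.bpe < corpus.txt > corpus.bpe
```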