I have a large corpus, around 40GB of text. I installed subword-nmt via pip and tried to build the vocabulary with the subword-nmt command line, but it takes forever to finish. Is there any solution for this situation?
subword-nmt is optimized for use cases where words are separated by whitespace, and ideally already tokenized. If this is not the case for your corpus, consider tokenizing it first (e.g. with the Moses tokenizer, or Jieba for Chinese), or consider SentencePiece, which performs both tokenization and subword segmentation: https://github.com/google/sentencepiece
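If you go the SentencePiece route, something like the following should work (a minimal sketch; `corpus.txt`, the vocabulary size, and the sample size are illustrative placeholders):

```sh
# Train a BPE model directly on raw text; SentencePiece does its own
# tokenization, so no whitespace pre-tokenization is required.
# input_sentence_size + shuffle_input_sentence make training subsample
# the corpus, which keeps runtime manageable on very large inputs.
spm_train \
  --input=corpus.txt \
  --model_prefix=bpe \
  --model_type=bpe \
  --vocab_size=32000 \
  --input_sentence_size=10000000 \
  --shuffle_input_sentence=true
```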
Independent of the implementation, you can also learn the BPE segmentation on a random subset of your original corpus. I doubt that the quality of the segmentation will change much between training it on 1 million or 100 million sentences of data.
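For example, with GNU `shuf` (again a sketch; the file names, subset size, and number of merge operations are placeholders):

```sh
# Sample 10M random lines from the full corpus; with -n, recent GNU
# shuf versions reservoir-sample instead of shuffling the whole file.
shuf -n 10000000 corpus.txt > subset.txt

# Learn the BPE merge operations on the subset only...
subword-nmt learn-bpe -s 32000 < subset.txt > codes.bpe

# ...then apply them to the full corpus as usual.
subword-nmt apply-bpe -c codes.bpe < corpus.txt > corpus.bpe
```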