rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

BUG : Generating a vocab.bpe file "Killed" #96

Closed Skylixia closed 4 years ago

Skylixia commented 4 years ago

Hello,

I am trying to generate a vocab.bpe file from a large Dutch corpus to use for GPT-2 encoding. I run the following command in subword-nmt/subword_nmt:

python learn_bpe.py -o ./vocab.bpe -i corpus --symbols 50000

After a while, "Killed" is printed on the terminal. How should I proceed?

rsennrich commented 4 years ago

This is probably the result of running out of memory. learn_bpe.py caches the vocabulary for efficiency, so memory consumption increases for larger corpora. If you have access to a machine with more RAM, use that. If you're working on a shared machine, check whether other processes are running that require a lot of RAM.

If you can't get access to more RAM, options include learning BPE only on a random subset of the corpus (which should not have a big effect on its quality), or trying different BPE implementations (although I don't know how they fare in terms of memory efficiency).
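As a rough illustration of the random-subset option, here is a minimal sketch that draws a uniform sample of lines from a corpus in a single pass (reservoir sampling), so the full corpus never has to fit in memory. It is not part of subword-nmt; the function name `sample_lines`, the file names, and the sample size are assumptions chosen for the example.

```python
import random


def sample_lines(in_path, out_path, k, seed=0):
    """Write a uniform random sample of up to k lines from in_path to out_path.

    Hypothetical helper, not part of subword-nmt. Only k lines are held
    in memory at any time (reservoir sampling).
    """
    rng = random.Random(seed)
    reservoir = []
    with open(in_path, encoding="utf-8") as src:
        for i, line in enumerate(src):
            # Normalise the line ending so a final line without "\n"
            # cannot be glued to its neighbour in the output.
            line = line.rstrip("\n") + "\n"
            if i < k:
                reservoir.append(line)
            else:
                # Keep the new line with probability k / (i + 1).
                j = rng.randint(0, i)
                if j < k:
                    reservoir[j] = line
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.writelines(reservoir)
    return len(reservoir)
```

For example, `sample_lines("corpus", "corpus.sample", 1000000)` would write up to one million lines to `corpus.sample`, which can then be passed to the command from the question in place of the full corpus: `python learn_bpe.py -o ./vocab.bpe -i corpus.sample --symbols 50000`. A suitable sample size depends on the available RAM and is something to find by experiment.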

Skylixia commented 4 years ago

This indeed solves the issue! Thank you.