Closed: Skylixia closed this issue 4 years ago
This is probably a result of running out of memory. learn_bpe.py caches the vocabulary for efficiency, which makes memory consumption grow with corpus size. If you have access to a machine with more RAM, use that. If you're working on a shared machine, check whether other processes are running that require a lot of RAM.
If you can't get access to more RAM, options include learning BPE on only a random subset of the corpus (which should not have a big effect on the quality of the learned merge operations), or trying a different BPE implementation (although I don't know how they fare in terms of memory efficiency).
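For reference, here is a minimal sketch of the subsetting step. The file names (`corpus`, `corpus.subset`) and the 10% sampling rate are just assumptions taken from the command in the question; adjust them to your setup. On Linux you could also use `shuf -n <N> corpus > corpus.subset` instead.

```python
import random

# Keep roughly 10% of the lines (adjust to taste). Line-level sampling is
# enough for BPE, which only needs representative word/character statistics.
keep_prob = 0.1
random.seed(0)  # fixed seed so the subset is reproducible

with open("corpus", encoding="utf-8") as src, \
     open("corpus.subset", "w", encoding="utf-8") as dst:
    for line in src:
        if random.random() < keep_prob:
            dst.write(line)
```

Then learn the BPE merges on the subset, e.g. `python learn_bpe.py -o ./vocab.bpe -i corpus.subset --symbols 50000`.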
This indeed solves the issue! Thank you.
Hello,
I am trying to generate a vocab.bpe file from a large corpus in Dutch to use for GPT-2 encoding. I run the following command from subword-nmt/subword_nmt: `python learn_bpe.py -o ./vocab.bpe -i corpus --symbols 50000`. After a while, "Killed" is printed to the terminal. How should I proceed?