Open CuriousDeepLearner opened 5 years ago
Does your Python program terminate? If you replace python process.py with cat, does it work?
No. Nothing changes. It still gets stuck at this point. I followed this tutorial: https://yidatao.github.io/2017-05-31/kenlm-ngram/
@kpu For example, if I run model on a corpus of 1 line :
marines están habilitando un emplazamiento donde reagrupar a unos digito digito digito combatientes de al qaida susceptibles de rendirse o de caer prisioneros
The problem stays the same. I even tried it on Colab and hit the same problem.
Have you run
kenlm/build/bin/lmplz -S 8G -o 5 <README.md >spanish_5gram.arpa
And how much RAM do you have?
For the corpus in a text file, I think I found why it didn't work: the < and > redirections before text.txt and the output file were missing. So it must be:
kenlm/build/bin/lmplz -S 8G -o 5 <text.txt >spanish_5gram.arpa
But if the corpus is compressed, in a .tar file for example, I don't know how to fix it. @kpu, do you have any idea? How could we run the model without uncompressing the file?
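One caveat with piping a .tar.bz2: bzcat decompresses the archive but still emits the raw tar stream, including tar headers, mixed in with the text. A hedged sketch (file names taken from the thread; the helper below is mine, not part of kenlm) that streams only the member files' contents, without extracting anything to disk:

```python
import sys
import tarfile


def stream_tar_bz2(path, out):
    """Write the contents of every regular file inside a .tar.bz2 to
    the given binary stream, skipping tar headers and directories."""
    with tarfile.open(path, "r:bz2") as tar:
        for member in tar:
            if member.isfile():
                out.write(tar.extractfile(member).read())


if __name__ == "__main__":
    stream_tar_bz2("clean_corpus.tar.bz2", sys.stdout.buffer)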
bzcat clean_corpus.tar.bz2 | python process.py | kenlm/build/bin/lmplz -S 8G -o 5 > spanish_5gram.arpa
Does this work?
cat README.md |build/bin/lmplz --discount_fallback -o 5 >/dev/null
@kpu No. It doesn't change anything...
What does it print?
It prints the same as before:
=== 1/5 Counting and sorting n-grams ===
File stdin isn't normal. Using slower read() instead of mmap(). No progress bar
Something doesn't smell right and I'm unable to reproduce this. Is this running on Windows or something?
I just encountered this phenomenon; I'm running the program on Ubuntu.
I tried to train a language model on a corpus, but it seems to get stuck at the beginning. I couldn't investigate the cause.
bzcat clean_corpus.tar.bz2 | python process.py | kenlm/build/bin/lmplz -S 8G -o 5 > spanish_5gram.arpa
It gets stuck at this step:
=== 1/5 Counting and sorting n-grams ===
File stdin isn't normal. Using slower read() instead of mmap(). No progress bar.
tcmalloc: large alloc 1511432192 bytes == 0x56104f802000 @ 0x7f70271e31e7 0x56104d4847e2 0x56104d419368 0x56104d3f81f6 0x56104d3e40d6 0x7f702537cb97 0x56104d3e5b1a
tcmalloc: large alloc 7053344768 bytes == 0x5610a996c000 @ 0x7f70271e31e7 0x56104d4847e2 0x56104d46f6ca 0x56104d4700e8 0x56104d3f8213 0x56104d3e40d6 0x7f702537cb97 0x56104d3e5b1a