kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

training corpus stuck #208

Open CuriousDeepLearner opened 5 years ago

CuriousDeepLearner commented 5 years ago

I tried to train a language model on a corpus, but it seems to get stuck right at the beginning and I couldn't figure out the cause.

bzcat clean_corpus.tar.bz2 | python process.py | kenlm/build/bin/lmplz -S 8G -o 5 > spanish_5gram.arpa

It gets stuck at this step:

=== 1/5 Counting and sorting n-grams ===
File stdin isn't normal. Using slower read() instead of mmap(). No progress bar.
tcmalloc: large alloc 1511432192 bytes == 0x56104f802000 @ 0x7f70271e31e7 0x56104d4847e2 0x56104d419368 0x56104d3f81f6 0x56104d3e40d6 0x7f702537cb97 0x56104d3e5b1a
tcmalloc: large alloc 7053344768 bytes == 0x5610a996c000 @ 0x7f70271e31e7 0x56104d4847e2 0x56104d46f6ca 0x56104d4700e8 0x56104d3f8213 0x56104d3e40d6 0x7f702537cb97 0x56104d3e5b1a
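process.py is not shown anywhere in this thread, so the following is only a sketch of what a preprocessing script in that position typically looks like: it streams stdin line by line and writes one whitespace-tokenized sentence per line to stdout, which is the input format lmplz expects. The lowercasing and punctuation stripping here are assumptions, not the reporter's actual code.

```python
#!/usr/bin/env python
# Hypothetical stand-in for process.py (the real script is not shown in this issue).
# Reads raw corpus lines from stdin, applies simple normalization, and writes one
# whitespace-tokenized sentence per line to stdout for lmplz.
import re
import sys

for line in sys.stdin:
    # Illustrative cleanup only: lowercase and replace punctuation with spaces.
    line = line.strip().lower()
    line = re.sub(r"[^\w\s]", " ", line, flags=re.UNICODE)
    tokens = line.split()
    if tokens:
        sys.stdout.write(" ".join(tokens) + "\n")
```

Whether the real script terminates and keeps producing output is exactly what the cat substitution suggested below is meant to test.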

kpu commented 5 years ago

Does your python program terminate? If you replace python process.py with cat does it work?

CuriousDeepLearner commented 5 years ago

No, nothing changes; it still gets stuck at the same point. I followed this tutorial: https://yidatao.github.io/2017-05-31/kenlm-ngram/

CuriousDeepLearner commented 5 years ago

@kpu For example, if I run the model on a corpus of just one line: marines están habilitando un emplazamiento donde reagrupar a unos digito digito digito combatientes de al qaida susceptibles de rendirse o de caer prisioneros

The problem stays the same. I even tried on Colab and hit the same problem.

kpu commented 5 years ago

Have you run

kenlm/build/bin/lmplz -S 8G -o 5 <README.md  >spanish_5gram.arpa

And how much RAM do you have?

CuriousDeepLearner commented 5 years ago

For the corpus in a plain text file, I think I found why it didn't work: the < and > redirections around text.txt were missing. So it must be kenlm/build/bin/lmplz -S 8G -o 5 <text.txt >spanish_5gram.arpa

But if the corpus is compressed, in a .tar.bz2 for example, I don't know how to handle it. @kpu, do you have any idea? How could we run the model without uncompressing the file?

bzcat clean_corpus.tar.bz2 | python process.py | kenlm/build/bin/lmplz -S 8G -o 5 > spanish_5gram.arpa
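One way to avoid decompressing to disk, sketched here purely as an assumption (nothing in the thread confirms how process.py is written), is to let the script open the archive itself with Python's tarfile module in streaming mode and print the text of each member to stdout:

```python
#!/usr/bin/env python
# Hypothetical sketch: stream the members of clean_corpus.tar.bz2 to stdout
# without extracting the archive to disk. The archive name comes from the
# reporter's command; everything else is illustrative.
import sys
import tarfile

with tarfile.open("clean_corpus.tar.bz2", mode="r|bz2") as archive:
    for member in archive:          # iterate members sequentially, no seeking
        if not member.isfile():
            continue
        handle = archive.extractfile(member)
        if handle is None:
            continue
        for raw in handle:          # raw is a bytes line
            sys.stdout.write(raw.decode("utf-8", errors="replace"))
```

The pipeline would then be python process.py | kenlm/build/bin/lmplz -S 8G -o 5 > spanish_5gram.arpa. Note that bzcat on a .tar.bz2 emits the raw tar stream, headers and padding included, so unpacking the members inside the script also keeps that bookkeeping out of the training data.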

kpu commented 5 years ago

Does this work?

cat README.md | build/bin/lmplz --discount_fallback -o 5 >/dev/null

CuriousDeepLearner commented 5 years ago

@kpu No. It doesn't change anything...

kpu commented 5 years ago

What does it print?

CuriousDeepLearner commented 5 years ago

It prints the same as before:

=== 1/5 Counting and sorting n-grams ===
File stdin isn't normal. Using slower read() instead of mmap(). No progress bar.

kpu commented 5 years ago

Something doesn't smell right and I'm unable to reproduce this. Is this running on Windows or something?

GingerNg commented 5 years ago

I just encountered this phenomenon; I'm running the program on Ubuntu.