kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Generating arpa file taking too long #369

Open ashu5644 opened 2 years ago

ashu5644 commented 2 years ago

Hi @kpu, I am trying to generate an ARPA file from a text corpus of about 20 GB, and it is taking too long. The first four steps are relatively fast, but step 5 ("=== 5/5 Writing ARPA model ===") is very slow. Could you give a rough estimate of how long it takes to generate an order-5 ARPA file for a corpus this large, and suggest any ways to speed it up? My final aim is to build a binary from this ARPA file.

kpu commented 2 years ago

Is it disk bound or CPU bound? I had a custom branch that generated the trie directly during building, but it was too hacky to release.
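One rough way to answer the disk-bound versus CPU-bound question is to compare the CPU time a command consumes with the wall-clock time it takes. The sketch below is only an illustration, not part of KenLM: the function names and the 0.5 threshold are hypothetical, and it relies on `os.times()` reporting child CPU time, which works on Unix but not on Windows. It measures a whole run, so it is best tried on a small sample of the corpus.

```python
import os
import subprocess
import time


def cpu_ratio(cpu_seconds, wall_seconds):
    """CPU seconds consumed per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return cpu_seconds / wall_seconds


def classify(ratio, threshold=0.5):
    """Label a run; the threshold is an arbitrary cut-off chosen for illustration."""
    if ratio >= threshold:
        return "likely CPU bound"
    return "likely waiting on I/O"


def measure(command):
    """Run command (a list of arguments) and return (cpu_seconds, wall_seconds)."""
    before = os.times()
    start = time.monotonic()
    subprocess.run(command, check=True)
    wall = time.monotonic() - start
    after = os.times()
    # Child CPU time is only accounted for once the child has been waited on,
    # which subprocess.run does before returning.
    cpu = ((after.children_user - before.children_user)
           + (after.children_system - before.children_system))
    return cpu, wall


# Hypothetical usage with any command you want to profile:
# cpu, wall = measure(["some_command", "arg1"])
# print(classify(cpu_ratio(cpu, wall)))
```

A multi-threaded program can produce a ratio above 1, so a low ratio is the more informative signal: it suggests the process spends most of its time waiting rather than computing.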

ashu5644 commented 2 years ago

Disk space is not an issue, and there is enough free RAM. But step 5 is writing the ARPA file very slowly: the file grows by about 1 MB per minute. I cannot figure out the cause of the slowness, even though plenty of free RAM and disk space is available. I am using the command: "bin/lmplz -o 5 -S 80% -T ../../dir text.arpa"
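The "about 1 MB per minute" figure can be measured rather than eyeballed by sampling the output file size twice. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the path is whatever file is being written, and 1 MB is taken as 1024 * 1024 bytes.

```python
import os
import time


def growth_mb_per_min(size_before, size_after, elapsed_seconds):
    """Convert a byte-size difference over elapsed_seconds into MB per minute."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    megabytes = (size_after - size_before) / (1024 * 1024)
    return megabytes / (elapsed_seconds / 60)


def sample_growth(path, interval_seconds=60):
    """Measure how fast the file at path grows over one sampling interval."""
    before = os.path.getsize(path)
    start = time.monotonic()
    time.sleep(interval_seconds)
    after = os.path.getsize(path)
    return growth_mb_per_min(before, after, time.monotonic() - start)


# Hypothetical usage while the model is being written:
# print(sample_growth("text.arpa", interval_seconds=60))
```

Sampling over a few intervals shows whether the write rate is steady or stalls intermittently, which helps narrow down where the time goes.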

maxwellzh commented 2 years ago

I believe this is not a code-related issue. I trained a 5-gram model on a 47 GB corpus, which took around 3 hours, and it could be even faster since I set -S 20%.
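For a rough point of comparison, the figures above (47 GB in around 3 hours) can be scaled to a 20 GB corpus. The sketch assumes build time grows linearly with corpus size, which is only an approximation, and it ignores differences in hardware, memory settings, and vocabulary.

```python
def throughput_gb_per_hour(corpus_gb, hours):
    """Average processing rate for a finished run."""
    return corpus_gb / hours


def estimate_hours(corpus_gb, rate_gb_per_hour):
    """Projected duration, assuming the same rate applies (linear scaling)."""
    return corpus_gb / rate_gb_per_hour


rate = throughput_gb_per_hour(47, 3)        # figures reported in this thread
print(round(rate, 1))                       # 15.7
print(round(estimate_hours(20, rate), 1))   # 1.3
```

Under that assumption a 20 GB corpus would take on the order of an hour or two on comparable hardware, so a run that writes about 1 MB per minute is far off the expected pace.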