kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Invalid n-gram in ARPA #247

Closed 00001101-xt closed 4 years ago

00001101-xt commented 4 years ago

Hello, everyone.

I trained a language model with the KenLM toolkit on a large corpus (>150G), then pruned it at a threshold such as 5e-9.

However, when converting it to an FST using utils/format_lm.sh in Kaldi, I got an error saying line 55 [-8.702529 ]: Invalid n-gram data line. I tried deleting that line manually, but there are other invalid n-grams, for example line 1417117 [-0.4900205 最新进展]: Invalid n-gram data line.

So I wonder what the problem with my text corpus might be. Also, why does Kaldi treat this as an error when KenLM and SRILM did not complain during training and pruning?

# some commands
lmplz -o 3 < text > lm.arpa
gzip -c lm.arpa > lm.gz
ngram -lm lm.gz -prune 5e-9 -write-lm lm_pruned.gz
# at last arpa2fst to convert lm_pruned.gz to fst 

Thanks.

kpu commented 4 years ago

Can you isolate those exact lines (and maybe some context), then attach a file?

Off the wall guess is vertical tab. Apparently SRILM doesn't consider a vertical tab to be whitespace, so I don't either. But Kaldi might consider it to be whitespace.
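
One way to test a guess like this without sharing the file is to scan each line for whitespace or control characters other than plain space, tab and newline. The following is only a rough sketch: the set of "expected" separators and the helper names `odd_chars` and `scan_arpa` are assumptions made for illustration, not part of KenLM, SRILM or Kaldi.

```python
import gzip
import unicodedata

# Assumption: space, tab and newline are the only separators we expect in an ARPA line.
EXPECTED = {" ", "\t", "\n"}

def odd_chars(line):
    """Return (index, repr, Unicode name) for each unexpected whitespace/control character."""
    found = []
    for i, ch in enumerate(line):
        if ch in EXPECTED:
            continue
        # Cc/Cf are control and format characters; Zs/Zl/Zp are Unicode separators.
        if ch.isspace() or unicodedata.category(ch) in ("Cc", "Cf", "Zs", "Zl", "Zp"):
            # Control characters such as vertical tab have no Unicode name.
            found.append((i, repr(ch), unicodedata.name(ch, "<unnamed>")))
    return found

def scan_arpa(path):
    """Print every line of a gzipped ARPA file that contains a suspicious character."""
    # newline="\n" keeps a stray carriage return visible instead of translating it.
    with gzip.open(path, "rt", encoding="utf-8", newline="\n") as handle:
        for number, line in enumerate(handle, 1):
            hits = odd_chars(line)
            if hits:
                print(number, hits)
```

Calling `scan_arpa("lm.gz")` would list the line numbers and offsets of any vertical tabs, carriage returns, no-break spaces and similar characters, which can then be compared against the line numbers Kaldi reports.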

00001101-xt commented 4 years ago

Hi @kpu, I'm sorry, but the ARPA file cannot be attached due to privacy issues.

Here is what I did next:

- zcat lm.gz | head -n 56 > temp (since "line 55 [-8.702529 ]: Invalid n-gram data line")
- opened the file 'temp' in python3 with UTF-8 encoding and printed the line; the output is listed below:

'-8.702529\t\n'

After deleting the 55th line, there is another similar issue at line 1417117:

'-0.4900205\t 最新进展\n'
['-', '0', '.', '4', '9', '0', '0', '2', '0', '5', '\t', ' ', '最', '新', '进', '展', '\n']

In the n-gram above, there is an extra whitespace character right after the '\t'. Is this why Kaldi reports it as an invalid n-gram line?

00001101-xt commented 4 years ago

Deleting the extra whitespace in -0.4900205\t 最新进展\n didn't help, because the line is treated as a 2-gram, and without the whitespace it is evidently a unigram. A unigram cannot be parsed as a 2-gram, so Kaldi reports it as an error, according to line 182 of arpa-file-parser.cc.

Solution:

Delete the invalid n-gram data line(s).
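
For a large model, deleting lines by hand gets tedious, so the cleanup could be scripted. The sketch below assumes the usual ARPA layout (a log-probability, a tab, the words separated by single spaces, and an optional tab plus backoff weight) and drops any line whose word count does not match its section or that contains an empty word. The function name `filter_arpa` is hypothetical, and the check is an approximation of what a strict parser might reject, not a copy of Kaldi's logic. It also rewrites the `ngram N=count` header lines so they stay consistent with the body.

```python
import re

SECTION = re.compile(r"^\\(\d+)-grams:\s*$")   # e.g. \2-grams:
COUNT = re.compile(r"^ngram (\d+)=(\d+)\s*$")  # e.g. ngram 2=12345

def _is_float(text):
    try:
        float(text)
        return True
    except ValueError:
        return False

def filter_arpa(lines):
    """Return (kept_lines, dropped); dropped holds (line_number, line) pairs.

    Lines are returned without their trailing newline.
    """
    kept, dropped, counts = [], [], {}
    order = 0  # 0 means we are outside any \N-grams: section
    for number, raw in enumerate(lines, 1):
        line = raw.rstrip("\n")
        match = SECTION.match(line)
        if match:
            order = int(match.group(1))
            counts[order] = 0
        elif line.strip() == "\\end\\":
            order = 0
        elif order and line.strip():
            fields = line.split("\t")
            words = fields[1].split(" ") if len(fields) > 1 else []
            valid = (
                len(fields) in (2, 3)
                and _is_float(fields[0])
                and len(words) == order
                and all(words)  # rejects empty words, e.g. after a stray space
                and (len(fields) == 2 or _is_float(fields[2]))
            )
            if not valid:
                dropped.append((number, line))
                continue
            counts[order] += 1
        kept.append(line)

    def fix_count(line):
        # Update header counts to the number of n-grams actually kept.
        match = COUNT.match(line)
        if match and int(match.group(1)) in counts:
            n = int(match.group(1))
            return "ngram %d=%d" % (n, counts[n])
        return line

    return [fix_count(line) for line in kept], dropped
```

Both lines quoted in this thread would be dropped by this check: `-8.702529\t` has an empty word, and `-0.4900205\t 最新进展` has an empty first word because of the space after the tab. It is worth reviewing the `dropped` list before writing the cleaned file, since removing n-grams changes the model.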