Closed — 00001101-xt closed this issue 4 years ago
Can you isolate those exact lines (and maybe some context), then attach a file?
An off-the-wall guess is a vertical tab. Apparently SRILM doesn't consider a vertical tab to be whitespace, so I don't either. But Kaldi might consider it to be whitespace.
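For illustration, the guess above is easy to demonstrate in Python, whose `str.split()` treats the vertical tab (`\x0b`) as whitespace, while a parser that splits only on tab (as SRILM and KenLM reportedly do) would keep it inside a token. The sample line here is made up:

```python
# Hypothetical sample line with a vertical tab embedded in a word.
line = "-1.5\tfoo\x0bbar\n"

# Python (and, per the guess, Kaldi) treats \x0b as whitespace:
print(line.split())                    # ['-1.5', 'foo', 'bar']

# A tab-only split keeps the token whole, as SRILM/KenLM reportedly would:
print(line.rstrip("\n").split("\t"))   # ['-1.5', 'foo\x0bbar']
```

If the corpus contained such a character inside a word, the two toolkits would disagree on how many tokens the line has.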
Hi, @kpu I'm sorry the ARPA file cannot be attached due to some privacy issues.
Here is what I did next:
- `zcat lm.gz | head -n 56 > temp` (since the error was "line 55 [-8.702529 ]: Invalid n-gram data line")
- opened the file `temp` in Python 3 with UTF-8 encoding and printed the line; the output is listed below:
'-8.702529\t\n'
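For illustration, a minimal sketch (not Kaldi's actual parser) of why that line is malformed: an ARPA data line has the shape `logprob<TAB>word1 word2 ...`, but here nothing follows the tab, so there is a log-probability with no n-gram:

```python
# The suspect line printed above: a weight followed by an empty word field.
suspect = '-8.702529\t\n'
fields = suspect.rstrip('\n').split('\t')
print(fields)            # ['-8.702529', ''] — no tokens after the log-prob
assert fields[1] == ''   # the n-gram itself is missing
```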
After deleting the 55th line, there is another similar issue at line 1417117:
'-0.4900205\t 最新进展\n'
['-', '0', '.', '4', '9', '0', '0', '2', '0', '5', '\t', ' ', '最', '新', '进', '展', '\n']
In the n-gram above, there is an extra space right after the `\t`. Is this why Kaldi reported it as an invalid n-gram line?
Deleting the extra whitespace won't help here. With the space, `-0.4900205\t 最新进展\n` is read as a 2-gram; without it, the line is apparently a unigram. A unigram cannot be parsed as a 2-gram, so Kaldi will still report this as an issue, according to line 182 of arpa-file-parser.cc.
Delete the invalid n-gram data line(s).
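A hedged sketch of that fix: scan the ARPA file and drop data lines whose token count doesn't match the current `\N-grams:` section. This mirrors the arity check described above, but it is not Kaldi's code:

```python
import re

def filter_arpa(lines):
    """Yield ARPA lines, dropping data lines whose arity mismatches the section."""
    order = None
    for line in lines:
        m = re.match(r'\\(\d+)-grams:', line)
        if m:
            order = int(m.group(1))  # entered the \N-grams: section
            yield line
            continue
        # Data lines look like: logprob<TAB>words[<TAB>backoff]
        parts = line.rstrip('\n').split('\t')
        if order is not None and len(parts) >= 2:
            words = [w for w in parts[1].split(' ') if w]
            if len(words) != order:   # e.g. '-8.702529\t' or a stray space
                continue              # drop the invalid line
        yield line
```

Note that this leaves the `ngram N=...` counts in the `\data\` header stale; depending on the consumer they may need to be corrected by hand.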
Hello, everyone.
I trained a language model using this KenLM toolkit on a large corpus (>150 GB), then pruned at some threshold such as 5e-9. However, when converting to an FST using `utils/format_lm.sh` in Kaldi, I encountered the error `line 55 [-8.702529 ]: Invalid n-gram data line`. I tried to delete that line manually, but there are other invalid n-grams, e.g. `line 1417117 [-0.4900205 最新进展]: Invalid n-gram data line`. So I wonder: where might the problem be in my text corpus? Also, why did Kaldi find this to be an issue while KenLM and SRILM didn't during my training and pruning stages?
Thanks.
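Following the vertical-tab guess elsewhere in this thread, one way to hunt for the corpus-side cause is to scan for characters that Python (and many tools) treat as whitespace but that may survive other tokenizers. This is a hypothetical diagnostic; the suspect set is an assumption, not an exhaustive list:

```python
# Characters Python considers whitespace that tab/space-only tokenizers may
# pass through: vertical tab, form feed, NEL, NBSP, line/paragraph separator.
SUSPECTS = {'\x0b', '\x0c', '\x85', '\xa0', '\u2028', '\u2029'}

def find_odd_whitespace(lines):
    """Yield (line_number, offending_chars) for lines with unusual whitespace."""
    for lineno, line in enumerate(lines, 1):
        hits = SUSPECTS.intersection(line)
        if hits:
            yield lineno, sorted(map(repr, hits))
```

Running this over the training corpus (e.g. `find_odd_whitespace(open("corpus.txt", encoding="utf-8"))`, file name assumed) would flag lines that different toolkits might tokenize differently.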