kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

.arpa models are lower case only #355

Closed korotaS closed 3 years ago

korotaS commented 3 years ago

Hi! I am using KenLM to build some simple LMs for ordinary Latin/Cyrillic text. The input texts are in mixed case, with some capital letters, but in the resulting .arpa file all the words and n-grams are in lower case. Is this expected behavior? If it is, can I force the LM to keep mixed case? I am trying to use beam search with KenLM on OCR output (which is in mixed case), so I need the LM to preserve case as well.

kpu commented 3 years ago

This is not expected behavior. The toolkit does not know how to lowercase data.

Unable to reproduce:

$ exec bash
$ bin/lmplz --discount_fallback -o 2 <<<"Test" 2>/dev/null
\data\
ngram 1=4
ngram 2=2

\1-grams:
-0.7781512      <unk>   0
0       <s>     -0.30103
-0.38021123     </s>    0
-0.38021123     Test    -0.30103

\2-grams:
-0.1497623      Test </s>
-0.1497623      <s> Test

\end\

korotaS commented 3 years ago

@kpu thank you for the quick reply! I checked my code and models again and found that everything is OK: there was a '.lower()' call in the tokenization step, which I thought I had disabled.
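
For anyone who hits the same symptom, here is a minimal Python sketch of two checks that would likely have caught it. It is not part of KenLM: the function names `cased_unigrams` and `tokenize` are hypothetical, and the parser assumes the standard ARPA layout shown above (a `\1-grams:` section whose lines are log probability, word, optional backoff). The first function reports unigrams that contain uppercase characters; the second is a whitespace tokenizer where lowercasing is opt-in, so it cannot stay enabled by accident.

```python
def cased_unigrams(arpa_lines):
    """Return unigrams from ARPA text that contain an uppercase character.

    Hypothetical helper; assumes the usual ARPA layout with a
    '\\1-grams:' section of 'logprob word [backoff]' lines.
    """
    in_unigrams = False
    found = []
    for raw in arpa_lines:
        line = raw.strip()
        if line.startswith("\\"):
            # Section markers: \data\, \1-grams:, \2-grams:, \end\
            in_unigrams = (line == "\\1-grams:")
            continue
        if not in_unigrams or not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        word = fields[1]  # fields[0] is the log10 probability
        if any(ch.isupper() for ch in word):
            found.append(word)
    return found


def tokenize(text, lowercase=False):
    """Whitespace tokenizer; lowercasing must be requested explicitly."""
    if lowercase:
        text = text.lower()
    return text.split()
```

Passing the lines of a trained .arpa file (opened as UTF-8 text) to `cased_unigrams` and getting an empty list back, when the training text is known to contain capitals, suggests that the lowercasing happened in preprocessing rather than in `lmplz`.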