korotaS closed this issue 3 years ago
This is not expected behavior. The toolkit does not know how to lowercase data.
Unable to reproduce:
$ exec bash
$ bin/lmplz --discount_fallback -o 2 <<<"Test" 2>/dev/null
\data\
ngram 1=4
ngram 2=2
\1-grams:
-0.7781512 <unk> 0
0 <s> -0.30103
-0.38021123 </s> 0
-0.38021123 Test -0.30103
\2-grams:
-0.1497623 Test </s>
-0.1497623 <s> Test
\end\
@kpu thank you for the quick reply! I checked my code and models again and found that everything is OK: there was a '.lower()' call in my tokenization step, which I thought I had disabled.
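For anyone hitting the same thing: a minimal sketch of a tokenization step where lowercasing is an explicit opt-in, plus a quick check on the text that is about to be piped into lmplz. The names `tokenize`, `prepare_corpus`, and `has_uppercase` are hypothetical and only for illustration. The real tokenizer in the pipeline above is not shown in this thread, so plain whitespace splitting is assumed here.

```python
from typing import Iterable, List


def tokenize(line: str, lowercase: bool = False) -> List[str]:
    """Whitespace tokenizer (an assumption); lowercases only when asked to."""
    tokens = line.split()
    if lowercase:
        tokens = [tok.lower() for tok in tokens]
    return tokens


def prepare_corpus(lines: Iterable[str], lowercase: bool = False) -> List[str]:
    """Return one space-joined sentence per input line, ready for LM training."""
    return [" ".join(tokenize(line, lowercase)) for line in lines]


def has_uppercase(lines: Iterable[str]) -> bool:
    """True if any character in the prepared text is upper case."""
    return any(ch.isupper() for line in lines for ch in line)


if __name__ == "__main__":
    raw = ["Test sentence", "Привет мир"]
    corpus = prepare_corpus(raw)  # lowercase defaults to False
    # If the input was mixed case and this prints False,
    # something upstream is still lowercasing the text.
    print(has_uppercase(corpus))
```

Running a check like `has_uppercase` on the exact text fed to lmplz separates a preprocessing problem from a toolkit problem before any model is built.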
Hi! I am using KenLM to build some simple LMs for ordinary Latin/Cyrillic data. The input texts are in mixed case, with some capital letters. But when I check the resulting .arpa file, all the words and n-grams are in lower case. Is this expected behavior? And if it is, can I force the LM to keep mixed case? The issue is that I am trying to use beam search with KenLM on OCR outputs (which keep their original case), so it is better for me to have an LM that keeps case as well.
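One way to check an existing .arpa file for case is to scan its `\1-grams:` section for words that contain upper-case characters. The sketch below assumes the usual ARPA layout seen in the lmplz output in this thread (log probability, word, optional backoff, separated by whitespace). `cased_unigrams` is a hypothetical helper, not part of KenLM.

```python
from typing import Iterable, List


def cased_unigrams(arpa_lines: Iterable[str]) -> List[str]:
    """Collect unigrams from ARPA text that contain an upper-case character."""
    in_unigrams = False
    found = []
    for raw in arpa_lines:
        line = raw.strip()
        if line.startswith("\\"):
            # Section markers such as \data\, \1-grams:, \2-grams:, \end\
            in_unigrams = line == "\\1-grams:"
            continue
        if not in_unigrams or not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        word = fields[1]  # layout assumed: logprob, word, [backoff]
        if any(ch.isupper() for ch in word):
            found.append(word)
    return found


if __name__ == "__main__":
    # The model printed earlier in this thread, used as sample input.
    sample = [
        "\\data\\",
        "ngram 1=4",
        "ngram 2=2",
        "\\1-grams:",
        "-0.7781512 <unk> 0",
        "0 <s> -0.30103",
        "-0.38021123 </s> 0",
        "-0.38021123 Test -0.30103",
        "\\2-grams:",
        "-0.1497623 Test </s>",
        "-0.1497623 <s> Test",
        "\\end\\",
    ]
    print(cased_unigrams(sample))
```

For a file on disk, the same function can be called on an open file handle, for example `cased_unigrams(open("model.arpa", encoding="utf-8"))`, where `model.arpa` is a placeholder path. An empty result from a mixed-case corpus suggests the text was lowercased before it reached the toolkit.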