kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/
Other
2.51k stars 512 forks

Word not seen in unigrams error #330

Open m-sean opened 3 years ago

m-sean commented 3 years ago

I've recently used KenLM to build a few language models, and suddenly I started getting this error. On multiple occasions when trying to build a trie, build_binary reports that a word appearing in the trigrams wasn't seen in the unigrams.

bin/lmplz -o 3 -S 80% -T /tmp < 2009_2016-2020.txt > 2009_2016-2020.arpa
bin/build_binary -T /tmp/trie -S 80% trie 2009_2016-2020.arpa 2009_2016-2020.bin
...
kenlm/lm/read_arpa.hh:84 in void lm::ReadNGram(util::FilePiece &, const unsigned char, const Voc &, Iterator, Weights &, lm::PositiveProbWarn &) [Voc = lm::ngram::SortedVocabulary, Weights = lm::Prob, Iterator = std::__1::reverse_iterator<unsigned int *>] threw FormatLoadException because `index == 0 && (word != StringPiece("<unk>", 5)) && (word != StringPiece("<UNK>", 5))'.
Word including was not seen in the unigrams (which are supposed to list the entire vocabulary) but appears in the 3-gram at byte 18426390635 Byte: 18426390635
ERROR

But I've just checked the ARPA file, and the word is listed in the unigrams. The word "including" appears on line 12875 of the ARPA file (-4.057281 including -0.6523366), as well as in many bigrams. What could be causing this?
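One way to make the check above stricter is to compare the unigram entry exactly rather than by eye. The Python sketch below streams only the \1-grams: section and reports entries that match the word exactly, plus entries that only match after invisible control or format characters are dropped and NFKC normalization is applied. It is a diagnostic under stated assumptions, not a known cause of this error: it assumes the ARPA fields are tab-separated (probability, n-gram, optional backoff), and the names `scan_unigrams` and `normalize` are made up for this example and are not part of KenLM.

```python
import unicodedata


def normalize(token: str) -> str:
    """Drop control/format characters (e.g. zero-width spaces), then apply NFKC."""
    kept = "".join(
        ch for ch in token if unicodedata.category(ch) not in ("Cc", "Cf")
    )
    return unicodedata.normalize("NFKC", kept)


def scan_unigrams(lines, word):
    """Scan the \\1-grams: section of an ARPA file.

    Returns two lists of (line_number, repr(token)):
    exact matches, and tokens that equal `word` only after normalize().
    Hypothetical helper for illustration; not a KenLM API.
    """
    exact, lookalike = [], []
    in_unigrams = False
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line.startswith("\\"):
            # Section headers look like "\data\", "\1-grams:", "\2-grams:", "\end\".
            if in_unigrams:
                break  # reached the section after the unigrams; stop reading
            in_unigrams = line.strip() == "\\1-grams:"
            continue
        if not in_unigrams or not line.strip():
            continue
        fields = line.split("\t")  # assumed layout: prob <TAB> word [<TAB> backoff]
        if len(fields) < 2:
            continue
        token = fields[1]
        if token == word:
            exact.append((lineno, repr(token)))
        elif normalize(token) == normalize(word):
            lookalike.append((lineno, repr(token)))
    return exact, lookalike


# Example call (file name and word taken from the report above):
# with open("2009_2016-2020.arpa", encoding="utf-8", errors="surrogateescape") as f:
#     print(scan_unigrams(f, "including"))
```

Because the loop stops at the first section header after the unigrams, it never reads the higher-order n-grams, so it stays cheap even on a very large ARPA file. If the exact list is non-empty and the lookalike list is empty, the unigram entry itself is probably fine and the problem lies elsewhere.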

kmosunoff commented 2 years ago

@m-sean are there any updates on this issue? I've just hit the same error while building a binary (as a trie) from a 1.2 TB 5-gram ARPA file generated by KenLM.

m-sean commented 2 years ago

Not from me, unfortunately. I ended up working with a colleague to develop more lightweight language models from scratch. Have you tried the default data structure as well as the trie?