kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Kneser-Ney Estimation for Characters #151

Closed numericlee closed 6 years ago

numericlee commented 6 years ago

Can KenLM or Kneser-Ney estimation be adapted from modelling words to modelling characters?

My application involves the distribution of characters (say, the digits of a telephone number or the characters of a street address) rather than words, and I would like to train a model on them.

For each character position I have a probability vector produced by LeNet5, and I also have an incomplete corpus of valid phone numbers. I would like to model sequences of several characters rather than sequences of several words.

kpu commented 6 years ago

Yes. Put spaces between the letters so that each character is its own token, then run lmplz with --discount_fallback.
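
For anyone landing here later, a minimal sketch of the "put spaces between letters" step, in Python. The function name `to_char_tokens` and the `_` placeholder for original spaces are my own assumptions, not part of KenLM; the placeholder only matters for inputs like street addresses that already contain spaces, and it becomes ambiguous if your data contains a literal `_`.

```python
def to_char_tokens(line: str, space_token: str = "_") -> str:
    """Return `line` with every character as its own space-separated token.

    `space_token` is a hypothetical placeholder (not a KenLM convention) that
    stands in for whitespace in the original text, so that word boundaries
    are not lost once every character is separated by a space.
    """
    # Split on runs of whitespace, space out the characters of each piece,
    # then rejoin the pieces with the placeholder token between them.
    words = line.split()
    return f" {space_token} ".join(" ".join(word) for word in words)


if __name__ == "__main__":
    # Made-up sample inputs, for illustration only.
    for sample in ["555-0142", "12 Main St"]:
        print(to_char_tokens(sample))
```

Writing one converted line per phone number or address to a file (say `chars.txt`, a name chosen here for illustration) gives a corpus that can be passed to something like `lmplz -o 5 --discount_fallback < chars.txt > chars.arpa`. The order 5 is an arbitrary choice for the example. As far as I understand it, `--discount_fallback` is needed because a tiny vocabulary such as ten digits often lacks the count statistics lmplz uses to estimate the Kneser-Ney discounts, so it falls back to default discount values instead of stopping with an error.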