kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Is there any way to create character-level language model instead of word-level? #290

Closed DRosemei closed 4 years ago

DRosemei commented 4 years ago

I used sentences like this:

"T h e y h a v e t h e b l o o d o f m a r t y r s i s t h e W h i t e t o f l o w"

But an error occurred:

=== 1/5 Counting and sorting n-grams ===
Reading /data00/home/meiruohong/projects/language_model/kenlm/build/lm_path/letter.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Unigram tokens 9900 types 85
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:1020 2:1292080128 3:2422650368 4:3876240384 5:5652851200
/home/meiruohong/projects/language_model/kenlm/lm/builder/adjustcounts.cc:60 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `discounts_[i].amount[j] < 0.0 || discounts_[i].amount[j] > j'.
ERROR: 1-gram discount out of range for adjusted count 3: -1.173913
Aborted (core dumped)

So is there any way to create a character-level language model instead of a word-level one?

kpu commented 4 years ago

--discount_fallback will let you make discounts up. I've added it to the error message.
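
For readers who drive training from a script, here is a minimal sketch of passing the flag to `lmplz`. It assumes `lmplz` is on the PATH and that the order is 5 (suggested by the five chain sizes in the log above). The helper names `build_lmplz_command` and `train_char_lm`, and the file names in the usage comment, are hypothetical. The likely reason the flag is needed is that a character corpus has very few unigram types (85 here), which leaves too little data to estimate the discounts, but treat that as an interpretation rather than something stated in this thread.

```python
import subprocess


def build_lmplz_command(text_path, arpa_path, order=5, lmplz="lmplz"):
    """Build an lmplz argument list that enables --discount_fallback.

    The default order and the executable name are assumptions for this sketch.
    """
    return [
        lmplz,
        "-o", str(order),        # n-gram order
        "--discount_fallback",   # use fallback discounts when estimation fails
        "--text", text_path,     # read the training text from a file
        "--arpa", arpa_path,     # write the ARPA model to a file
    ]


def train_char_lm(text_path, arpa_path, order=5):
    """Run lmplz and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_lmplz_command(text_path, arpa_path, order), check=True)


# Example (not executed here):
# train_char_lm("letter.txt", "letter.arpa")
```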

DRosemei commented 4 years ago

Thanks:)

DRosemei commented 4 years ago

Hi @kpu, I used a character-level language model and a word-level language model to predict strings, and I found that the character-level model could not predict spaces between characters, while the word-level model could. Is there a problem with how I set this up?

kpu commented 4 years ago

Make up a token for space between words.
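
A minimal sketch of that preprocessing step, assuming whitespace-separated input text. The token name `<space>` and both function names are made up for this illustration; any token that cannot collide with a real character would do, as long as it is not one of KenLM's reserved tokens (`<s>`, `</s>`, `<unk>`).

```python
SPACE_TOKEN = "<space>"  # hypothetical name; multi-character, so it cannot clash with a single character


def to_char_tokens(line, space_token=SPACE_TOKEN):
    """Turn 'They have' into 'T h e y <space> h a v e'."""
    tokens = []
    for i, word in enumerate(line.split()):
        if i > 0:
            tokens.append(space_token)  # mark the word boundary explicitly
        tokens.extend(word)             # one token per character
    return " ".join(tokens)


def from_char_tokens(tokens, space_token=SPACE_TOKEN):
    """Invert to_char_tokens: map the space token back to a real space."""
    return "".join(" " if t == space_token else t for t in tokens.split())
```

Training on the output of `to_char_tokens` gives the model an explicit symbol for word boundaries, so it can assign probability to a space like any other character.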

DRosemei commented 4 years ago

Thanks and I have already done this:)

RamanHacks commented 3 years ago

@DRosemei were you able to train a character-level language model for English? If yes, can you please share it?

MagedSaeed commented 1 year ago

I just added the --discount_fallback option and it worked!