kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

How to prepare input/dataset for building a character-level n-gram LM? #281

Closed samin9796 closed 4 years ago

samin9796 commented 4 years ago

I want to build a character-level 20-gram LM. What is different in this case from building a word-level LM?

amitbcp commented 4 years ago

@samin9796 can you please share how you did this ?

samin9796 commented 4 years ago

@amitbcp For a word-level LM, you have thousands of sentences in a text file, and each sentence is a sequence of space-separated words. To prepare input for a character-level LM, you just need to separate the words into characters. For example:

A p p l e i s a f r u i t

All the characters are separated by spaces. The rest (how to build with kenlm) is exactly the same for both types of LM.
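A minimal preprocessing sketch of the step described above (the function name is hypothetical, not part of kenlm): each line is split into individual characters, and the original spaces are dropped, so word boundaries are lost.

```python
def to_char_tokens(line):
    # Emit every non-space character, separated by spaces,
    # so kenlm's lmplz treats each character as a "word".
    return " ".join(ch for ch in line if not ch.isspace())

print(to_char_tokens("Apple is a fruit"))
# A p p l e i s a f r u i t
```

The resulting file can then be piped into lmplz exactly as with word-level training data.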

kpu commented 4 years ago

Bonus points for mapping space to a token like <space>
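A sketch of that variant (function name hypothetical): spaces are kept but rewritten as an explicit `<space>` token, so the character LM can still model word boundaries.

```python
def to_char_tokens(line, space_token="<space>"):
    # Replace each literal space with a reserved token instead of
    # dropping it, so word boundaries survive in the character stream.
    return " ".join(space_token if ch == " " else ch for ch in line.strip())

print(to_char_tokens("Apple is a fruit"))
# A p p l e <space> i s <space> a <space> f r u i t
```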

amitbcp commented 4 years ago

Thanks @samin9796 @kpu