kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Don't prune n-grams containing certain words #214

Closed PCerles closed 5 years ago

PCerles commented 5 years ago

Hi, I have a very large corpus that I want to train an n-gram language model on. I want to prune the model for efficient STT decoding, but I don't want to prune any n-grams that contain certain keywords. Is there a way to do this directly with KenLM?

kpu commented 5 years ago

This requires a code modification. In https://github.com/kpu/kenlm/blob/master/lm/builder/adjust_counts.cc, have it skip marking the n-grams you want to keep. Have fun!
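As a rough illustration of the kind of check this suggests, here is a minimal standalone sketch. It is not KenLM code: `NGram`, `ContainsProtectedWord`, and `ShouldMarkForPruning` are hypothetical names, and the real builder works on vocabulary ids and its own record types, so the protected words would first need to be mapped to ids. It also assumes an n-gram is marked when its count is less than or equal to the pruning threshold.

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for an n-gram record. KenLM's builder uses its own
// types and vocabulary ids rather than strings; this is illustration only.
struct NGram {
  std::vector<std::string> words;
  std::uint64_t count;
};

// Returns true if any word in the n-gram is in the protected set.
bool ContainsProtectedWord(
    const NGram &gram,
    const std::unordered_set<std::string> &protected_words) {
  for (const std::string &word : gram.words) {
    if (protected_words.count(word) != 0) return true;
  }
  return false;
}

// Decides whether an n-gram should be marked for pruning.
// Assumption: an n-gram is marked when count <= threshold, unless it
// contains a protected word, in which case it is never marked.
bool ShouldMarkForPruning(
    const NGram &gram,
    std::uint64_t threshold,
    const std::unordered_set<std::string> &protected_words) {
  if (ContainsProtectedWord(gram, protected_words)) return false;
  return gram.count <= threshold;
}
```

With a protected set containing `"kenlm"` and a threshold of 2, an n-gram such as `train kenlm` with count 1 would be kept, while `the cat` with count 1 would be marked. An equivalent guard would have to be placed wherever adjust_counts.cc currently decides to mark an entry.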

PCerles commented 5 years ago

Thanks!