renepickhardt / generalized-language-modeling-toolkit

Generalized Language Modeling toolkit
http://glm.rene-pickhardt.de

pruning of words and n-grams #12

Open renepickhardt opened 10 years ago

renepickhardt commented 10 years ago

We could prune all rare sequences and words. We also don't have the unknown-word token that other toolkits have. It would be interesting to play around with this, and to see whether it increases or rather decreases the quality of the results...

Still, pruning to the most frequent words and then using `<unk>` for unknown tokens should be implemented. Especially in modified Kneser-Ney one could focus on the gaps.
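The vocabulary-pruning step could look roughly like the following sketch. This is not code from the toolkit; the class and method names (`VocabPruning`, `applyUnk`) and the `<unk>` literal are made up for illustration:

```java
import java.util.*;
import java.util.stream.Collectors;

public class VocabPruning {

    // Keep only the vocabSize most frequent words; replace everything
    // else in the token stream with an "<unk>" marker.
    static List<String> applyUnk(List<String> tokens, int vocabSize) {
        // Count word frequencies over the whole token stream.
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens)
            counts.merge(t, 1, Integer::sum);

        // Select the vocabSize most frequent words as the vocabulary.
        Set<String> vocab = counts.entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue())
                .limit(vocabSize)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());

        // Rewrite the stream, mapping out-of-vocabulary words to <unk>.
        List<String> out = new ArrayList<>(tokens.size());
        for (String t : tokens)
            out.add(vocab.contains(t) ? t : "<unk>");
        return out;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList("a", "b", "a", "c", "a", "b");
        // With vocabSize = 2, only "a" and "b" survive; "c" becomes <unk>.
        System.out.println(applyUnk(corpus, 2));
    }
}
```

The n-gram counts would then be computed over the rewritten stream, so `<unk>` gets counts of its own like any regular word.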

But afterwards, pruning rare n-grams would also be an idea. I expect that generalized language models would still perform well this way but use less space.
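Count-based n-gram pruning could be sketched as below. Again this is illustrative only, not toolkit code; `countAndPrune` and the threshold parameter `minCount` are invented names:

```java
import java.util.*;

public class NGramPruning {

    // Count all n-grams of the given order in a token stream, then drop
    // every n-gram whose count falls below minCount to save space.
    static Map<List<String>, Integer> countAndPrune(
            List<String> tokens, int order, int minCount) {
        Map<List<String>, Integer> counts = new HashMap<>();

        // Slide a window of length `order` over the token stream.
        for (int i = 0; i + order <= tokens.size(); i++)
            counts.merge(new ArrayList<>(tokens.subList(i, i + order)),
                         1, Integer::sum);

        // Prune: remove all n-grams below the count threshold.
        counts.values().removeIf(c -> c < minCount);
        return counts;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList("a", "b", "a", "b", "c");
        // Bigrams: (a,b) x2, (b,a) x1, (b,c) x1; minCount = 2 keeps only (a,b).
        System.out.println(countAndPrune(corpus, 2, 2));
    }
}
```

For the generalized (skipped) n-grams the same threshold idea would apply per pattern; whether one global cutoff or per-order cutoffs work better is exactly the kind of thing worth experimenting with here.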

This should also be discussed with Till.

lschmelzeisen commented 9 years ago

I guess we won't have any pruning of counts for the stable release?

renepickhardt commented 9 years ago

Agreed. It might be relevant for your bachelor thesis though, because your beam search - if you use it - will do something similar, and saving disk space is an especially important goal in the GLM case.