renepickhardt / generalized-language-modeling-toolkit

Generalized Language Modeling toolkit
http://glm.rene-pickhardt.de

make distribution for non existing parameters configurable #59

Closed renepickhardt closed 9 years ago

renepickhardt commented 10 years ago

If the conditional sequence has not been seen, the fallback should be either the uniform distribution (1/vocabSize) or the unigram distribution of the corpus (MLE or ContCount).

This should be configurable via the config file.
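
For illustration only, a minimal Python sketch of the two fallbacks named above, applied to a plain MLE bigram estimate. The function name `make_bigram_model` and the `fallback` option are hypothetical and are not part of the toolkit or its config format; the ContCount variant of the unigram distribution is not covered here.

```python
from collections import Counter


def make_bigram_model(tokens, fallback="uniform"):
    """Build P(word | history) with a selectable fallback for unseen histories.

    Hypothetical helper for illustration; `fallback` is "uniform" or "unigram".
    """
    if fallback not in ("uniform", "unigram"):
        raise ValueError("fallback must be 'uniform' or 'unigram'")

    unigrams = Counter(tokens)                  # c(w)
    bigrams = Counter(zip(tokens, tokens[1:]))  # c(h, w)
    histories = Counter(tokens[:-1])            # c(h, *): h followed by any word
    total = len(tokens)
    vocab_size = len(unigrams)

    def prob(word, history):
        if histories[history] > 0:
            # History was seen: ordinary MLE estimate.
            return bigrams[(history, word)] / histories[history]
        if fallback == "uniform":
            # Unseen history: every vocabulary word is equally likely.
            return 1.0 / vocab_size
        # Unseen history: fall back to the unigram MLE distribution.
        return unigrams[word] / total

    return prob
```

With the tokens `a b a c a b`, the seen history `a` gives P(b | a) = 2/3 under either setting, while an unseen history such as `z` gives 1/3 for every word under "uniform" and c(w)/6 under "unigram".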

lschmelzeisen commented 9 years ago

I do not understand this, can you explain it with a bit more detail? Is this still relevant?

renepickhardt commented 9 years ago

I think we have resolved this already. The issue was the marginal case where the conditional is not seen: we then have to switch to another probability distribution for that particular set of parameters. I think we currently do backoffs, and in the twogram case an unseen history cannot exist because we do not support UNK yet.

lschmelzeisen commented 9 years ago

IMO not supporting UNK does not mean we will only ever query sequences that guarantee that the conditional sequence (= history?) is seen. We currently handle this in class SubstituteEstimator and always do Unigram smoothing. It might be possible to change this smoothing via config or argument, but I'm not a big fan of that as it adds a ton of complexity for users.

renepickhardt commented 9 years ago

Let's discuss this face to face in the next meeting. I thought we do backoffs, which would be different from unigram smoothing: when a history of 2 words has not been seen, one can back off to the twogram distribution instead of the unigram distribution.
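
To make the difference concrete, here is a rough Python sketch of backing off by shortening an unseen history, as opposed to jumping straight to the unigram distribution. It is an assumption-laden illustration, not what SubstituteEstimator does: the names `ngram_counts` and `backoff_probability` are hypothetical, the estimates are plain MLE, and no discounting or backoff weights are applied.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def backoff_probability(tokens, history, word):
    """MLE estimate of P(word | history), dropping the oldest history word
    until the remaining history has been seen (hypothetical helper)."""
    history = tuple(history)
    while history:
        n = len(history)
        counts = ngram_counts(tokens, n + 1)
        # How often this history was followed by any word.
        history_count = sum(c for gram, c in counts.items() if gram[:n] == history)
        if history_count > 0:
            return counts[history + (word,)] / history_count
        history = history[1:]  # unseen: back off to a shorter history
    # Empty history: unigram MLE.
    return tokens.count(word) / len(tokens)
```

For the tokens `the cat sat on the mat`, the unseen two-word history `dog the` backs off to the history `the`, so P(cat | dog the) = 1/2, whereas going directly to the unigram distribution would give 1/6.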