Closed renepickhardt closed 9 years ago
I do not understand this. Can you explain it in a bit more detail? Is this still relevant?
I think we have resolved this already. The issue was the marginal case where the conditional is not seen: there we have to switch to another probability distribution for that particular set of parameters. I think we currently do backoffs, and in the twogram case an unseen history cannot exist because we do not support UNK yet.
IMO not supporting UNK does not mean we will only ever query sequences whose conditional sequence (= history?) is guaranteed to be seen. We currently handle this in class SubstituteEstimator and always do unigram smoothing. It might be possible to change this smoothing via config or argument, but I'm not a big fan of that, as it adds a ton of complexity for users.
Let's discuss this face to face in the next meeting. I thought we do backoffs, which would be different from unigram smoothing: when a history of 2 words has not been seen, one can back off to the twogram distribution instead of the unigram distribution.
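To make the difference concrete, here is a minimal sketch of backing off to a shorter history, as opposed to jumping straight to the unigram distribution. It is written in Python purely for illustration and is not the project's code: `train` and `backoff_probability` are hypothetical names, the estimate is plain MLE without discounting, and the oldest word of the history is the one dropped at each step.

```python
from collections import Counter


def train(tokens, max_order):
    """Count n-grams up to max_order and the histories that precede a word.

    histories[()] ends up as the total token count, so the unigram case
    needs no special handling later on.
    """
    ngrams = Counter()
    histories = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            ngrams[gram] += 1
            histories[gram[:-1]] += 1
    return ngrams, histories


def backoff_probability(ngrams, histories, history, word):
    """MLE estimate of P(word | history), shortening unseen histories.

    Assumption: an unseen history is shortened by dropping its oldest word
    until a seen history (or the empty history) is reached.
    """
    history = tuple(history)
    while history and histories[history] == 0:
        history = history[1:]
    if histories[history] == 0:
        # Only reachable with an empty training corpus.
        return 0.0
    return ngrams[history + (word,)] / histories[history]
```

With this sketch, a query whose two-word history is unseen is answered from the twogram counts if the last history word is known, and only falls through to the unigram counts when that is unseen as well.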
If the conditional sequence is not seen, the estimator should fall back to either the uniform distribution (1/vocabSize) or the unigram distribution of the corpus (MLE or ContCount).
This should be configurable via the config file.
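A rough sketch of what such a configurable fallback could look like, again in Python for illustration only. The function names and the `unseen_history_fallback` config key are made up here, and only the uniform and unigram-MLE options are shown (ContCount is left out).

```python
from collections import Counter


def make_fallback(tokens, kind):
    """Return a function word -> probability used when the history is unseen.

    kind is a hypothetical config value: "uniform" or "unigram".
    """
    if not tokens:
        raise ValueError("cannot build a fallback from an empty corpus")
    counts = Counter(tokens)
    vocab_size = len(counts)
    total = len(tokens)
    if kind == "uniform":
        # Every vocabulary word gets 1 / vocabSize.
        return lambda word: 1.0 / vocab_size
    if kind == "unigram":
        # Unigram MLE: relative frequency of the word in the corpus.
        return lambda word: counts[word] / total
    raise ValueError(f"unknown fallback kind: {kind!r}")


def conditional_probability(ngram_counts, history_counts, history, word, fallback):
    """MLE of P(word | history); uses the fallback if the history is unseen."""
    history = tuple(history)
    seen = history_counts.get(history, 0)
    if seen == 0:
        return fallback(word)
    return ngram_counts.get(history + (word,), 0) / seen


# Hypothetical config entry selecting the behaviour.
config = {"unseen_history_fallback": "uniform"}
fallback = make_fallback("a b a c".split(), config["unseen_history_fallback"])
```

Keeping the choice behind a single config value keeps the added complexity for users small: the default can stay at unigram smoothing, and only users who care need to touch the setting.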