Say there is the sentence [w1, w2, w3, w4, w5], and we already have a 3-gram language model. After loading the language model, according to the source code, we should get the probability of the sentence as:

P(sentence) = P(<s>, w1, w2) * P(w1, w2, w3) * P(w2, w3, w4) * P(w3, w4, w5) * P(w4, w5, </s>)
I can understand how a factor such as P(w1, w2, w3) above follows from the Bayesian chain rule. But why should P(sentence) be calculated this way? Is there a paper I can find that supports this formula?
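To check my reading of the formula, here is a minimal sketch in Python of how I understand the scoring loop. The trigram lookup is a made-up stand-in for the real model (a real implementation would also need backoff for unseen trigrams):

    import math

    def sentence_logprob(words, trigram_logprob):
        # Pad with <s> and </s>, then sum the log-probability of
        # every 3-word window, matching the formula above.
        padded = ["<s>"] + words + ["</s>"]
        total = 0.0
        for i in range(len(padded) - 2):
            w1, w2, w3 = padded[i:i + 3]
            # Stand-in for the model lookup, e.g. log P(w3 | w1, w2).
            total += trigram_logprob(w1, w2, w3)
        return total

    # Toy lookup so the sketch runs: every trigram gets probability 0.1.
    toy = lambda a, b, c: math.log(0.1)
    print(sentence_logprob(["w1", "w2", "w3", "w4", "w5"], toy))

With the toy lookup this sums five log-probabilities, one per window, which matches the five factors in the formula above.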
Thank you!