Hi Joel,

I've run into a problem with the function that is called after the training process to evaluate a new message for spamminess. The algorithm as given goes through every word accumulated by the training process and updates the spam/not-spam log-probabilities based on that word appearing in the message (adding log(p)) or not appearing (adding log(1.0 - p)). All this, as best I can tell, is sound & correct according to the math.
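For reference, here is roughly how I read the scoring step -- this is my own Python paraphrase, not the book's exact code, and the names (word_probs, message_words) are mine:

    import math

    def spam_probability(word_probs, message_words):
        """My paraphrase of the post-training scoring step.

        word_probs maps every vocabulary word seen in training to a pair
        (P(word | spam), P(word | not spam)).  message_words is the set
        of words in the message being classified.
        """
        log_prob_if_spam = log_prob_if_not_spam = 0.0

        for word, (p_spam, p_not_spam) in word_probs.items():
            if word in message_words:
                # Word appears in the message: accumulate log(p).
                log_prob_if_spam += math.log(p_spam)
                log_prob_if_not_spam += math.log(p_not_spam)
            else:
                # Word is absent: accumulate log(1 - p).
                log_prob_if_spam += math.log(1.0 - p_spam)
                log_prob_if_not_spam += math.log(1.0 - p_not_spam)

        # Converting back out of log space is where things go wrong for
        # me: with 80,000+ words both sums are hugely negative, and both
        # exp() calls underflow to 0.0.
        prob_if_spam = math.exp(log_prob_if_spam)
        prob_if_not_spam = math.exp(log_prob_if_not_spam)
        return prob_if_spam / (prob_if_spam + prob_if_not_spam)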
Except, my post-training dictionary contains well over 80,000 words. So if you're accumulating probabilities, even if every one of those probabilities were 99%, by the time you combine 80,000 of them the product underflows a 64-bit float and Python calculates the resulting probability as exactly 0.0. Even with a dictionary 1/10 that size, the accumulated 99% probabilities would come out on the order of 1E-35. (Naturally, the overwhelming majority of the word-wise probabilities are far less than 99% -- most, in fact, are below 1%.)
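You can verify the underflow directly: the smallest positive 64-bit float is around 5E-324, while 0.99^80,000 is on the order of 1E-349, so it collapses to zero whether you multiply directly or go through logs:

    import math

    print(0.99 ** 80_000)                     # 0.0 -- underflows to zero
    print(math.exp(80_000 * math.log(0.99)))  # 0.0 -- same result via logs
    print(0.99 ** 8_000)                      # ~1.19e-35 -- barely representable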
Put in plain English, this means that as your training vocabulary grows, the computed probability of a given message being either spam or not spam approaches zero! This can't possibly be right.
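Concretely, the failure at the end of the computation looks like this (the log-probability sums are made up, but representative of what I see):

    import math

    # Both accumulated log-probabilities are hugely negative, so both
    # exponentiations underflow to exactly 0.0 ...
    prob_if_spam = math.exp(-804.0)      # 0.0
    prob_if_not_spam = math.exp(-850.0)  # 0.0

    # ... and the final normalization divides zero by zero:
    prob_if_spam / (prob_if_spam + prob_if_not_spam)  # ZeroDivisionError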
Could you let me know whether something's off with the math & algorithm given in the book? Or is my understanding of the methodology just off in the woods?