joelgrus / data-science-from-scratch

code for Data Science From Scratch book
MIT License

Issue with spam_probability function (Ch 13, pp. 168-9) #49

Open · JDM-GBG opened this issue 6 years ago

JDM-GBG commented 6 years ago

Hi Joel,

I've run into a problem with the function that is called after the training process to evaluate a new message for spamminess. The algorithm as given goes through every word accumulated during training and updates the spam/not-spam probabilities based on whether that word appears in the message (adding log(p)) or does not (adding log(1.0 - p)). All of this, as best I can tell, is sound and correct according to the math.

Except: my post-training dictionary contains well over 80,000 words. If you're accumulating that many probabilities, then even if every one of them were 99%, by the time you combine 80,000 of them Python computes the resulting probability as 0.0. Even with a dictionary one tenth that size, 8,000 accumulated 99% probabilities come out on the order of 1e-35. (And naturally, the overwhelming majority of the word-wise probabilities are far less than 99% -- most, in fact, are below 1%.)

Put in plain English, this means that as the training set grows, the computed chance of a given message being either spam or not spam approaches zero. This can't possibly be right.

Could you let me know whether something's off with the math & algorithm given in the book? Or is my understanding of the methodology just off in the woods?