joelgrus / data-science-from-scratch

code for Data Science From Scratch book
MIT License

[Bug] Incorrect implementation of conditional probability in Naive Bayes classifier #129

Open lokesh-vr-17773 opened 1 month ago

lokesh-vr-17773 commented 1 month ago

In the `_probabilities` method, the computed probability can exceed 1 in the following case.

Consider a training dataset with three messages, of which one is ham and the remaining two are spam. The spam messages contain the word 'bitcoin' multiple times; say the total count of the 'bitcoin' token across the spam messages is 10. In brief:

ham messages = 1, spam messages = 2, count of 'bitcoin' token = 10

then,

p_token_spam = (spam + self.k) / (self.spam_messages + 2 * self.k)  # k -> smoothing factor = 0.5
p_token_spam = (10 + 0.5) / (2 + 2 * 0.5) = 10.5 / 3 = 3.5

Since probabilities cannot go above 1, how should we interpret 3.5 in this case?
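
A minimal sketch that reproduces the arithmetic above (the standalone function is hypothetical, not the book's class; `spam`, `spam_messages`, and `k` mirror the names in the formula):

```python
def p_token_spam(spam: int, spam_messages: int, k: float = 0.5) -> float:
    """Smoothed estimate from the formula above: (spam + k) / (spam_messages + 2 * k)."""
    return (spam + k) / (spam_messages + 2 * k)

# With the counts from the example (10 occurrences of 'bitcoin'
# across only 2 spam messages), the "probability" exceeds 1:
print(p_token_spam(spam=10, spam_messages=2))  # 3.5
```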

lokesh-vr-17773 commented 1 month ago

The correct way to calculate P(token | spam) is as follows.

Message - 1 (spam message)

bitcoin bitcoin bitcoin bitcoin bitcoin testing

Message - 2 (spam message)

bitcoin bitcoin bitcoin bitcoin bitcoin testing

Message - 3 (ham message)

A genuine mail

In brief:

spam messages = 2, ham messages = 1
total spam tokens = 12, total ham tokens = 3
count of tokens in spam messages = { 'bitcoin': 10, 'testing': 2 }
count of tokens in ham messages = { 'a': 1, 'genuine': 1, 'mail': 1 }

P(bitcoin | spam)
=> count of 'bitcoin' in spam messages / total count of spam tokens
=> 10 / 12
=> 0.833
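
A minimal sketch of this proposed estimate (the function name and the whitespace tokenizer are mine for illustration, not the book's code):

```python
from collections import Counter
from typing import List

def p_token_given_spam(token: str, spam_messages: List[str]) -> float:
    """Occurrences of token across all spam messages,
    divided by the total number of spam tokens."""
    counts = Counter(word for message in spam_messages
                          for word in message.lower().split())
    return counts[token] / sum(counts.values())

spam = ["bitcoin bitcoin bitcoin bitcoin bitcoin testing",
        "bitcoin bitcoin bitcoin bitcoin bitcoin testing"]
print(p_token_given_spam("bitcoin", spam))  # 10 / 12 ≈ 0.833
```

With smoothing, this would typically become (count + k) / (total spam tokens + k * vocabulary size), which keeps the estimate strictly between 0 and 1.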