joelgrus / data-science-from-scratch

code for Data Science From Scratch book
MIT License

[Bug] Incorrect implementation of conditional probability in Naive Bayes classifier #129

Open lokesh-vr-17773 opened 1 month ago

lokesh-vr-17773 commented 1 month ago

In the `_probabilities` method, the computed probability can exceed 1 in the following case.

Consider a training dataset with three messages, of which one is ham and the remaining two are spam. The spam messages contain the word 'bitcoin' multiple times; say the total count of the 'bitcoin' token across the spam messages is 10. In brief:

ham messages = 1, spam messages = 2, count of 'bitcoin' token = 10

then,

p_token_spam = (spam + self.k) / (self.spam_messages + 2 * self.k)  # k -> smoothing factor = 0.5
p_token_spam = (10 + 0.5) / (2 + 2 * 0.5) = 10.5 / 3 = 3.5

Since probabilities cannot go above 1, how should we interpret 3.5 in this case?
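
A minimal sketch that reproduces the arithmetic above (the standalone function is hypothetical, not the book's class; `spam`, `spam_messages`, and `k` mirror the names in the formula):

```python
def p_token_spam(spam: int, spam_messages: int, k: float = 0.5) -> float:
    """Smoothed estimate from the formula above: (spam + k) / (spam_messages + 2 * k)."""
    return (spam + k) / (spam_messages + 2 * k)

# With the counts from the example (10 occurrences of 'bitcoin'
# across only 2 spam messages), the "probability" exceeds 1:
print(p_token_spam(spam=10, spam_messages=2))  # 3.5
```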

lokesh-vr-17773 commented 1 month ago

The correct way to calculate P(token | spam) is as follows.

Message - 1 (spam message)

bitcoin bitcoin bitcoin bitcoin bitcoin testing

Message - 2 (spam message)

bitcoin bitcoin bitcoin bitcoin bitcoin testing

Message - 3 (ham message)

A genuine mail

In brief:

spam messages = 2, ham messages = 1
total spam tokens = 12, total ham tokens = 3
count of tokens in spam messages = { 'bitcoin': 10, 'testing': 2 }
count of tokens in ham messages = { 'a': 1, 'genuine': 1, 'mail': 1 }

P(bitcoin | spam)
=> count of 'bitcoin' in spam messages / total count of spam tokens
=> 10 / 12
=> 0.833
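
A minimal sketch of this proposed estimate (the function name and the whitespace tokenizer are mine for illustration, not the book's code):

```python
from collections import Counter
from typing import List

def p_token_given_spam(token: str, spam_messages: List[str]) -> float:
    """Occurrences of token across all spam messages,
    divided by the total number of spam tokens."""
    counts = Counter(word for message in spam_messages
                          for word in message.lower().split())
    return counts[token] / sum(counts.values())

spam = ["bitcoin bitcoin bitcoin bitcoin bitcoin testing",
        "bitcoin bitcoin bitcoin bitcoin bitcoin testing"]
print(p_token_given_spam("bitcoin", spam))  # 10 / 12 ≈ 0.833
```

With smoothing, this would typically become (count + k) / (total spam tokens + k * vocabulary size), which keeps the estimate strictly between 0 and 1.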