Open lokesh-vr-17773 opened 1 month ago
The correct way to calculate P(token | spam) is:
Message - 1 (spam message)
bitcoin bitcoin bitcoin bitcoin bitcoin testing
Message - 2 (spam message)
bitcoin bitcoin bitcoin bitcoin bitcoin testing
Message - 3 (ham message)
A genuine mail
In brief:
spam messages = 2, ham messages = 1
total spam tokens = 12, total ham tokens = 3
count of tokens in spam messages = { 'bitcoin': 10, 'testing': 2 }
count of tokens in ham messages = { 'a': 1, 'genuine': 1, 'mail': 1 }
P(bitcoin | spam)
=> count of bitcoin in spam messages / total count of spam tokens
=> 10 / 12
=> 0.833
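The calculation above can be sketched in a few lines of Python. This is only an illustration of the proposed token-frequency approach, not the repository's code; the helper names `token_counts` and `token_probability` are hypothetical, and tokenization is assumed to be a simple lowercase whitespace split.

```python
from collections import Counter

# The three example messages from this issue.
spam_messages = [
    "bitcoin bitcoin bitcoin bitcoin bitcoin testing",
    "bitcoin bitcoin bitcoin bitcoin bitcoin testing",
]
ham_messages = ["A genuine mail"]


def token_counts(messages):
    """Count every token occurrence across messages (hypothetical helper)."""
    counts = Counter()
    for message in messages:
        # Assumption: lowercase + whitespace split stands in for the real tokenizer.
        counts.update(message.lower().split())
    return counts


def token_probability(token, counts):
    """P(token | class) = occurrences of token / total tokens in that class."""
    total = sum(counts.values())
    return counts[token] / total if total else 0.0


spam_counts = token_counts(spam_messages)
ham_counts = token_counts(ham_messages)

p_bitcoin_spam = token_probability("bitcoin", spam_counts)
print(round(p_bitcoin_spam, 3))  # 0.833
```

Because the numerator is always one term of the sum in the denominator, this ratio cannot exceed 1.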
In the `_probabilities` method, the probabilities might go over 1 for this case. Consider three messages in our training dataset, of which one is ham and the remaining two are spam. The spam messages contain 'bitcoin' multiple times; let's say the count of the word `bitcoin` in spam messages is 10. In brief, ham messages = 1, spam messages = 2, count of `bitcoin` token = 10. Then the computed value of P(bitcoin | spam) comes out to 3.5.
Since probabilities cannot go above 1, how should we interpret 3.5 in this case?
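For reference, here is a minimal sketch of how a value of 3.5 could arise. It assumes (the formula is not quoted in this issue) that `_probabilities` applies smoothing of the form (count + k) / (number of messages + 2k) with k = 0.5; the function name `smoothed_ratio` is hypothetical.

```python
def smoothed_ratio(token_count, n_messages, k=0.5):
    """Assumed smoothing formula: (count + k) / (messages + 2k).

    This is an assumption about the method under discussion, chosen because
    it reproduces the 3.5 reported in this issue.
    """
    return (token_count + k) / (n_messages + 2 * k)


# 10 occurrences of 'bitcoin' spread over 2 spam messages.
value = smoothed_ratio(token_count=10, n_messages=2)
print(value)  # 3.5
```

Under that assumption, the result exceeds 1 because the numerator counts token occurrences while the denominator counts messages, so the two are not on the same scale.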