johnmyleswhite / ML_for_Hackers

Code accompanying the book "Machine Learning for Hackers"
http://shop.oreilly.com/product/0636920018483.do

Chapter 3: Contents of spam.df don't match output in book #35

Open ChrisHowlin opened 8 years ago

ChrisHowlin commented 8 years ago

In Chapter 3 we construct a spam filter based on the data in the folder:

ML_for_Hackers/03-Classification/data/spam

In the book, the terms in these emails are ordered by occurrence with the command below. The book lists the following table with html at the top:

head(spam.df[with(spam.df, order(-occurrence)),])

      term frequency     density occurrence
2122  html       377 0.005665595      0.338
538   body       324 0.004869105      0.298
4313 table      1182 0.017763217      0.284
1435 email       661 0.009933576      0.262
1736  font       867 0.013029365      0.262
1942  head       254 0.003817138      0.246

When I run the code directly, my output does not match the book. I get email at the top instead:

        term frequency     density occurrence
7781   email       813 0.005853680      0.566
18809 please       425 0.003060042      0.508
14720   list       409 0.002944840      0.444
27309   will       828 0.005961681      0.422
3060    body       379 0.002728837      0.408
9457    free       539 0.003880853      0.390

This seems to be explained by the way the document vectors are processed with the removePunctuation setting. Punctuation is removed without inserting a space, so terms that were separated only by punctuation are merged into a new term. For example, <html><head> becomes htmlhead. The result is that instead of html being counted as a common term in many of the emails, we get lots of low-frequency combinations of html with other HTML tag names.
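
The merging effect can be reproduced in isolation. The snippet below is only a sketch: it uses base R's gsub as a stand-in for the removePunctuation step (an assumption, not the book's actual tm pipeline), and the sample string is made up. It also shows one possible workaround, replacing punctuation with a space before splitting into terms. Whether that reproduces the book's table exactly has not been verified here.

```r
# Illustration only: base R stand-ins for the punctuation-stripping step.
# `raw` is a hypothetical fragment of an HTML email, not taken from the data set.
raw <- "<html><head><title>Offer</title></head><body>"

# Deleting punctuation outright glues neighbouring tag names together,
# so the whole fragment collapses into a single term.
glued <- gsub("[[:punct:]]", "", raw)
glued_terms <- strsplit(glued, "[[:space:]]+")[[1]]

# Replacing punctuation with a space keeps each tag name as its own term.
spaced <- gsub("[[:punct:]]", " ", raw)
spaced_terms <- strsplit(trimws(spaced), "[[:space:]]+")[[1]]

print(glued_terms)   # one merged term that starts with "htmlhead"
print(spaced_terms)  # separate terms, including "html", "head" and "body"
```

With the first approach html never appears as a term of its own, which is consistent with it dropping out of the top of the occurrence table.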

IbrahimZamit commented 8 years ago

@ChrisHowlin And what seems to be the solution for this issue, in order to obtain the same results as the book?

NumberOne925 commented 6 years ago

Do you guys know why I get different results than in the book? Why is this happening?

pythonandr commented 6 years ago

@NumberOne925 I got the same result as yours. I think it is normal.