Open ChrisHowlin opened 8 years ago
@ChrisHowlin what is the solution for this issue, so that we get the same results as the book?
Does anyone know why I get different results than the book? Why is this happening?
@NumberOne925 I got the same result as you. I think it is normal.
In Chapter 3 we construct a spam filter based on the data in the folder:
ML_for_Hackers/03-Classification/data/spam
In the book, the terms in these emails are ordered by occurrence with the command below. The table printed in the book has html at the top.
head(spam.df[with(spam.df, order(-occurrence)),])
When I run the code directly, the output does not match the book: I get email at the top.
This seems to be explained by the way the document vectors are processed with the removePunctuation setting. The punctuation is deleted, so any terms that it separated are joined into a single new term. For example, <html><head> becomes htmlhead. The result is that instead of html being listed as a common term in many of the emails, we get many low-frequency combinations of html with other HTML tag keywords.
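A minimal sketch of the effect, using base R only. The sample string is made up, and the gsub call only approximates what removePunctuation does (an assumption, not the package's exact code). The second half shows one possible workaround, replacing punctuation with a space before tokenizing. I have not checked whether that reproduces the book's table.

```r
# Hypothetical fragment of raw email text containing adjacent HTML tags.
raw <- "<html><head>"

# Deleting punctuation outright fuses the neighbouring tag names
# into one token. This approximates removePunctuation (assumption).
deleted <- gsub("[[:punct:]]+", "", raw)
print(deleted)

# Possible workaround: turn punctuation into whitespace instead,
# so the tag names stay separate terms after splitting.
spaced <- gsub("[[:punct:]]+", " ", raw)
terms <- strsplit(trimws(spaced), "[[:space:]]+")[[1]]
print(terms)
```

With the first approach html never appears as its own term for this string, while with the second it does.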