Chapter 3: Contents of spam.df don't match output in book

ChrisHowlin commented 8 years ago

In Chapter 3 we construct a spam filter based on the data in the folder:

ML_for_Hackers/03-Classification/data/spam

In the book, the terms in these emails are sorted by occurrence using the command below, and the resulting table has html at the top:

head(spam.df[with(spam.df, order(-occurrence)),])

	term	frequency	density	occurrence
2122	html	377	0.005665595	0.338
538	body	324	0.004869105	0.298
4313	table	1182	0.017763217	0.284
1435	email	661	0.009933576	0.262
1736	font	867	0.013029365	0.262
1942	head	254	0.003817138	0.246
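For anyone unsure what the ordering command does, here is a minimal sketch on a made-up data frame. `toy.df` and its values are hypothetical and are not taken from the book's data; only the `with(..., order(-occurrence))` idiom is from the command above.

```r
# Hypothetical miniature of spam.df; the values are made up for illustration.
toy.df <- data.frame(term = c("body", "html", "table"),
                     frequency = c(3, 5, 9),
                     occurrence = c(0.30, 0.34, 0.28),
                     stringsAsFactors = FALSE)

# order() on the negated column returns row indices for a descending sort.
sorted.df <- toy.df[with(toy.df, order(-occurrence)), ]
print(sorted.df)
```

Subsetting keeps the original row names, which is why the leftmost column in the tables here shows numbers such as 2122 and 538 instead of 1, 2, 3.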

When I run the code directly, my output does not match the book's. Instead, email is at the top:

	term	frequency	density	occurrence
7781	email	813	0.005853680	0.566
18809	please	425	0.003060042	0.508
14720	list	409	0.002944840	0.444
27309	will	828	0.005961681	0.422
3060	body	379	0.002728837	0.408
9457	free	539	0.003880853	0.390

This seems to be explained by the way the document vectors are processed with the removePunctuation setting. The punctuation is deleted, so any terms that were separated only by punctuation are merged into a single new term. For example, <html><head> becomes htmlhead. The result is that instead of html being listed as a common term in many of the emails, we get lots of low-frequency combinations of html with other HTML tag keywords.
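The merging effect described above can be sketched in base R. This is only an illustration under assumptions: `strip_punct` is a hypothetical stand-in that deletes punctuation the way removePunctuation appears to, the two documents are made-up HTML snippets, and `occurrence` here is assumed to mean the fraction of documents containing a term. It is not the book's code.

```r
# Hypothetical stand-in: delete punctuation outright (assumed to mirror the
# effect of removePunctuation with default settings).
strip_punct <- function(x) gsub("[[:punct:]]+", "", x)

# Made-up snippets of HTML email markup, for illustration only.
docs <- c("<html><head><title>Offer</title></head>",
          "<html><body><table><tr><td>Free</td></tr></table></body></html>")

# Lowercase, then split on whitespace.
tokenize <- function(x) unlist(strsplit(tolower(x), "[[:space:]]+"))

# Deleting punctuation fuses adjacent tags into one long token per document.
fused <- lapply(strip_punct(docs), tokenize)

# Replacing punctuation with a space keeps the tags as separate tokens.
split_tags <- lapply(gsub("[[:punct:]]+", " ", docs), function(x) {
  tok <- tokenize(x)
  tok[nzchar(tok)]  # drop the empty token left by a leading space
})

print(fused)
print(split_tags)

# Fraction of documents that contain a given term.
occurrence <- function(tokens, term) {
  mean(sapply(tokens, function(t) term %in% t))
}
print(occurrence(fused, "html"))
print(occurrence(split_tags, "html"))
```

In this sketch html never appears as its own term once punctuation is deleted, while replacing punctuation with a space keeps it in every document. Whether the same substitution reproduces the book's table exactly has not been checked here.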

IbrahimZamit commented 8 years ago

@ChrisHowlin, what is the solution to this issue, so that we get the same results as the book?

NumberOne925 commented 6 years ago

Do you guys know why I get different results than in the book? Why is this happening?

pythonandr commented 6 years ago

@NumberOne925 I got the same results as you. I think it is normal.

johnmyleswhite / ML_for_Hackers

Chapter 3: Contents of spam.df don't match output in book #35