smassung / text-data-book-comments

Comments, errata, suggestions, and issues for the book "Text Data Analysis and Management: A Practical Introduction to Text Mining and Information Retrieval"

Chapter 12 equation p90 and general remarks #4

Open duf59 opened 9 years ago

duf59 commented 9 years ago

In section 12.3.2, on page 90, the third line of the equation reads: = argmax(y, x1, ..., xn). Shouldn't the first comma be replaced by a dot? I.e. = argmax(y.x1, ..., xn).

I also have a suggestion regarding section 12.3.3, which introduces linear classifiers. I would recommend adding "An Introduction to Statistical Learning" as a reference; it has a very nice explanation of support vector classifiers (see page 344). It describes the maximum margin classifier you plot in figure 12.3 and then introduces soft margins and kernel-based SVMs.

From a general viewpoint, I find the chapter nicely written and very clear. However, I already knew most of the concepts, and I realize that people not familiar with statistical learning methods are given few references to dig deeper. Some references to basic textbooks would help (for kNN, naive Bayes, cross-validation procedures and SVM). A last point is that the section on Evaluation of Text Categorization (12.4) is quite short. Do you plan to introduce some elementary measures like false/true positives/negatives, ROC curves, etc.? (Or maybe a reference describing in more detail the measures that can be applied to contingency tables, in the context of text retrieval or more generally.)
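To make the request concrete, the elementary measures mentioned above can be sketched from a binary contingency table as follows. This is only a minimal illustration: the names `Confusion`, `confusion`, `precision`, `recall`, `f1`, and `false_positive_rate` and the toy labels are hypothetical and do not come from the book.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts of a binary contingency (confusion) table."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives


def confusion(y_true, y_pred, positive=1):
    """Count TP/FP/FN/TN, treating `positive` as the target category."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return Confusion(tp, fp, fn, tn)


def precision(c):
    """Fraction of predicted positives that are correct."""
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c):
    """Fraction of actual positives that were found (true positive rate)."""
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def f1(c):
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def false_positive_rate(c):
    """Fraction of actual negatives wrongly labeled positive (ROC x-axis)."""
    return c.fp / (c.fp + c.tn) if (c.fp + c.tn) else 0.0


# Toy labels, invented for illustration only.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
table = confusion(y_true, y_pred)
print(table, precision(table), recall(table), f1(table))
```

A single (false positive rate, recall) pair gives one point on a ROC curve; sweeping the classifier's decision threshold and recomputing the table would trace out the rest of the curve.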

smassung commented 9 years ago

Thanks for the feedback!

For the Naive Bayes catch, yes you are definitely right. I forgot to pull p(y) out of that statement somehow. I've fixed it and will release an updated version.
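For reference, the standard Naive Bayes decision rule, with p(y) pulled out as a separate factor, can be written as below. This is the usual textbook form under the conditional-independence assumption; the symbols $\hat{y}$, $y$, and $x_1, \ldots, x_n$ are assumed here and may not match the book's notation on page 90.

$$
\begin{aligned}
\hat{y} &= \arg\max_{y} \; p(y \mid x_1, \ldots, x_n) \\
        &= \arg\max_{y} \; \frac{p(x_1, \ldots, x_n \mid y)\, p(y)}{p(x_1, \ldots, x_n)} \\
        &= \arg\max_{y} \; p(y)\, p(x_1, \ldots, x_n \mid y) \\
        &= \arg\max_{y} \; p(y) \prod_{i=1}^{n} p(x_i \mid y)
\end{aligned}
$$

The denominator is dropped in the third line because it does not depend on $y$, and the last line applies the assumption that the features are independent given the class.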

I also agree with the comment on more references for the linear classifier part. Your reference is a good one, and I'll keep it in mind. In general, the book is currently very light on references. When we do our second draft pass, we will be adding many more. For now, it's mostly an attempt to get some content in the pages!

Since we are releasing the draft chapters out of order, you will miss a little of the context. There will be a chapter called Search Engine Evaluation in the first half of the book which will introduce the bulk of the evaluation motivation and context. The evaluation section in the text categorization chapter is meant to touch on concepts not mentioned earlier. That said, you bring up a good point: we might want to revisit some of the other evaluation concepts, both 1) to reinforce them and 2) in case the previous chapter isn't read.

Again, thanks for the comments. I'll leave this issue open, since your last two comments will have to wait until we pass through the book again and can examine them further.

P.S. Another missing part of this chapter is the Applications section (which should come after Evaluation). It will also be added later!