noushadali / cleartk

Automatically exported from code.google.com/p/cleartk

TF-IDF uses integer division for calculating IDF #396

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
The method TfidfExtractor.IDFMap.getIDF(String) calculates IDF with the 
following:

int df = this.getDF(term);
return Math.log((this.totalDocumentCount + 1) / (df + 1));

The expression "(this.totalDocumentCount + 1) / (df + 1)" is actually integer 
division.  This is probably fine for most words, which appear in a relatively 
small percentage of the documents, but it won't provide a very smooth curve for 
high-frequency words.  For example, if you have 50,000 documents and a word 
that occurs in 25,000 of them, the IDF will be log(2), and so will the IDF of a 
word that occurs in 18,000 documents (because 50,001 / 18,001 truncates to 2).  

Now that I actually write this up, it seems clear that it is a bug.  I noticed 
it because the integer math makes it difficult to get a non-zero value in a 
simple unit test - but initially I thought that wasn't a very good use case to 
justify the change.  

I propose casting one of the operands to a double to fix this.  
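A minimal standalone sketch of the truncation and the proposed cast (the class and method names here are illustrative, not the actual cleartk TfidfExtractor.IDFMap code):

```java
// Hypothetical demo of the bug and fix; not the actual cleartk class.
public class IdfDivisionDemo {

  // Current behavior: both operands are ints, so the quotient is truncated
  // before Math.log ever sees it.
  static double buggyIdf(int totalDocumentCount, int df) {
    return Math.log((totalDocumentCount + 1) / (df + 1));
  }

  // Proposed fix: casting one operand to double forces floating-point division.
  static double fixedIdf(int totalDocumentCount, int df) {
    return Math.log((totalDocumentCount + 1) / (double) (df + 1));
  }

  public static void main(String[] args) {
    // 50,000 documents: words in 25,000 and 18,000 documents get the same
    // buggy IDF, log(2), because both quotients truncate to 2.
    System.out.println(buggyIdf(50000, 25000)); // 0.6931...
    System.out.println(buggyIdf(50000, 18000)); // 0.6931... (same!)
    System.out.println(fixedIdf(50000, 18000)); // 1.0216... (now distinct)
  }
}
```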

Original issue reported on code.google.com by phi...@ogren.info on 8 Dec 2013 at 5:26

GoogleCodeExporter commented 9 years ago
I will also note that words that appear in every document will get an IDF of 0. 
This seems a bit harsh, and it is a likely problem when calculating TF-IDF on 
small corpora, where high-frequency words such as "the" may well appear in 
every document.  Should "the" have zero tf-idf?  I doubt it.  

An obvious fix is to add 2 to the numerator instead of 1.  That may not be a 
very principled fix - but it is easy and it works!

Original comment by phi...@ogren.info on 8 Dec 2013 at 5:36

GoogleCodeExporter commented 9 years ago
+1 for getting rid of the integer division bug
-1 for redefining TF-IDF to use +2 instead of +1

I think we want to follow the standard definition (e.g. 
http://en.wikipedia.org/wiki/Tf%E2%80%93idf) as much as possible. Following 
that logic, why are we adding 1 to the total document count? If it's out of 
fear that the total document count could be zero, shouldn't trying to 
calculate TF-IDF over zero documents throw an exception rather than fail 
silently?

Original comment by steven.b...@gmail.com on 8 Dec 2013 at 10:46

GoogleCodeExporter commented 9 years ago
Well, I think it is typical add-one smoothing.  The reason we add one to the 
denominator (the number of documents containing the word) is so that words 
that have never been seen won't cause a division-by-zero error.  But if you 
add one to the denominator and not the numerator, then a word that appears in 
every document ends up with a negative IDF, since log(N / (N + 1)) < 0.  It 
seems unacceptable to emit a negative IDF value.  However, it's not much 
better to get a zero IDF for words that appear in every document.  Adding 2 to 
the numerator seems like the easiest way to make sure we always get a positive 
IDF score.  I imagine one could write a master's thesis reviewing more 
sophisticated smoothing approaches.  

Here's an equally simplistic/naive approach that might be less invasive and 
closer to Wikipedia: handle the edge cases for document frequency explicitly - 
i.e. if the document frequency of a word is 0, then add one, and if it is 
equal to the total document count, then subtract one.  
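That edge-case approach could be sketched like this (hypothetical names, not a cleartk API; assumes a corpus of at least two documents, since with N = 1 the clamp would push df to 0):

```java
// Hypothetical sketch of the edge-case alternative: clamp df away from the
// boundaries so the textbook log(N / df) stays finite and positive.
// Assumes totalDocumentCount >= 2.
public class ClampedIdfDemo {

  static double clampedIdf(int totalDocumentCount, int df) {
    if (df == 0) {
      df = 1; // unseen word: pretend it was seen once
    } else if (df == totalDocumentCount) {
      df = totalDocumentCount - 1; // word in every document: back off by one
    }
    return Math.log(totalDocumentCount / (double) df);
  }

  public static void main(String[] args) {
    // Drawback: unseen words (df = 0) and words seen exactly once (df = 1)
    // now get identical IDF scores.
    System.out.println(clampedIdf(100, 0));   // log(100)
    System.out.println(clampedIdf(100, 1));   // log(100) as well
    System.out.println(clampedIdf(100, 100)); // log(100/99), small but positive
  }
}
```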

Original comment by phi...@ogren.info on 9 Dec 2013 at 4:05

GoogleCodeExporter commented 9 years ago
I see your point about the negative IDF and agree that would be bad.

I don't feel super strongly about this (other than whatever path we choose, 
there should be detailed comments explaining why it's necessary). So feel free 
to either special-case the exceptions, or use your +2 approach.

Original comment by steven.b...@gmail.com on 9 Dec 2013 at 4:14

GoogleCodeExporter commented 9 years ago
Ok - I think I will stick to the approach that adds 2 to the numerator and 1 to 
the denominator.  I don't like the edge case approach because then you will get 
the same IDF for words that appear once as you will get for unseen words.  I 
will document it carefully.  
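For reference, the agreed fix can be sketched as follows (illustrative class and parameter names, not the actual cleartk signature):

```java
// Sketch of the resolution: +2 in the numerator, +1 in the denominator,
// with floating-point division.
public class SmoothedIdfDemo {

  static double getIDF(int totalDocumentCount, int df) {
    // df + 1 avoids division by zero for unseen words;
    // totalDocumentCount + 2 keeps the ratio above 1 (and the IDF positive)
    // even when a word appears in every document (df == totalDocumentCount).
    return Math.log((totalDocumentCount + 2) / (double) (df + 1));
  }

  public static void main(String[] args) {
    int n = 50000;
    System.out.println(getIDF(n, n)); // tiny but strictly positive
    System.out.println(getIDF(n, 0)); // unseen word: log(50002 / 1)
  }
}
```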

Original comment by phi...@ogren.info on 9 Dec 2013 at 4:23

GoogleCodeExporter commented 9 years ago
This issue was closed by revision 0e89195604fe.

Original comment by phi...@ogren.info on 9 Dec 2013 at 4:43