Closed GoogleCodeExporter closed 9 years ago
I will also note that words that appear in every document will get an IDF of 0.
This seems a bit harsh and seems a likely problem for calculating TF-IDF on
small corpora in which it may be very likely that high frequency words such as
"the" will appear in every document. Should "the" have zero tf-idf? I doubt
it.
An obvious fix is to add 2 to the numerator instead of 1. That may not be very
principled fix - but it is easy and it works!
Original comment by phi...@ogren.info
on 8 Dec 2013 at 5:36
+1 for getting rid of the integer division bug
-1 for redefining TF-IDF to use +2 instead of +1
I think we want to be following the standard definition (e.g.
http://en.wikipedia.org/wiki/Tf%E2%80%93idf) as much as possible. Following
that logic, why are we adding 1 to the total document count. If it's the fear
of the total document count being zero, shouldn't trying to calculate TF-IDF
over zero documents throw an exception rather than failing silently?
Original comment by steven.b...@gmail.com
on 8 Dec 2013 at 10:46
Well, I think it is typical of add-one smoothing. The reason we add one to the
denominator (the number of documents containing the word) is so that words that
have never been seen won't cause a NaN. If you add one to the denominator but
not the numerator, then you can end up with an IDF that is negative if a word
appears in every document. It seems unacceptable to emit a negative IDF value.
However, its not much better to get a zero value IDF for words that appear in
every document. Adding 2 to the numerator seems like the easiest way to make
sure we always get a positive IDF score. I imagine one could write a master's
thesis reviewing more sophisticated smoothing approaches.
Here's an equally simplistic/naive approach that might be less invasive and
closer to wikipedia. We could simply handle the edge cases for document
frequency as exceptions - i.e. if a the document frequency of a word is 0, then
add one and if it is equal to the total document count, then subtract one.
Original comment by phi...@ogren.info
on 9 Dec 2013 at 4:05
I see your point about the negative IDF and agree that would be bad.
I don't feel super strongly about this (other than whatever path we choose,
there should be detailed comments explaining why it's necessary). So feel free
to either special-case the exceptions, or use your +2 approach.
Original comment by steven.b...@gmail.com
on 9 Dec 2013 at 4:14
Ok - I think I will stick to the approach that adds 2 to the numerator and 1 to
the denominator. I don't like the edge case approach because then you will get
the same IDF for words that appear once as you will get for unseen words. I
will document it carefully.
Original comment by phi...@ogren.info
on 9 Dec 2013 at 4:23
This issue was closed by revision 0e89195604fe.
Original comment by phi...@ogren.info
on 9 Dec 2013 at 4:43
Original issue reported on code.google.com by
phi...@ogren.info
on 8 Dec 2013 at 5:26