mkdynamic / vss

Simple vector space search engine
http://madeofcode.com/posts/69-vss-a-vector-space-search-engine-in-ruby
MIT License

Issues with tf*idf implementation #2

Open jpmckinney opened 11 years ago

jpmckinney commented 11 years ago

Your implementation uses plain term and document frequencies, with no damping or normalization (which, as far as I can tell, never occurs in the academic literature). My gem, on the other hand, uses the same formula as Lucene and other major implementations. See https://github.com/opennorth/tf-idf-similarity

mkdynamic commented 11 years ago

Thanks for flagging. Since I don't use the gem anymore myself, I'm unlikely to invest the time to address this anytime soon. However, I'd welcome a patch :)

Could you expand briefly on the impact of normalization and damping?

jpmckinney commented 11 years ago

I'd invite you to read about tf*idf implementations (there are links to references in my gem's README), but briefly, if you perform a similarity search without any normalization, you will have too strong a bias towards:

As for your calculation of IDF, every reference I've found takes the log of that term, so I'm not quite sure how you came to your implementation.
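To make the difference concrete, here is a minimal Ruby sketch of the ideas discussed above: a raw IDF ratio next to a log-scaled one, a damped term frequency, and cosine similarity as one form of length normalization. All method names are hypothetical and are not taken from vss or tf-idf-similarity, and the square-root damping is just one common choice, assumed here for illustration.

```ruby
# Hypothetical helpers for illustration only; not the API of vss
# or of the tf-idf-similarity gem.

# Raw IDF: the plain ratio of corpus size to document frequency.
# A term found in 1 of 1000 documents gets a weight of 1000.
def raw_idf(total_docs, docs_with_term)
  total_docs.to_f / docs_with_term
end

# Log-scaled IDF: taking the natural log of the same ratio damps it,
# so rare terms still score higher but no longer dominate.
def log_idf(total_docs, docs_with_term)
  Math.log(total_docs.to_f / docs_with_term)
end

# Damped term frequency. The square root is an assumed choice;
# 1 + log(count) is another common one.
def damped_tf(count)
  Math.sqrt(count)
end

# Cosine similarity between two sparse vectors (Hash of term => weight).
# Dividing by both vector lengths removes the bias towards documents
# whose weights are large simply because they are long.
def cosine(a, b)
  dot = a.sum { |term, weight| weight * b.fetch(term, 0.0) }
  norm = ->(v) { Math.sqrt(v.values.sum { |weight| weight * weight }) }
  denominator = norm.call(a) * norm.call(b)
  denominator.zero? ? 0.0 : dot / denominator
end
```

With these definitions, a document and a copy of it with every weight multiplied by ten point in the same direction, so `cosine` scores them as essentially identical, whereas an unnormalized dot product would favor the larger one.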