Open jpmckinney opened 11 years ago
Thanks for flagging. Since I don't use the gem anymore myself, I'm unlikely to invest the time to address this anytime soon. However, I'd welcome a patch :)
Could you expand briefly on the impact of normalization and damping?
I'd invite you to read about tf*idf implementations (there are links to references in my gem's README), but briefly, if you perform a similarity search without any normalization, you will have too strong a bias towards:
As for your calculation of IDF, every reference I've found takes the log of that term, so I'm not quite sure how you came to your implementation.
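To make the point about the log concrete, here is a minimal sketch comparing a raw ratio with its logarithm. The function names `raw_idf` and `log_idf` are made up for illustration and are not taken from either gem; the natural log is an assumption, since references differ on the base and on smoothing terms.

```python
import math


def raw_idf(num_docs, doc_freq):
    # Hypothetical undamped variant: the plain ratio N / df.
    return num_docs / doc_freq


def log_idf(num_docs, doc_freq):
    # Textbook variant: log(N / df). Natural log is an assumption here.
    return math.log(num_docs / doc_freq)


if __name__ == "__main__":
    num_docs = 1_000_000
    for doc_freq in (1, 1_000, 1_000_000):
        print(doc_freq, raw_idf(num_docs, doc_freq), round(log_idf(num_docs, doc_freq), 2))
```

Without the log, a term that appears in a single document out of a million gets a weight a million times larger than a term that appears everywhere, so one rare term can swamp everything else in the score. With the log, the weights stay within a small range.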
Your implementation uses plain term and document frequencies, with no damping or normalization (an approach that, as far as I can tell, never occurs in the academic literature). My gem, on the other hand, uses the same formula as Lucene and other major implementations. See https://github.com/opennorth/tf-idf-similarity
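As a rough illustration of what damping and normalization change, here is a small sketch. It is not the code of either gem: the function names and the toy documents are invented, the square root is just one common damping choice, and IDF is left out to keep the example short.

```python
import math
from collections import Counter


def plain_score(query, doc):
    # Dot product of raw term counts: no damping, no normalization.
    q, d = Counter(query), Counter(doc)
    return sum(q[t] * d[t] for t in q)


def damped_normalized_score(query, doc):
    # Cosine similarity over square-root-damped term counts.
    q = {t: math.sqrt(c) for t, c in Counter(query).items()}
    d = {t: math.sqrt(c) for t, c in Counter(doc).items()}
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    q_len = math.sqrt(sum(w * w for w in q.values()))
    d_len = math.sqrt(sum(w * w for w in d.values()))
    return dot / (q_len * d_len) if q_len and d_len else 0.0


# Toy data (invented): a short on-topic document and a long one
# that repeats a single query term many times.
query = ["ruby", "gem"]
short_doc = ["ruby", "gem", "tutorial"]
long_doc = ["ruby"] * 50 + ["filler"] * 200

if __name__ == "__main__":
    for name, doc in (("short", short_doc), ("long", long_doc)):
        print(name, plain_score(query, doc), round(damped_normalized_score(query, doc), 3))
```

With plain counts, the long document wins just because it repeats "ruby" fifty times. With damping and length normalization, the short document that matches both query terms ranks higher.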