mhutter / string-similarity

Calculate String Similarities
MIT License

Wrong cosine similarity for urls #6

Closed by kalemi19 5 years ago

kalemi19 commented 5 years ago

Not sure if this gem is still maintained, but the cosine similarity it returns for the following two URLs is 97%:

https://maduradas.com/pena-ajena-la-ridicula-actuacion-estos-gaiteros-chavistas-programa-diosdado-video/

https://maduradas.com/sepalo-ortega-diaz-afirmo-globovision-la-vitalicia-deberan-subastadas-al-restituirse-la-democracia/

Just by looking at the URLs you can tell they are far from being the same.

According to this online tool, https://asecuritysite.com/forensics/simstring, the cosine similarity should be 0.

Meanwhile, I'm looking at the underlying algorithm.

mhutter commented 5 years ago

@kalemi19, the tool you linked to reports a cosine similarity of 0 for all inputs. So I'm pretty confident my implementation is the correct one ;-)

mhutter commented 5 years ago

Correction: the tool you linked compares words, while string similarity algorithms usually compare characters.
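To illustrate the difference, here is a minimal sketch in plain Ruby. It assumes cosine similarity is computed over token-frequency vectors, once with characters as tokens and once with whitespace-separated words. The helper names (`frequencies`, `cosine`, `char_cosine`, `word_cosine`) are hypothetical and are not part of this gem's API, and neither the gem nor the linked tool necessarily tokenizes exactly this way.

```ruby
# Count how often each token occurs (hypothetical helper).
def frequencies(tokens)
  tokens.each_with_object(Hash.new(0)) { |token, counts| counts[token] += 1 }
end

# Cosine similarity between the frequency vectors of two token lists.
def cosine(tokens_a, tokens_b)
  a = frequencies(tokens_a)
  b = frequencies(tokens_b)

  # Dot product over the tokens of a; tokens missing from b count as 0.
  dot = a.keys.sum { |token| a[token] * b[token] }
  norm_a = Math.sqrt(a.values.sum { |count| count * count })
  norm_b = Math.sqrt(b.values.sum { |count| count * count })

  # An empty input has no direction, so treat its similarity as 0.
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Character-level comparison: every character is a token.
def char_cosine(s1, s2)
  cosine(s1.chars, s2.chars)
end

# Word-level comparison: tokens are whitespace-separated words.
def word_cosine(s1, s2)
  cosine(s1.split, s2.split)
end
```

With these definitions, `char_cosine("listen", "silent")` comes out at approximately 1.0 because both strings contain the same letters the same number of times, while `word_cosine("listen", "silent")` is 0.0 because each string is a single, different word. A URL contains no whitespace, so under this word-level sketch any two distinct URLs score 0, whereas long URLs on the same domain share many characters and score high at the character level.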

kalemi19 commented 5 years ago

Got it. Makes sense now. Thank you.