neizod / duppub

Detects duplicate publications
MIT License

better algorithm detecting similar string #1

Open neizod opened 4 years ago

neizod commented 4 years ago

Right now I use edit distance to check the similarity between two strings. It runs so slowly that I had to limit the string size. Is there a better algorithm?

What I have in mind:

yildirimyigit commented 4 years ago

Why don't you use the pyjarowinkler library? I can work on this issue if you want.

maniis commented 4 years ago

I am thinking of adding Levenshtein distance, which is much faster than this.
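
For reference, a minimal sketch of Levenshtein distance in plain Python that keeps only two rows of the dynamic-programming table. The function name `levenshtein` is illustrative and not taken from this repository. Levenshtein distance is itself a form of edit distance and this version still takes time proportional to `len(a) * len(b)`, so any speedup over the current code would presumably come from the implementation rather than from the metric.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, using two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner loop (and the rows) over the shorter string

    # previous[j] = distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]
```

For example, `levenshtein("kitten", "sitting")` gives 3.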

neizod commented 4 years ago

@yildirimyigit I did this project for my friend as a quick Friday night hack in under a couple of hours, without any knowledge of string matching or existing libraries. So improvements are welcome!

@maniis Thank you, I will take a look at it.

khinezarthwe commented 4 years ago

I added cosine similarity for faster similarity calculation and made a PR.
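
The PR itself is not shown in this thread, so the following is only a rough sketch of how cosine similarity between two strings might look. It assumes character trigrams as features and lowercases the input; the helper names `char_ngrams` and `cosine_similarity` and the parameter `n` are hypothetical and not taken from the PR.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams (assumed feature choice)."""
    text = text.lower()
    if len(text) < n:
        # Strings shorter than n become a single feature; empty strings have none.
        return Counter([text]) if text else Counter()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: str, b: str, n: int = 3) -> float:
    """Cosine of the angle between the n-gram count vectors of a and b."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    # Dot product over the n-grams of a; a Counter returns 0 for missing keys.
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = (sqrt(sum(c * c for c in va.values()))
            * sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

Each string is processed once to build its vector, so comparing two strings costs roughly linear time in their lengths instead of the quadratic cost of edit distance. The trade-off is that n-gram vectors mostly ignore where in the string the n-grams occur.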