mpkorstanje / simmetrics

Similarity or Distance Metrics, e.g. Levenshtein, for Java
Apache License 2.0

More information required about the different metrics #15

Open ijabz opened 7 years ago

ijabz commented 7 years ago

Hi, sorry, this is not really an issue, but I have raised a simmetrics question at http://stackoverflow.com/questions/40740577/should-i-use-stringmetric-or-multisetmetric-for-comparing-these-strings-with-sim that I hope you can help me with.

Having said that, it would be helpful if there was a page that grouped and explained the metrics, so that casual users have a better chance of choosing the right algorithm. For example, I have only just realized that CosineSimilarity with WhiteSpace tokenizer just treats the words in a sentence as a set, ignoring their order in the sentence, although happily this is essentially what I want it to do.
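To make the order-insensitivity concrete, here is a minimal sketch in plain Java. It does not use the simmetrics API: the class `TokenOrderDemo` and its methods are hypothetical names, and the cosine is computed over word counts after splitting on whitespace, which is an assumption about how the tokenized comparison behaves rather than the library's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class, not part of simmetrics.
public class TokenOrderDemo {

    // Split on runs of whitespace and count how often each word occurs.
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of a word-count vector.
    static double norm(Map<String, Integer> counts) {
        double sum = 0.0;
        for (int c : counts.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity between the word-count vectors of two strings.
    // Word positions never enter the calculation, so order cannot matter.
    static double cosine(String a, String b) {
        Map<String, Integer> ca = wordCounts(a);
        Map<String, Integer> cb = wordCounts(b);
        if (ca.isEmpty() || cb.isEmpty()) {
            // Convention chosen for this sketch: two empty inputs are identical.
            return ca.isEmpty() && cb.isEmpty() ? 1.0 : 0.0;
        }
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            dot += e.getValue() * cb.getOrDefault(e.getKey(), 0);
        }
        return dot / (norm(ca) * norm(cb));
    }

    public static void main(String[] args) {
        // Same words, different order: prints 1.0
        System.out.println(cosine("the quick brown fox", "fox brown quick the"));
        // One word replaced: prints 0.75
        System.out.println(cosine("the quick brown fox", "the slow brown fox"));
    }
}
```

Reordering the words leaves the score at 1.0, while swapping a single word lowers it, which is the behaviour described above.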

mpkorstanje commented 7 years ago

Unfortunately the right metric depends strongly on the domain it is used in.

I suppose a grouping could be made on properties such as being influenced by repetition, but this should already be implied by a metric's nature as a set or multiset metric. Likewise, list metrics are influenced by both order and repetition. However, the implications of these properties are often not intuitively clear.
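As a rough illustration of those properties, the sketch below scores the same pair of token sequences three ways in plain Java. None of this is the simmetrics API: `CollectionKindDemo`, the generalized Jaccard used for the multiset view, and the toy positional score used for the list view are all stand-ins chosen for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical demo class, not part of simmetrics.
public class CollectionKindDemo {

    static List<String> tokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    static Map<String, Integer> counts(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    // Set view: Jaccard index, size of intersection over size of union.
    // Ignores both order and repetition.
    static double setJaccard(List<String> a, List<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(new HashSet<>(b));
        return union.isEmpty() ? 1.0 : (double) intersection.size() / union.size();
    }

    // Multiset view: generalized Jaccard, sum of min counts over sum of max counts.
    // Ignores order but is influenced by repetition.
    static double multisetJaccard(List<String> a, List<String> b) {
        Map<String, Integer> ca = counts(a);
        Map<String, Integer> cb = counts(b);
        Set<String> keys = new HashSet<>(ca.keySet());
        keys.addAll(cb.keySet());
        int min = 0;
        int max = 0;
        for (String k : keys) {
            int x = ca.getOrDefault(k, 0);
            int y = cb.getOrDefault(k, 0);
            min += Math.min(x, y);
            max += Math.max(x, y);
        }
        return max == 0 ? 1.0 : (double) min / max;
    }

    // List view (toy metric): fraction of positions that hold the same token.
    // Influenced by both order and repetition.
    static double positionalMatch(List<String> a, List<String> b) {
        int longest = Math.max(a.size(), b.size());
        if (longest == 0) {
            return 1.0;
        }
        int same = 0;
        for (int i = 0; i < Math.min(a.size(), b.size()); i++) {
            if (a.get(i).equals(b.get(i))) {
                same++;
            }
        }
        return (double) same / longest;
    }

    public static void main(String[] args) {
        List<String> a = tokens("new york new york");
        List<String> b = tokens("york new");
        System.out.println(setJaccard(a, b));      // prints 1.0
        System.out.println(multisetJaccard(a, b)); // prints 0.5
        System.out.println(positionalMatch(a, b)); // prints 0.0
    }
}
```

The same input pair scores 1.0, 0.5 and 0.0 depending on whether the tokens are treated as a set, a multiset or a list, which is why the choice is hard to make without looking at real data from the domain.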

So the best general advice I could give is to compare the different combinations of metrics, simplifiers and optionally tokenizers for their precision and recall in your domain.
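One way such a comparison could be set up is sketched below in plain Java. Everything here is hypothetical: the `MetricEvaluation` class, the two toy metrics, the 0.5 threshold and the four hand-made labelled pairs are placeholders. In practice the metric would be whichever simmetrics combination is being tried, and the labelled pairs would come from your own domain.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Hypothetical evaluation harness, not part of simmetrics.
public class MetricEvaluation {

    // Two strings plus a human judgement of whether they should match.
    static final class LabelledPair {
        final String a;
        final String b;
        final boolean match;

        LabelledPair(String a, String b, boolean match) {
            this.a = a;
            this.b = b;
            this.match = match;
        }
    }

    // Returns {precision, recall} when scores at or above the threshold count as a match.
    // Convention for this sketch: an empty denominator yields 1.0.
    static double[] precisionRecall(ToDoubleBiFunction<String, String> metric,
                                    double threshold,
                                    List<LabelledPair> data) {
        int truePositives = 0;
        int falsePositives = 0;
        int falseNegatives = 0;
        for (LabelledPair p : data) {
            boolean predicted = metric.applyAsDouble(p.a, p.b) >= threshold;
            if (predicted && p.match) {
                truePositives++;
            } else if (predicted) {
                falsePositives++;
            } else if (p.match) {
                falseNegatives++;
            }
        }
        int predictedPositives = truePositives + falsePositives;
        int actualPositives = truePositives + falseNegatives;
        double precision = predictedPositives == 0 ? 1.0 : (double) truePositives / predictedPositives;
        double recall = actualPositives == 0 ? 1.0 : (double) truePositives / actualPositives;
        return new double[] {precision, recall};
    }

    // Toy metric 1: 1.0 if the strings are equal ignoring case, else 0.0.
    static double exactIgnoreCase(String a, String b) {
        return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
    }

    // Toy metric 2: 1.0 if the first words are equal ignoring case, else 0.0.
    static double sameFirstWord(String a, String b) {
        String first = a.trim().split("\\s+")[0];
        String second = b.trim().split("\\s+")[0];
        return first.equalsIgnoreCase(second) ? 1.0 : 0.0;
    }

    // Made-up labelled pairs, for illustration only.
    static List<LabelledPair> sampleData() {
        return Arrays.asList(
                new LabelledPair("The Beatles", "the beatles", true),
                new LabelledPair("Abbey Road", "Road Abbey", true),
                new LabelledPair("Let It Be", "Let It Bleed", false),
                new LabelledPair("Help", "Help", true));
    }

    public static void main(String[] args) {
        List<LabelledPair> data = sampleData();
        double[] exact = precisionRecall(MetricEvaluation::exactIgnoreCase, 0.5, data);
        double[] first = precisionRecall(MetricEvaluation::sameFirstWord, 0.5, data);
        System.out.printf("exactIgnoreCase: precision=%.2f recall=%.2f%n", exact[0], exact[1]);
        System.out.printf("sameFirstWord:   precision=%.2f recall=%.2f%n", first[0], first[1]);
    }
}
```

On this toy data the exact-match metric never produces a false match but misses the reordered title, while the first-word metric also accepts one pair that should not match. Running the same loop over each candidate combination and a realistic labelled sample gives a direct basis for choosing between them.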