ijabz opened this issue 7 years ago (status: Open)
Unfortunately the right metric depends strongly on the domain it is used in.
I suppose a grouping could be made on properties such as being influenced by repetition, but this should already be implied by a metric's nature as a set or multiset metric. Likewise, list metrics are influenced by both order and repetition. However, the implications of these properties are often not intuitively clear.
So the best general advice I could give is to compare the different combinations of metrics, simplifiers and optionally tokenizers for their precision and recall in your domain.
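As a rough illustration of that advice, here is a minimal sketch in plain Python rather than SimMetrics itself. The two metrics, the `precision_recall` helper, the threshold and the labelled pairs are all hypothetical stand-ins; in practice you would plug in your own metric, simplifier and tokenizer combinations and a labelled sample from your own domain.

```python
from difflib import SequenceMatcher


def token_set_jaccard(a, b):
    """Order- and repetition-insensitive: Jaccard index over lower-cased whitespace tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def char_sequence_ratio(a, b):
    """Order-sensitive: character-level similarity from the standard library's difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def precision_recall(metric, labelled_pairs, threshold):
    """Treat metric(a, b) >= threshold as a predicted match and score it against the labels."""
    tp = fp = fn = 0
    for a, b, is_match in labelled_pairs:
        predicted = metric(a, b) >= threshold
        if predicted and is_match:
            tp += 1
        elif predicted and not is_match:
            fp += 1
        elif not predicted and is_match:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical hand-labelled sample: (string a, string b, should they match?)
LABELLED = [
    ("Abbey Road", "Road Abbey", True),
    ("Abbey Road", "Abbey Road", True),
    ("Abbey Road", "Penny Lane", False),
]

for metric in (token_set_jaccard, char_sequence_ratio):
    p, r = precision_recall(metric, LABELLED, threshold=0.8)
    print(f"{metric.__name__}: precision={p:.2f} recall={r:.2f}")
```

Running the same loop over every candidate combination, and over a few thresholds, shows which one trades precision for recall in the way your domain needs.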
Hi, sorry, this is not really an issue, but I have raised a SimMetrics question on http://stackoverflow.com/questions/40740577/should-i-use-stringmetric-or-multisetmetric-for-comparing-these-strings-with-sim that I hope you can help me with.
Having said that, it would be helpful if there was a page that grouped and explained the metrics, so that casual users have a better stab at choosing the right algorithm. For example, I have only just realized that CosineSimilarity with the WhiteSpace tokenizer just treats the words in a sentence as a set, ignoring their order in the sentence, although happily this is essentially what I want it to do.
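To make the order-insensitivity concrete, here is a small sketch in plain Python. It is not the SimMetrics implementation: `whitespace_tokens` and `cosine` are hypothetical helpers that compute the textbook cosine similarity over token counts. Passing `set(...)` of the tokens gives the set behaviour described above, while passing the raw token list keeps repetition (multiset behaviour). In both cases word order has no effect.

```python
import math
from collections import Counter


def whitespace_tokens(text):
    """Split on runs of whitespace (a stand-in for a whitespace tokenizer)."""
    return text.split()


def cosine(a_tokens, b_tokens):
    """Cosine similarity between two token collections, using token counts as vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


x = whitespace_tokens("the quick brown fox")
y = whitespace_tokens("fox brown quick the")

# Same words in a different order: token counts are identical, so order is invisible.
score_reordered = cosine(x, y)

# Repetition only matters when tokens are kept as a multiset rather than a set.
u = whitespace_tokens("a a b")
v = whitespace_tokens("a b")
score_as_set = cosine(set(u), set(v))
score_as_multiset = cosine(u, v)

print(score_reordered, score_as_set, score_as_multiset)
```

If word order does matter in your domain, a list-based or character-based metric would be the thing to compare against.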