The result isn't a proper distance

per-erik commented 13 years ago

"Hello World".score("hello") is not equal to "hello".score("Hello World"), so the result isn't symmetric (a nice property to have if one wants to start measuring string edit distances).

This is somewhat in connection to issue #10. (This algorithm offers fast execution while giving up on some mathematical properties that some might find nice to have.)

Note that the particular test case above works "as expected" if one uses a fuzziness of 1. That isn't a solution though since it would yield strange results for things like "foo".score("Hello World", 1).
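The asymmetry described above can be checked mechanically for any scoring function. The sketch below is only an illustration: `toyScore` is a hypothetical stand-in scorer invented for this example (it is not string_score's algorithm), and `asymmetricPairs` is an assumed helper name.

```javascript
// Hypothetical stand-in scorer, NOT string_score's algorithm: returns the
// fraction of `target` covered by `query` when `target` starts with `query`
// (case-insensitive), and 0 otherwise.
function toyScore(target, query) {
  const t = target.toLowerCase();
  const q = query.toLowerCase();
  if (q.length === 0 || !t.startsWith(q)) return 0;
  return q.length / t.length;
}

// Collect every pair for which score(a, b) !== score(b, a).
// Each entry is [a, b, score(a, b), score(b, a)].
function asymmetricPairs(score, words) {
  const found = [];
  for (let i = 0; i < words.length; i++) {
    for (let j = i + 1; j < words.length; j++) {
      const ab = score(words[i], words[j]);
      const ba = score(words[j], words[i]);
      if (ab !== ba) found.push([words[i], words[j], ab, ba]);
    }
  }
  return found;
}

const asymmetric = asymmetricPairs(toyScore, ["Hello World", "hello", "foo"]);
console.log(asymmetric);
```

With the library loaded, the same helper could presumably be pointed at the real scorer by passing `(a, b) => a.score(b)` instead of `toyScore`.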

Also, it would be nice if two equal strings gave the result 0 and two completely different strings gave the result 1. Again, to satisfy the distance property. (If the distance between NY and London is 0 then NY and London are the same location. If the distance between the strings "foo" and "bar" is 0, then they are the same strings.)

Finally, I have not yet tested whether it satisfies the triangle inequality (a.score(b) + b.score(c) >= a.score(c); in other words, if you go from location a to location b and then from b to c, you must have walked at least the distance from a to c), but my guess is that it does not. This too is needed for a proper measurement of distance.
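The three properties discussed so far (zero distance only for equal strings, symmetry, and the triangle inequality) can be spot-checked on a sample of strings. The sketch below uses the classic Levenshtein distance as an example of a function that does satisfy them; `checkMetricAxioms` is a hypothetical helper name, and passing the check on a finite sample is evidence rather than proof.

```javascript
// Classic Levenshtein edit distance (two-row dynamic programming).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Check the metric axioms for `dist` over a finite sample of strings.
// Returns a list of violations; an empty list means none were found.
function checkMetricAxioms(dist, words) {
  const violations = [];
  for (const a of words) {
    for (const b of words) {
      const dab = dist(a, b);
      if ((dab === 0) !== (a === b)) violations.push(["identity", a, b]);
      if (dab !== dist(b, a)) violations.push(["symmetry", a, b]);
      for (const c of words) {
        if (dab + dist(b, c) < dist(a, c)) violations.push(["triangle", a, b, c]);
      }
    }
  }
  return violations;
}

const sample = ["foo", "bar", "hello", "Hello World", ""];
const violations = checkMetricAxioms(levenshtein, sample);
console.log(violations.length === 0 ? "no violations found" : violations);
```

To test a similarity score rather than a distance, one option (an assumption, not something the library provides) is to pass `(a, b) => 1 - a.score(b)` as `dist`.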

On the other hand, you never set out to create an algorithm to measure string distances, only to score strings.

PS. I love the project and it fits perfectly for what I'm about to do. Just writing this to explain why people might ask what this solution has to offer over things like the Levenshtein Distance or Hamming Distance and to document to other users that some things will not be possible with this solution. For the interested, this would likely be an algorithm producing an ordinal scale (ranking scale) so you can always compare two results but you can't do basic arithmetic on it. See http://en.wikipedia.org/wiki/Level_of_measurement for examples of what this implies.
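To illustrate what an ordinal scale allows, the sketch below ranks candidates by score, which only relies on comparing results. `toySubsequenceScore` is a hypothetical stand-in scorer made up for this example (not string_score's algorithm), and `rankCandidates` is an assumed helper name.

```javascript
// Hypothetical stand-in scorer (not string_score): share of the characters
// of `query` that can be found, in order, somewhere in `target`.
function toySubsequenceScore(target, query) {
  if (query.length === 0) return 0;
  const t = target.toLowerCase();
  const q = query.toLowerCase();
  let pos = 0;
  let matched = 0;
  for (let i = 0; i < q.length; i++) {
    const found = t.indexOf(q[i], pos);
    if (found === -1) continue; // skip characters that cannot be matched
    matched++;
    pos = found + 1;
  }
  return matched / q.length;
}

// Sort candidates from best to worst match. Only the ORDER of the scores
// is used; sums or differences of ordinal scores carry no meaning.
function rankCandidates(score, query, candidates) {
  return candidates
    .map((text) => ({ text, score: score(text, query) }))
    .sort((x, y) => y.score - x.score);
}

const ranked = rankCandidates(toySubsequenceScore, "hlo", ["foo", "halt", "Hello World"]);
console.log(ranked.map((r) => r.text));
```

The ranking is meaningful on an ordinal scale, but a statement such as "the first match is twice as good as the second" would not be.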

joshaven commented 13 years ago

Wow, I wish I had the time to re-work and rethink my score right now. I am totally swamped at work and the work has nothing to do with JavaScript at the moment. If you have any code recommendations then I would love to include anything into the project that makes the score better.

At the moment, though, I will have to keep your email for further evaluation. I set out to get a fast score because the score I was using was too slow... it seems that I've stumbled into something that is more about math than about speed, which was not at all what I was shooting for.

Thanks for your time and thoughtful response!

On Fri, Nov 4, 2011 at 10:08 PM, per-erik <reply@reply.github.com> wrote:

Sincerely, Joshaven Potter

aventuralabs commented 11 years ago

It seems that there's another embedded issue: if we remove the exact equality test at the start, equal strings do not receive a score of 1. For example, "hello".score("hello") = 0.95.

I'm going to attempt to round this out into a proper metric while maintaining the speed.

joshaven / string_score

The result isn't a proper distance #15