Out-of-order search - Githubissues

orlade commented 9 years ago

I had a play with string_score, and while it's excellent for what it does, it doesn't quite fit my use case out of the box. In particular, it lacks recognition for highly similar substrings that are in a different order.

tl;rd: Do you plan to add support for out-of-order substring matching? Or can you at least think of a smart way to do it? If this is totally outside of the scope of string_score, then go ahead and close this. I'm mostly just rubber ducking the problem.

For background, I have two spreadsheets where each row represents a building with some attributes, including ID and name. I'm told that both spreadsheets contain the same 66 buildings, except that one of the spreadsheets has 72 rows, and neither of them use the same IDs or names consistently. One will abbreviate some names, the other will abbreviate others, or the same ones but in a different way. It's a mess, so I'd like an automated, objective mechanism for associating the "matching" rows and ultimately merging the attributes.

For example, when searching for a match for 2G8 Bahagian Pinjaman Perumahan, string_score with 0.5 fuzziness thinks that PMO is a better match than LOT 2G8 (2M10 & 2M11) Bhg. Pinjaman Perumahan, JPM. Or for a more English example, comparing university of oxford with oxford of university scores 0.027.

To address this failure mode, I've wrapped it in a pretty gnarly loop:

Both the search string and the comparison string are split into words
Each word of the search string is string_score'd against each word of the comparison string (concatenating the whole comparison string as an option for matching abbreviations)
For each search word, the score for the most similar comparison word is recorded
The score for the search string against the comparison string is taken as the sum of the (maximum) score of each word, normalised by dividing by the number of search words.

Clearly this is more expensive (something like an order of magnitude, or at least a factor of the average number of words per string), but it's pretty easy to implement given what string_score already does. Can you think of a straightforward way to modify your algorithm to handle this kind of case? Or even just a smarter way to package it than mine?

joshaven commented 9 years ago

Thanks for your interest in my code, it is an honor to know that people are still finding it useful.

Would it work to split the out-of-order text into segments and sum their scores?

var agScore = 0, segments = "World Hello".split(' ');
segments.forEach(function (word) { agScore = agScore + "Hello World".score(word); });
console.log('The aggregate score of all word segments is: ' + agScore);

I am super busy so I am afraid I cannot answer how feasible it is to modify the algorithm to support better solution at the moment.

If you would like to see this feature builtin to some degree can you describe the desired results in the test language? See the following for examples: https://github.com/joshaven/string_score/blob/master/tests/test_string_score.js

orlade commented 9 years ago

No problem, I hardly expect you to go and add features, just wondering if you had any brainwaves.

My previous example would be something like ok('University of Melbourne'.score('Melbourne Uni') < 'University of Melbourne'.score('Melbourne University'), 'Better out-of-order matches still score better than worse out-of-order matches'); (this passes with your method).

I think your method works in a basic sense, but the score needs to be penalised/normalised so that adding in additional unmatched words will be less good, i.e. ok('hello world'.score('world foo bar hello') < 'hello world'.score('world hello'), 'Unmatched words should be penalised');.

But normalising has to be a little intelligent such that ok('University of Melbourne'.score('Melbourne University') > 'University of Melbourne'.score('Unimelb'), 'Multiple out-of-order better than one wrong'); and ok('University of Melbourne'.score('Melbourne University') > 'University of Melbourne'.score('Uni versity of Melb ourne'), 'Fewer but more accurate words better than more close but incorrect fragments ');.

joshaven / string_score

Out-of-order search #24