Open orlade opened 9 years ago
Thanks for your interest in my code; it is an honor to know that people are still finding it useful.
Would it work to split the out-of-order text into segments and sum their scores?
var agScore = 0, segments = "World Hello".split(' ');
segments.forEach(function (word) { agScore = agScore + "Hello World".score(word); });
console.log('The aggregate score of all word segments is: ' + agScore);
I am super busy, so I am afraid I cannot say at the moment how feasible it would be to modify the algorithm to support a better solution.
If you would like to see this feature built in to some degree, can you describe the desired results in the test language? See the following for examples: https://github.com/joshaven/string_score/blob/master/tests/test_string_score.js
No problem, I hardly expect you to go and add features, just wondering if you had any brainwaves.
My previous example would be something like
ok('University of Melbourne'.score('Melbourne Uni') < 'University of Melbourne'.score('Melbourne University'), 'Better out-of-order matches still score better than worse out-of-order matches');
(this passes with your method).
I think your method works in a basic sense, but the score needs to be penalised/normalised so that adding extra unmatched words lowers it, i.e.
ok('hello world'.score('world foo bar hello') < 'hello world'.score('world hello'), 'Unmatched words should be penalised');
But normalising has to be a little intelligent, such that
ok('University of Melbourne'.score('Melbourne University') > 'University of Melbourne'.score('Unimelb'), 'Multiple out-of-order better than one wrong');
and
ok('University of Melbourne'.score('Melbourne University') > 'University of Melbourne'.score('Uni versity of Melb ourne'), 'Fewer but more accurate words better than more close but incorrect fragments ');
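One way the normalisation described in these assertions might be sketched: score each query word against its best-matching target word, sum the results, and divide by the larger of the two word counts so that extra unmatched words drag the score down. Everything named here (normalisedScore, prefixScore) is hypothetical and not part of string_score; prefixScore is a deliberately simple stand-in for the real per-word scorer so the sketch runs on its own.

```javascript
// Stand-in word scorer, NOT string_score's algorithm: 1 for identical words,
// a partial score when the query word is a prefix of the target word, else 0.
function prefixScore(targetWord, queryWord) {
  const t = targetWord.toLowerCase();
  const q = queryWord.toLowerCase();
  if (t === q) return 1;
  return t.startsWith(q) ? q.length / t.length : 0;
}

// Hypothetical helper: order-independent score of `query` against `target`.
// `wordScore` is any function (targetWord, queryWord) -> number in [0, 1].
function normalisedScore(target, query, wordScore = prefixScore) {
  const targetWords = target.split(/\s+/).filter(Boolean);
  const queryWords = query.split(/\s+/).filter(Boolean);
  if (targetWords.length === 0 || queryWords.length === 0) return 0;

  // Each query word contributes its best score against any target word.
  let total = 0;
  for (const q of queryWords) {
    total += Math.max(...targetWords.map((t) => wordScore(t, q)));
  }

  // Dividing by the larger word count penalises both extra and missing words.
  return total / Math.max(targetWords.length, queryWords.length);
}
```

With the stand-in scorer, the four comparisons above come out in the desired order. With string_score loaded, one could presumably pass (t, q) => t.score(q, 0.5) as wordScore instead; whether the same comparisons still hold with that scorer is untested here.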
I had a play with string_score, and while it's excellent for what it does, it doesn't quite fit my use case out of the box. In particular, it lacks recognition for highly similar substrings that are in a different order.
tl;dr: Do you plan to add support for out-of-order substring matching? Or can you at least think of a smart way to do it? If this is totally outside of the scope of string_score, then go ahead and close this. I'm mostly just rubber ducking the problem.
For background, I have two spreadsheets where each row represents a building with some attributes, including ID and name. I'm told that both spreadsheets contain the same 66 buildings, except that one of the spreadsheets has 72 rows, and neither of them uses IDs or names consistently with the other. One will abbreviate some names, the other will abbreviate others, or the same ones but in a different way. It's a mess, so I'd like an automated, objective mechanism for associating the "matching" rows and ultimately merging the attributes.
For example, when searching for a match for "2G8 Bahagian Pinjaman Perumahan", string_score with 0.5 fuzziness thinks that "PMO" is a better match than "LOT 2G8 (2M10 & 2M11) Bhg. Pinjaman Perumahan, JPM". Or for a more English example, comparing "university of oxford" with "oxford of university" scores 0.027.
To address this failure mode, I've wrapped it in a pretty gnarly loop:
Clearly this is more expensive (something like an order of magnitude, or at least a factor of the average number of words per string), but it's pretty easy to implement given what string_score already does. Can you think of a straightforward way to modify your algorithm to handle this kind of case? Or even just a smarter way to package it than mine?
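The loop itself is not shown in this thread, so the following is only a guess at the general shape such a wrapper could take, not the original code. It tokenises both strings, scores every query word against every candidate word, averages the best per-word scores, and picks the highest-scoring candidate. The names tokenize, orderFreeScore, bestMatch and sharedPrefixScore are all hypothetical, and sharedPrefixScore is a crude stand-in for string_score's per-word scoring so that the sketch is self-contained.

```javascript
// Stand-in word scorer (NOT string_score's algorithm): length of the shared
// prefix divided by the longer word's length, so identical words score 1.
function sharedPrefixScore(a, b) {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i / Math.max(a.length, b.length);
}

// Lower-case and split on anything that is not an ASCII letter or digit, so
// that "(2M10 & 2M11)" and "Bhg." yield clean tokens.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Average, over the query's words, of each word's best score against any
// candidate word. Cost is (query words) x (candidate words) scorer calls.
function orderFreeScore(candidate, query, wordScore = sharedPrefixScore) {
  const candidateWords = tokenize(candidate);
  const queryWords = tokenize(query);
  if (candidateWords.length === 0 || queryWords.length === 0) return 0;
  let total = 0;
  for (const q of queryWords) {
    total += Math.max(...candidateWords.map((c) => wordScore(c, q)));
  }
  return total / queryWords.length;
}

// Return the candidate with the highest order-free score for the query,
// or null when the candidate list is empty.
function bestMatch(query, candidates, wordScore = sharedPrefixScore) {
  let best = null;
  for (const candidate of candidates) {
    const score = orderFreeScore(candidate, query, wordScore);
    if (best === null || score > best.score) best = { candidate, score };
  }
  return best;
}
```

With the stand-in scorer, bestMatch('2G8 Bahagian Pinjaman Perumahan', [...]) prefers the "LOT 2G8 ..." row over "PMO". The nested loop is where the extra cost comes from: one scorer call per pair of words rather than one per pair of strings.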