larryhastings / correlate

A clever brute-force correlator for kinda-messy data

Add a score boost based on the order of keys #5

Open larryhastings opened 2 months ago

larryhastings commented 2 months ago

I want correlate to start paying attention to the order of "keys" added. Consider this input:

dataset A:

"A" "big" "wonderful" "mouse" -> value X

dataset B:

"Let's" "get" "that" "big" "mouse" -> value Y

"mouse" "should" "never" "get" "big" -> value Z

There are two possible matches, X->Y and X->Z. I've contrived them so that they'd get the same score using the current version of the heuristic--currently "correlate" has no way of preferring one of these matches over the other. I claim that X->Y is a slightly better match than X->Z, because "big" should come before "mouse".

The idea is: remember the order of keys, and add a score boost based on this formula:

                             (length of longest actual common subsequence) ** 3
----------------------------------------------------------------------------------------------------------
(length of longest possible common subsequence in A) * (length of longest possible common subsequence in B)

Taking our example above, the longest possible subsequence for X is 4, for Y is 5, and for Z is 5. The longest actual subsequence in common for X->Y is 2, and for X->Z is 1. So the boost for X->Y would be (2 ** 3) / (4 * 5), or 0.4, and the boost for X->Z would be (1 ** 3) / (4 * 5), or 0.05.

(Why cubed? It's (score / A) * (score / B) * score. If you don't add that third score, then getting a 3-word run out of a possible 5 scores more highly than a 10-word run out of a possible 20. But getting a 10-word run is way more impressive, and therefore a much stronger signal! Maybe it should be even higher, like length of actual ** 3.5! Maybe it should be tunable!)
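
Here's that arithmetic as a quick Python sketch--the order_boost helper is just made up for illustration, with the exponent hard-coded at 3:

    def order_boost(actual_lcs, possible_a, possible_b):
        # proposed boost: (length of longest actual common subsequence) ** 3,
        # divided by (longest possible in A) * (longest possible in B)
        return actual_lcs ** 3 / (possible_a * possible_b)

    print(order_boost(2, 4, 5))  # X->Y: 8 / 20 = 0.4
    print(order_boost(1, 4, 5))  # X->Z: 1 / 20 = 0.05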


Note that we could add support for this without changing the API! We change the semantics of dataset.set_keys() so that the order of the keys is significant, and retained.
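
For illustration, here's roughly what that would look like at the call site, assuming correlate's Correlator object with dataset_a / dataset_b and the set_keys() method mentioned above--a sketch, not tested code:

    import correlate

    c = correlate.Correlator()
    # Under this proposal, the order of the keys passed to set_keys()
    # would be remembered and used for the order-based score boost.
    c.dataset_a.set_keys(["A", "big", "wonderful", "mouse"], "X")
    c.dataset_b.set_keys(["Let's", "get", "that", "big", "mouse"], "Y")
    c.dataset_b.set_keys(["mouse", "should", "never", "get", "big"], "Z")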


What follows are notes on a possible algorithm for determining the "length of the longest actual common subsequence". I think it might work!


This appears to be an easier version of the "longest common subsequence" problem:

https://en.wikipedia.org/wiki/Longest_common_subsequence

Why is it easier? I can guarantee that every word in the two sequences appears at most once. I believe that makes it much easier--and I've come up with a cheap algorithm that I think works fine for this version.


Imagine you have two non-empty lists, X and Y. Each list is a sequence of arbitrary words. Each list only contains a particular word once; if the first entry in list X is the word "nickel", that's guaranteed to be the only instance of the word "nickel" in list X. (To save typing--and reading--I'll represent words as individual letters in my examples below.)

Your goal is to remove words from each list until X and Y are exactly the same, while keeping X and Y as long as possible. The length of the resulting lists is the "score", and naturally we want to maximize our "score". Two solutions with the same "score" are equivalent for my purposes.

Here's my "first example". Let's say list X is [ M A B C D E ], and list Y is [ C D N E A B ]. The best option is to remove A and B from both lists, remove M from list X, and remove N from list Y. Now both lists are [ C D E ], which is length 3; this is the best possible result for these inputs.

You're not allowed to rearrange either list--you can only remove entries. For example, if list X is [ A B C D E ] and list Y is [ E D C B A ], the best you can do is a list of length 1, and there are five possible (fungible) results. If list X is [ A B C D E ] and list Y is [ D C E A B ], the best you can do is a list of length 2, and there are three possible (fungible) results: [ A B ], [ C E ], and [ D E ].
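
These claims are easy to check by brute force. Here's a throwaway sketch (not part of correlate, and it assumes each word appears at most once per list) that finds every longest common subsequence for that last example:

    from itertools import combinations

    def is_subsequence(candidate, seq):
        it = iter(seq)
        return all(word in it for word in candidate)

    def all_longest_common_subsequences(x, y):
        # Brute force: try every subsequence of x, longest first,
        # and keep the ones that also appear (in order) in y.
        for length in range(len(x), 0, -1):
            found = [list(candidate)
                     for candidate in combinations(x, length)
                     if is_subsequence(candidate, y)]
            if found:
                return found
        return []

    print(all_longest_common_subsequences(list("ABCDE"), list("DCEAB")))
    # prints [['A', 'B'], ['C', 'E'], ['D', 'E']]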

Here's the best algorithm I've come up with so far. Note that, as with most modern programming languages, I define the first entry in a list as being at index 0. (Pascal is the only language I know where the first entry is index 1!)

--

First, create an empty "result" list.

Iterate over each list and throw away every word that doesn't appear in the other list. (If a word only appears in one of the two lists, it can't be part of the solution.) Assertion: the two lists must now be the same length.

Next, for every entry in X and Y, compute the index of the appearance of that word in the other list, and store those values in order in two new lists. We'll call these two new lists "other-list-index for X" and "other-list-index for Y". If you like hard-to-read code: for every value R, where R < len(X), " X[R] == Y[other-list-index for X[R]] " must be true, and " Y[R] == X[other-list-index for Y[R]] " must also be true.
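
In Python terms, using the lists from the "first example" above (after the unique words M and N have been thrown away), the other-list-index lists and that invariant look like this:

    x = ["A", "B", "C", "D", "E"]         # first example's X, minus M
    y = ["C", "D", "E", "A", "B"]         # first example's Y, minus N

    # other-list-index: where each word appears in the other list
    oli_x = [y.index(word) for word in x]   # [3, 4, 0, 1, 2]
    oli_y = [x.index(word) for word in y]   # [2, 3, 4, 0, 1]

    # the invariant described above
    for r in range(len(x)):
        assert x[r] == y[oli_x[r]]
        assert y[r] == x[oli_y[r]]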

Now you loop:

First, is either of the two lists empty? If so, stop. (Or, "break".) "result" contains the longest common subsequence, and the length of "result" is the score of this solution.

Next, examine the first entry in each of the two other-list-index lists. If the two entries are both 0, we're gonna "keep" at least one value. Iterate over both lists from the front and count how many values are identical; let's call that count K. Append the first K values from either list to "result". Discard the first K values from all four lists (X, Y, other-list-index for X, other-list-index for Y). Finally, iterate over both other-list-index lists and deduct K from every value. (Assertion: after deleting the K values from the front of all four lists, but before deducting K, every remaining value in the other-list-index lists must be >= K.) Now start over at the top of the loop. (AKA, "continue".)

Else, the first entry in at least one of the two other-list-index lists is not zero. (Assertion: both are nonzero--if one value were zero, the other value would have to be zero too.) We must "discard" one word, removing it from both lists. First, compare the two numbers; if one is smaller, we "prefer" that list. (If the two numbers are equal, arbitrarily pick one list to prefer, it doesn't matter which.) Let's call the preferred list "list P", and the non-preferred list "list Q"; we're gonna surgically discard the first word from the non-preferred list, aka list Q. Remember the first value in the "other-list-index for Q" list; let's call that number R. Now discard the first entry of Q, and the first entry of "other-list-index for Q". Discard the Rth entry from list P, and the Rth entry from "other-list-index for P". Iterate over "other-list-index for P" and subtract 1 from every number. Iterate over "other-list-index for Q" and subtract 1 from every number that's > R. Then start over at the top of the loop (aka "continue").

--
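
Here's the whole procedure as a runnable Python sketch, following the steps above as literally as I could manage; the names are just for illustration and aren't part of correlate:

    def ordered_common_subsequence(x, y):
        """Greedy sketch of the algorithm described above.

        x and y are lists of words; each word is assumed to appear at
        most once per list.  Returns the common subsequence it finds.
        """
        # Throw away every word that doesn't appear in the other list.
        x_words, y_words = set(x), set(y)
        x = [word for word in x if word in y_words]
        y = [word for word in y if word in x_words]
        assert len(x) == len(y)

        # other-list-index lists: where each word appears in the other list.
        oli_x = [y.index(word) for word in x]
        oli_y = [x.index(word) for word in y]

        result = []
        while x and y:
            if oli_x[0] == 0 and oli_y[0] == 0:
                # "keep": count how many leading words are identical.
                k = 0
                while k < len(x) and x[k] == y[k]:
                    k += 1
                result.extend(x[:k])
                del x[:k], y[:k]
                del oli_x[:k], oli_y[:k]
                oli_x[:] = [i - k for i in oli_x]
                oli_y[:] = [i - k for i in oli_y]
            else:
                # "discard": prefer the list whose other-list-index starts
                # lower (P); discard the first word of the other list (Q),
                # and the same word where it appears in P, at index R.
                if oli_x[0] <= oli_y[0]:
                    p, oli_p, q, oli_q = x, oli_x, y, oli_y
                else:
                    p, oli_p, q, oli_q = y, oli_y, x, oli_x
                r = oli_q[0]
                del q[0], oli_q[0]
                del p[r], oli_p[r]
                oli_p[:] = [i - 1 for i in oli_p]
                oli_q[:] = [i - 1 if i > r else i for i in oli_q]
        return result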

Here's an example run-through, using the inputs from the "first example" above. We start with our inputs:

X: [ M A B C D E ] Y: [ C D N E A B ]

First we create our empty "result" list. Then we discard the unique words from both lists: M from list X, and N from list Y. Our values now look like:

result: [ ] X: [ A B C D E ] Y: [ C D E A B ]

Next we compute the other-list-index lists. For example, the first value in list X is A; the index for A in list Y is 3. So the first value in other-list-index for X is 3, and so on.

Our values now look like this:

result: [ ] X: [ A B C D E ] other-list-index for X: [ 3 4 0 1 2 ] Y: [ C D E A B ] other-list-index for Y: [ 2 3 4 0 1 ]

Now we start the loop.

Since neither list is empty, we don't stop. And it's not true that both other-list-index lists have a 0 in the first entry, so we don't "keep". We must "discard" a word from both lists.

The first entry in other-list-index for X is 3, the first entry in other-list-index for Y is 2, so we "prefer" list Y. List Y is our "list P", list X is our "list Q", and our value "R" is 3. We surgically remove the first word from list X, which is A. We discard the first entry in list X and other-list-index for X. We discard the entries at index 3 (aka "the fourth entries") in list Y and other-list-index for Y. We decrement every value in other-list-index for Y. We decrement every value in other-list-index for X if it's > 3.

Our values now look like:

result: [ ] X: [ B C D E ] other-list-index for X: [ 3 0 1 2 ] Y: [ C D E B ] other-list-index for Y: [ 1 2 3 0 ]

Neither list is empty so we don't stop. Neither other-list-index list starts with a 0, so we don't "keep". We "discard" again.

The first entry in other-list-index for X is 3, the first entry in other-list-index for Y is 1, so again we "prefer" list Y. List Y is our "list P", list X is our "list Q", and our value "R" is 3. We surgically remove the first entry from list X, which is B.

Our values now look like:

result: [ ] X: [ C D E ] other-list-index for X: [ 0 1 2 ] Y: [ C D E ] other-list-index for Y: [ 0 1 2 ]

Neither list is empty, so we continue. The two other-list-index lists both start with 0, so we "keep". We zip down both lists and find that the first 3 values of each list are the same. So our K is 3. We copy 3 values from either list to "result", we discard the first three values from each of the four lists, and we would subtract 3 from every value in the other-list-index lists... but at this point they're both empty.

The lists now look like:

result: [ C D E ] X: [ ] other-list-index for X: [ ] Y: [ ] other-list-index for Y: [ ]

Both lists are empty, which satisfies the "at least one list is empty" condition, so we stop. The result is length 3, which is the best result for these inputs. Success!
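
For what it's worth, the sketch above reproduces this run-through:

    x = ["M", "A", "B", "C", "D", "E"]
    y = ["C", "D", "N", "E", "A", "B"]
    print(ordered_common_subsequence(x, y))   # prints ['C', 'D', 'E']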

larryhastings commented 2 months ago

I'm starting to realize: nope, this is just the "longest common subsequence" problem. I thought I could guarantee all keys in each sequence would be unique (see "nickel" above), because each sequence would only contain keys from one particular round. But... why did I think that?

Consider that you might map the sequence of keys "the" "mouse" "with" "the" "big" "hat" "on" to a value. That has the key "the" twice--and what's wrong with that? Nothing.

On the other hand, I'm also realizing that I really don't need to do anything special with respect to "rounds". We've already accounted for the concept that the word "the" was mapped twice; it would be redundant to consider it here too. And if the sequence of keys "the" "mouse" "with" "the" "big" "hat" maps to value X in dataset A, and value Y in dataset B, so that it's common to both of them, then by definition the word "the" was mapped to each of the two values in different rounds, etc etc. The score boost for sequences of keys simply doesn't need to deal with the concept of rounds at all.

larryhastings commented 2 months ago

Given the API, it's possible to map multiple sequences of keys to the same value. I don't think there's any way around it: for every match we consider, matching value X from dataset 1 to value Y from dataset 2, we need to try every sequence of keys that maps to X against every sequence of keys that maps to Y. If there are 3 sequences of keys that map to X, and 4 sequences of keys that map to Y, we need to try all 12 possible combinations.
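
In code terms that's just a cartesian product--here with made-up placeholder data, purely to illustrate the counting:

    from itertools import product

    # hypothetical: three sequences of keys map to X, four map to Y
    key_sequences_for_x = [["a", "b"], ["c", "d", "e"], ["f"]]
    key_sequences_for_y = [["a"], ["b", "c"], ["d", "e", "f"], ["g"]]

    pairs_to_try = list(product(key_sequences_for_x, key_sequences_for_y))
    assert len(pairs_to_try) == 12   # 3 * 4 combinations to score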

Of course, obviously we throw away:

We try all the combinations, then we use the "greedy algorithm" (and maybe the "match boiler") to find any we want to keep.

BTW, if the two values in the match have multiple distinct longest common subsequences of keys in common, I think you use all of them. They all add to the final score for the match.

(By "distinct", I mean, you can't reuse sequences of keys. If one sequence of keys mapped to X has a nonzero score with two different sequences of keys mapped to Y, you can't use the sequence of keys mapped to X twice. But if three different sequences of keys mapped to X map appropriately to three different sequences of keys mapped to Y, you can use all of 'em.)