sts10 / tidy

Combine and clean word lists
https://sts10.github.io/2021/12/09/tidy-0-2-0.html
MIT License

Implement algorithm to guarantee list is uniquely decodable while cutting fewest number of words #13

Closed sts10 closed 2 years ago

sts10 commented 2 years ago

Removing all prefix words and removing all suffix words guarantees that the resulting list will be what's called a uniquely decodable code.
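
As a toy illustration of why this matters (made-up words, not drawn from any actual list): if a list contains a prefix word, two different word sequences can collapse into the same separator-less passphrase.

fn main() {
    // "news" is a prefix of "newspaper", so this toy list is not prefix-free.
    // Without a separator, the same string has two valid parses.
    let parse_one = ["news", "paperback"].concat();
    let parse_two = ["newspaper", "back"].concat();
    assert_eq!(parse_one, parse_two); // both are "newspaperback"
}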

However, it seems very possible that there are other procedures to take on a given list to guarantee that it becomes a uniquely decodable code but removes fewer words than cutting all prefix or suffix words would. What would such a procedure look like?

One clue is the Sardinas–Patterson algorithm. I've been trying to implement Sardinas-Patterson in Rust, but haven't quite grokked it yet.

Note that Sardinas-Patterson seems to only provide a true/false answer as to whether a given code is uniquely decodable. While that boolean would be nice to add to the list of attributes Tidy can display about a list, we'd ideally also have a procedure for removing the fewest number of words to create a uniquely decodable code.

Any insight into these questions is welcome, either here or on the other repo.

More potentially useful links

bwbug commented 2 years ago

I'm wondering if the Kraft-McMillan inequality may be useful? It should at least be easy to calculate!
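
Something like this rough Rust sketch would compute it (the function name is just a placeholder, not anything in Tidy): for an alphabet of B symbols and word lengths l_1, l_2, ..., the sum is B^(-l_1) + B^(-l_2) + ..., and a uniquely decodable code must keep that sum at or below 1.

// Kraft-McMillan sum over an alphabet of `b` symbols: the sum of b^(-length)
// for every word on the list. A uniquely decodable code must have a sum <= 1
// (though a sum <= 1 does not by itself prove unique decodability).
fn kraft_mcmillan_sum_with_alphabet(list: &[String], b: usize) -> f64 {
    list.iter()
        .map(|word| (b as f64).powi(-(word.chars().count() as i32)))
        .sum()
}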

Leaving this link for future reference, since it seemed to contain some interesting ideas: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.9108&rep=rep1&type=pdf

sts10 commented 2 years ago

Oooo didn't know about this one. Love learning new theorems. Definitely easier to implement (I did it in the Rust Playground first).

Interestingly, it evaluates to true even for word lists that have prefix words in them (like the 2021 1Password word list).

Attributes of new list
----------------------
List length               : 18176 words
Mean word length          : 6.20 characters
Length of shortest word   : 3 characters (abe)
Length of longest word    : 8 characters (zwieback)
Free of prefix words      : false
Free of suffix words      : false
Entropy per word          : 14.150 bits
Efficiency per character  : 2.283 bits
Assumed entropy per char  : 4.717 bits
Above brute force line    : false
Above Shannon line        : false
Shortest edit distance    : 1
Mean edit distance        : 6.168
Longest shared prefix     : 7
Unique character prefix   : 8
McMillan Inequality       : true

In other words, 0.03442809288440331 <= 1.

I think this makes sense. I gather that the theory says any uniquely decodable code will satisfy the inequality. It does not say that all non-uniquely decodable codes will not satisfy the inequality.

So really the inequality would only tell us if a list is definitely not uniquely decodable (if it evaluates to false, i.e. the sum exceeds 1). (I wonder if, given the long word lengths on the lists we deal with, it would ever evaluate to false...) I guess that's kind of useful? But not quite the silver bullet I'm looking for.

bwbug commented 2 years ago

Yea, I'm still trying to figure out exactly what the theorem says, and how to apply it correctly! It is a necessary and sufficient condition for the existence of a uniquely decodable or prefix code (with the same distribution of word lengths as the code you tested), but it doesn't guarantee that a code passing the test actually is a prefix code or uniquely decodable.

Note that the alphabet size that appears in the summation should be the number of distinct characters actually present in the word list, not the size of the character set from which the words were generated, or you will get misleading results. For example, for a code {a,b,ab}, if you take the alphabet to contain 26 letters, you would get

26^-1 + 26^-1 + 26^-2 ≈ 0.0784 < 1

but if you take the alphabet to contain only 2 letters, you get

2^-1 + 2^-1 + 2^-2 = 1.25 > 1

So the conclusions that can be drawn are threefold:

  1. The actual code {a,b,ab} is not a prefix code.
  2. It is impossible to make a prefix code that has a length distribution {1,1,2} using any 2-letter alphabet.
  3. It is possible to make a prefix code that has a length distribution {1,1,2} using the full 26-letter alphabet (e.g., {c,g,mx}).
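
To make that arithmetic concrete, here is the toy code {a,b,ab} run through the sketch from my earlier comment (placeholder names again):

fn main() {
    let code: Vec<String> = ["a", "b", "ab"].iter().map(|s| s.to_string()).collect();
    // Alphabet taken to be all 26 letters: 26^-1 + 26^-1 + 26^-2 ≈ 0.0784
    println!("{:.4}", kraft_mcmillan_sum_with_alphabet(&code, 26));
    // Alphabet taken to be just {a, b}: 2^-1 + 2^-1 + 2^-2 = 1.25
    println!("{:.2}", kraft_mcmillan_sum_with_alphabet(&code, 2));
}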

There may be some benefit to quickly computing the K-M inequality before jumping into the S-P algo: if the sum exceeds 1, it tells us the code is not uniquely decodable without having to run the S-P algorithm at all. However, if the sum does not exceed 1, the result is inconclusive, and you have to fall back to the S-P algo to determine whether the code is uniquely decodable.

The K-M inequality also provides some guidance for cases in which the sum does exceed 1. In those situations, you get the most bang for your buck by eliminating the shortest prefixes first.
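
As a sketch of that ordering (hypothetical name, reusing the kraft_mcmillan_sum_with_alphabet sketch from above): Some(false) means the sum already rules the list out, while None means the result is inconclusive and the full S-P test still has to run.

// Cheap pre-check before the full Sardinas-Patterson test.
fn kraft_mcmillan_precheck(list: &[String], alphabet_size: usize) -> Option<bool> {
    if kraft_mcmillan_sum_with_alphabet(list, alphabet_size) > 1.0 {
        Some(false) // K-M sum exceeds 1: definitely not uniquely decodable
    } else {
        None // inconclusive: still need Sardinas-Patterson
    }
}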

bwbug commented 2 years ago

> Note that the alphabet size that appears in the summation should be the number of distinct characters actually present in the word list

OK, I just saw in your playground code that you are doing this:

let B = count_unique_characters(list);

sts10 commented 2 years ago

OK, big step today: We now have a "Uniquely decodable?" boolean in the Attributes that Tidy prints.

I did this by implementing a version of the Sardinas-Patterson algorithm.

The work is located in the new uniquely_decodable.rs file. Huge thanks to @danhales and his excellent Jupyter Notebook on the subject, which I followed literally line-by-line.
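
For anyone following along, the core idea is roughly the following (a simplified sketch of the algorithm, not the actual contents of uniquely_decodable.rs, with illustrative names): repeatedly collect the "dangling suffixes" left over when one word is a prefix of another, and flag the list as not uniquely decodable if any dangling suffix is itself a code word.

use std::collections::HashSet;

// Dangling suffixes between two sets of words: whenever a word in one set is a
// proper prefix of a word in the other, keep the leftover tail.
fn dangling_suffixes(a: &HashSet<String>, b: &HashSet<String>) -> HashSet<String> {
    let mut suffixes = HashSet::new();
    for x in a {
        for y in b {
            if x == y {
                continue;
            }
            if let Some(tail) = y.strip_prefix(x.as_str()) {
                suffixes.insert(tail.to_string());
            }
            if let Some(tail) = x.strip_prefix(y.as_str()) {
                suffixes.insert(tail.to_string());
            }
        }
    }
    suffixes
}

// Sardinas-Patterson test: the list is uniquely decodable if and only if no
// round of dangling suffixes ever contains one of the original code words.
fn sp_is_uniquely_decodable(list: &[String]) -> bool {
    let code: HashSet<String> = list.iter().cloned().collect();
    let mut current = dangling_suffixes(&code, &code);
    let mut seen: HashSet<String> = HashSet::new();
    loop {
        if current.iter().any(|suffix| code.contains(suffix)) {
            return false; // a dangling suffix is itself a code word: ambiguity
        }
        if current.is_empty() || current.is_subset(&seen) {
            return true; // no new suffixes can appear: uniquely decodable
        }
        seen.extend(current.iter().cloned());
        current = dangling_suffixes(&code, &current);
    }
}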

I definitely need to write tests for my function and eventually speed it up, but I'm happy with it for now.

The second half of our project here is to come up with some sort of procedure for preserving the greatest number of words while ensuring a word list is uniquely decodable. Unfortunately, the process of implementing Sardinas-Patterson didn't offer me any Eureka moments about that.

bwbug commented 2 years ago

Congrats, that's great! Now I feel a little bad about the issue I just posted in your other repo...

sts10 commented 2 years ago

Got a naive approach to the pruning algorithm in place. It basically uses the Sardinas-Patterson algorithm, but instead of the final output being a boolean of whether the inputted list is uniquely decodable or not, it returns a list of what I'm calling the "final intersection" of code words.

The reason we're interested in this "final intersection" is that if it's empty, the list is proven to be uniquely decodable. So all I did was write a function that finds the words in this final intersection and then removes them from the inputted list.

I'm pretty excited about it, so I named it after myself!

fn schlinkert_prune(list: &[String]) -> Vec<String> {
    // Words that end up in the Sardinas-Patterson "final intersection" are the offenders
    let offenders_to_remove = get_sardinas_patterson_final_intersection(&list);
    let mut new_list = list.to_owned();
    // Keep only the words that are not offenders
    new_list.retain(|x| !offenders_to_remove.contains(x));
    new_list
}

How it stacks up so far

I tried it on both the 1Password 2021 list and my "basic" word list. So far so good!

Room for improvement

I think my current approach is a bit naive, in that it waits until the very end of the Sardinas-Patterson process to produce the list of words to remove. I think if it somehow traced words back to the original "offenders", it might be able to save even more words. Just a guess though.

Blog post

I also wrote a blog post about this feature.