oadam opened this issue 10 years ago · Status: Open
This puts unnecessary constraints on the Bloom filter hash functions. Instead:

1. Scan the whole dictionary and figure out good hash-function candidates to build a 32-bit Bloom filter.
2. Hash the input, then iterate over an `int32[wordCount]` slice of precomputed word hashes, testing each word with a bitwise check: every bit set in the word's hash must also be set in the input's hash.
3. When the bitwise check passes, perform the full check.
4. Return the first 100 matching words.
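A minimal Go sketch of that query loop, assuming one precomputed `uint32` hash per dictionary word; `fullCheck` is a hypothetical stand-in for the exact multiset comparison:

```go
package main

import "fmt"

// bloomQuery returns up to limit indices of dictionary words whose 32-bit
// signature could be contained in the input, confirming each candidate
// with fullCheck (a stand-in for the exact multiset comparison).
func bloomQuery(inputHash uint32, wordHashes []uint32, fullCheck func(i int) bool, limit int) []int {
	var out []int
	for i, wh := range wordHashes {
		// The word can only be built from the input if every bit set in
		// its hash is also set in the input's hash: wh &^ inputHash == 0.
		if wh&^inputHash == 0 && fullCheck(i) {
			out = append(out, i)
			if len(out) == limit {
				break
			}
		}
	}
	return out
}

func main() {
	hashes := []uint32{0b001, 0b011, 0b100}
	matches := bloomQuery(0b011, hashes, func(int) bool { return true }, 100)
	fmt.Println(matches) // indices whose hash bits are a subset of 0b011
}
```

Go's `&^` (AND NOT) makes the subset-of-bits test a single instruction per word, which is what keeps the full scan cheap.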
The filter should return true for a word and all of its supermultisets. The only functions that behave this way are of the form "w is a supermultiset of k". The problem is therefore to find 32 nice key multisets to build our hash functions from.
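A sketch of such a hash, with multisets represented as rune counts (the two sample keys are illustrative, not proposed values):

```go
package main

import "fmt"

// Multiset counts how many times each rune occurs in a word.
type Multiset map[rune]int

func toMultiset(w string) Multiset {
	m := Multiset{}
	for _, r := range w {
		m[r]++
	}
	return m
}

// contains reports whether w is a supermultiset of k.
func contains(w, k Multiset) bool {
	for r, n := range k {
		if w[r] < n {
			return false
		}
	}
	return true
}

// hash sets bit i when the word contains key multiset keys[i]. Because
// each bit is monotone under multiset inclusion, the bits of hash(w)
// are a subset of the bits of hash(q) whenever q is a supermultiset of w.
func hash(w Multiset, keys []Multiset) uint32 {
	var h uint32
	for i, k := range keys {
		if contains(w, k) {
			h |= 1 << uint(i)
		}
	}
	return h
}

func main() {
	keys := []Multiset{{'e': 1}, {'a': 2}} // two illustrative key multisets
	fmt.Printf("%b %b\n", hash(toMultiset("tree"), keys), hash(toMultiset("banana"), keys))
}
```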
A good way would be to add these key multisets one by one. New multisets are found by a simulated annealing that maximizes the entropy of the hashes of the whole dictionary once the new bit is added. The current candidate is mutated by either a deletion or an addition (adding more frequent letters more often), and the mutated candidate is kept or discarded based on its score and the current temperature of the annealing.
Words should be seen as multisets of runes. Iterate a first time over the dictionary to build the multisets. At the same time keep track of how many words contain 3 "a"s or more, how many contain 1 "e" or more, and so on. We should notice that containing 4 "z"s is very rare. Then build an array, sorted from longest word to shortest, that we will full-scan at each query. This should be an array of small structs to avoid cache misses. In each small struct put the word, a pointer to the multiset, and an array of the N rarest features of this word. The objective is to discard a word as fast as possible without iterating through its whole multiset.
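A sketch of that struct layout for lowercase a–z words, with N=3 rare features per entry (N and the field names are assumptions):

```go
package main

// feature means "the word contains at least Count copies of Letter".
type feature struct {
	Letter byte
	Count  uint8
}

// entry is deliberately small so the longest-to-shortest full scan
// stays cache-friendly; the full multiset sits behind a pointer and
// is only dereferenced as a last resort.
type entry struct {
	word   string
	counts *[26]uint8 // full letter multiset, checked last
	rarest [3]feature // the N rarest features of this word (N=3 assumed)
}

// countLetters builds the letter multiset of a lowercase a-z word.
func countLetters(s string) *[26]uint8 {
	var c [26]uint8
	for i := 0; i < len(s); i++ {
		c[s[i]-'a']++
	}
	return &c
}

// matches tests the rare features first so that most non-matching
// words are discarded without touching the full multiset.
func (e *entry) matches(query *[26]uint8) bool {
	for _, f := range e.rarest {
		if f.Count == 0 {
			continue // unused slot
		}
		if query[f.Letter-'a'] < f.Count {
			return false
		}
	}
	for i, c := range e.counts {
		if query[i] < c {
			return false
		}
	}
	return true
}

func main() {
	e := entry{
		word:   "jazz",
		counts: countLetters("jazz"),
		rarest: [3]feature{{'z', 2}, {'j', 1}, {'a', 1}},
	}
	_ = e.matches(countLetters("jazzman")) // true: jazzman covers jazz
}
```

Storing the rarest features inline means a query like "pizza" rejects "jazz" on the second comparison (no "j") without ever following the `counts` pointer.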