lichess-org / lila-openingexplorer

Opening explorer for lichess.org that can handle all the variants and trillions of unique positions
http://lichess.org/analysis#explorer
GNU Affero General Public License v3.0
135 stars 34 forks source link

filter duplicates #8

Closed niklasf closed 8 years ago

niklasf commented 8 years ago

We could use something like a bloom filter on the game id to detect duplicates. Probably this can't happen anyway, though.

niklasf commented 8 years ago

The parameter estimation of https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/util/BloomFilter.scala might be broken, as it is overloaded to early. Investigate.

Edit: Or the hashing function is too bad, as it relies on ##.

ornicar commented 8 years ago

why not just ask kyoto if the id exists?

niklasf commented 8 years ago

We don't have the data in Kyoto (yet). The closest thing would be checking if the most specific (last stored) position already exists, but many high quality games do not branch enough to be unique at our current MAX_PLIES. Would be a shame to specifically penalize games that follow theory very deeply.

niklasf commented 8 years ago

ornicar: have you considered http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/hash/BloomFilter.html ornicar: see that too https://github.com/Baqend/Orestes-Bloomfilter

niklasf commented 8 years ago

I like https://github.com/Baqend/Orestes-Bloomfilter in interface as well as implementation.

Compare:

Round 140000000: 2269445 false positives (0.016210321428571428) | breeze, best possible due too bad hashing
Round 140000000: 16839 false positives (1.2027857142857143E-4)  | Orestes, tuned down to target probability 0.001
ornicar commented 8 years ago

Great!