Closed bdon closed 2 years ago
Test script to find fnv1a collisions:
package main
import (
"bytes"
"hash/fnv"
"log"
"os"
"zombiezen.com/go/sqlite"
)
func main() {
conn, err := sqlite.OpenConn(os.Args[1], sqlite.OpenReadOnly)
if err != nil {
log.Fatal(err)
}
defer conn.Close()
hsh_to_len := make(map[uint64]int)
{
stmt, _, err := conn.PrepareTransient("SELECT tile_data FROM tiles")
if err != nil {
log.Fatal(err)
}
defer stmt.Finalize()
for {
row, err := stmt.Step()
if err != nil {
log.Fatal(err)
}
if !row {
break
}
reader := stmt.ColumnReader(0)
buf := new(bytes.Buffer)
buf.ReadFrom(reader)
data := buf.Bytes()
hsh := fnv.New64a()
hsh.Write(data)
sum := hsh.Sum64()
if val, ok := hsh_to_len[sum]; ok {
if val != len(data) {
log.Fatal("error, length mismatch", sum, val, len(data))
}
} else {
hsh_to_len[sum] = len(data)
}
}
}
}
https://stackoverflow.com/questions/22029012/probability-of-64bit-hash-code-collisions
Given that the probability of a 64-bit collision with 4 billion unique tiles is 0.5 (assuming a perfectly distributed hash), and worst-case usage can have ~400 million unique tiles (imagine a tileset zooms 0-15 where every tile is unique), it would make sense to use 128-bit hashes if the additional cost is reasonable
Running the deduplication check over a big archive with 64 bit fnv1a takes 9 minutes; with 128 bit fnv1a it takes 14 minutes.
Memory costs double of course
would it make sense to compare hash + length, and treat that as a unique-enough content?
@nyurik you will need 32 bits for the length, so that's not that different from just using a 96 bit hash; in practice not much different than 128
(edited as summary) The current Python
pmtiles-convert
program deduplicates by hashing each using thehash
builtin: https://docs.python.org/3/library/functions.html#hash which produces a 64-bit value on 64-bit systems.A typical basemap use case has about 40 million unique tile contents for zooms 0-14; zoom 15 has 100-200 million unique tile contents.
See this on Hash Collision Probabilities for formulas.
A collision between two hash values is unacceptable because that means the PMTiles archive will replace the collided tile with the wrong contents.
The approximate "birthday paradox" probability of a collision giving a (perfectly distributed) hash function of possible values
n
where there arek
values being hashed is:k^2 / 2n
.The probability of collisions in these cases is unacceptably high for the general use case of 100m+ tiles. A general-purpose deduplication implementation should use 128 bit hashes.