protomaps / PMTiles

Cloud-optimized + compressed single-file tile archives for vector and raster maps
https://protomaps.com/docs/pmtiles/
BSD 3-Clause "New" or "Revised" License

Hash function collisions / hash width for tile deduplication #20

Closed · bdon closed this issue 2 years ago

bdon commented 2 years ago

(edited as summary) The current Python pmtiles-convert program deduplicates tiles by hashing each one with the hash builtin (https://docs.python.org/3/library/functions.html#hash), which produces a 64-bit value on 64-bit systems.

A typical basemap use case has about 40 million unique tile contents for zooms 0-14; zoom 15 has 100-200 million unique tile contents.

See this article on Hash Collision Probabilities for the formulas.

A collision between two hash values is unacceptable because that means the PMTiles archive will replace the collided tile with the wrong contents.

The approximate "birthday paradox" probability of a collision, given a (perfectly distributed) hash function with n possible values and k values being hashed, is k^2 / (2n).

For the general use case of 100M+ unique tiles, the resulting collision probability is unacceptably high. A general-purpose deduplication implementation should use 128-bit hashes.
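As a quick sanity check (my sketch, not part of the original thread), plugging the tile counts above into k^2 / (2n):

package main

import (
	"fmt"
	"math"
)

// Birthday-bound approximation: p ≈ k^2 / (2n) for k hashed values
// drawn from n possible hash outputs.
func collisionProb(k float64, hashBits int) float64 {
	n := math.Pow(2, float64(hashBits))
	return k * k / (2 * n)
}

func main() {
	for _, k := range []float64{40e6, 400e6} {
		for _, bits := range []int{64, 128} {
			fmt.Printf("k = %.0e, %d-bit hash: p ≈ %.2g\n", k, bits, collisionProb(k, bits))
		}
	}
}

For 40 million unique tiles a 64-bit hash gives p ≈ 4e-5; for 400 million it gives roughly 0.4%, while a 128-bit hash makes the probability negligible in both cases.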

bdon commented 2 years ago

Test script to find fnv1a collisions:

package main

import (
	"bytes"
	"hash/fnv"
	"log"
	"os"

	"zombiezen.com/go/sqlite"
)

func main() {
	// Open the SQLite (MBTiles) archive given as the first argument.
	conn, err := sqlite.OpenConn(os.Args[1], sqlite.OpenReadOnly)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Map each 64-bit FNV-1a sum to the length of the first tile that
	// produced it. A repeated sum with a different length is a definite
	// collision; matching lengths cannot rule one out, but this is a
	// cheap first check.
	hashToLen := make(map[uint64]int)

	stmt, _, err := conn.PrepareTransient("SELECT tile_data FROM tiles")
	if err != nil {
		log.Fatal(err)
	}
	defer stmt.Finalize()

	for {
		row, err := stmt.Step()
		if err != nil {
			log.Fatal(err)
		}
		if !row {
			break
		}

		// Read the tile blob out of the current row.
		buf := new(bytes.Buffer)
		if _, err := buf.ReadFrom(stmt.ColumnReader(0)); err != nil {
			log.Fatal(err)
		}
		data := buf.Bytes()

		// Hash the tile contents with 64-bit FNV-1a.
		hsh := fnv.New64a()
		hsh.Write(data)
		sum := hsh.Sum64()

		if length, ok := hashToLen[sum]; ok {
			if length != len(data) {
				log.Fatalf("collision: hash %d seen with lengths %d and %d", sum, length, len(data))
			}
		} else {
			hashToLen[sum] = len(data)
		}
	}
}
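To run it against an MBTiles archive (the file names here are illustrative):

go run main.go tiles.mbtiles
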
bdon commented 2 years ago

https://stackoverflow.com/questions/22029012/probability-of-64bit-hash-code-collisions

Given that the probability of a 64-bit collision reaches 0.5 at about 4 billion unique tiles (k = 2^32 and n = 2^64 give k^2 / (2n) = 0.5, assuming a perfectly distributed hash), and worst-case usage can have ~400 million unique tiles (imagine a tileset covering zooms 0-15 where every tile is unique), it would make sense to use 128-bit hashes if the additional cost is reasonable.

bdon commented 2 years ago

Running the deduplication check over a big archive with 64-bit fnv1a takes 9 minutes; with 128-bit fnv1a it takes 14 minutes.

Memory costs double, of course.
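
A minimal sketch of the 128-bit variant using the standard library's fnv.New128a (the [16]byte map key is my assumption about how the sums would be stored, not the actual converter code):

package main

import (
	"fmt"
	"hash/fnv"
)

func main() {
	// A 128-bit digest does not fit in a uint64, so use a
	// fixed-size [16]byte array as the map key.
	seen := make(map[[16]byte]int)

	tiles := [][]byte{
		[]byte("tile contents A"),
		[]byte("tile contents B"),
		[]byte("tile contents A"), // duplicate
	}

	for i, data := range tiles {
		h := fnv.New128a()
		h.Write(data)

		var key [16]byte
		copy(key[:], h.Sum(nil)) // Sum appends the 16-byte digest

		if _, ok := seen[key]; ok {
			fmt.Printf("tile %d deduplicated\n", i)
			continue
		}
		seen[key] = len(data)
	}
}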

nyurik commented 2 years ago

Would it make sense to compare hash + length, and treat that combination as unique enough?

bdon commented 2 years ago

@nyurik you would need 32 bits for the length, so that's not much different from just using a 96-bit hash, which in practice is not much different from 128.
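
For illustration, a hypothetical sketch of the hash + length key being discussed (not the converter's actual implementation); the struct below is effectively a 96-bit identifier:

package main

import (
	"fmt"
	"hash/fnv"
)

// tileKey pairs the 64-bit FNV-1a sum with the tile length,
// for roughly 96 bits of discriminating power.
type tileKey struct {
	sum    uint64
	length uint32
}

func main() {
	dedup := make(map[tileKey]struct{})

	for _, data := range [][]byte{
		[]byte("tile A"),
		[]byte("tile B"),
		[]byte("tile A"), // duplicate
	} {
		h := fnv.New64a()
		h.Write(data)
		key := tileKey{sum: h.Sum64(), length: uint32(len(data))}

		if _, ok := dedup[key]; ok {
			fmt.Println("duplicate contents")
			continue
		}
		dedup[key] = struct{}{}
	}
}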